29. Pandas的数据分组-groupby函数

在SQL语言里有group by功能,在Pandas里有groupby函数与之功能相对应。DataFrame数据对象经groupby()之后有ngroups和groups等属性,本质是DataFrame类的子类DataFrameGroupBy的实例对象。ngroups反应的是分组的个数,而groups类似dict结构,key是分组的index或label,value则为index或label所对应的分组数据。size函数则是可以返回所有分组的字节大小。count函数可以统计分组后各列数据项个数。get_group函数可以返回指定组的数据信息。而discribe函数可以返回分组后的数据的统计数据。

  • 基于单列的分组。
import pandas as pd
idx = [101,101,101,102,102,102,103,103,103]
name = ["apple","pearl","orange", "apple","pearl","orange","apple","pearl","orange"]
price = [1.0,2.0,3.0,4.00,5.0,6.0,7.0,8.0,9.0]
df0 = pd.DataFrame({ "fruit": name, "price" : price, "supplier" :idx})
print "*" * 30
print df0
dg =  df0.groupby("fruit")
print "*" * 30
for n, g in dg:
    print "group_name:", n, "\n|",g,"|" 
print "*" * 30
for n,_ in dg:
    print "group_name:", n, "\n|",dg.get_group(n),"|" 
print "*" * 30
print dg.describe()
print "*" * 30

程序执行结果:

******************************
    fruit  price  supplier
0   apple      1       101
1   pearl      2       101
2  orange      3       101
3   apple      4       102
4   pearl      5       102
5  orange      6       102
6   apple      7       103
7   pearl      8       103
8  orange      9       103
******************************
group_name: apple 
|    fruit  price  supplier
0  apple      1       101
3  apple      4       102
6  apple      7       103 |
group_name: orange 
|     fruit  price  supplier
2  orange      3       101
5  orange      6       102
8  orange      9       103 |
group_name: pearl 
|    fruit  price  supplier
1  pearl      2       101
4  pearl      5       102
7  pearl      8       103 |
******************************
group_name: apple 
|    fruit  price  supplier
0  apple      1       101
3  apple      4       102
6  apple      7       103 |
group_name: orange 
|     fruit  price  supplier
2  orange      3       101
5  orange      6       102
8  orange      9       103 |
group_name: pearl 
|    fruit  price  supplier
1  pearl      2       101
4  pearl      5       102
7  pearl      8       103 |
******************************
              price  supplier
fruit                        
apple  count    3.0       3.0
       mean     4.0     102.0
       std      3.0       1.0
       min      1.0     101.0
       25%      2.5     101.5
       50%      4.0     102.0
       75%      5.5     102.5
       max      7.0     103.0
orange count    3.0       3.0
       mean     6.0     102.0
       std      3.0       1.0
       min      3.0     101.0
       25%      4.5     101.5
       50%      6.0     102.0
       75%      7.5     102.5
       max      9.0     103.0
pearl  count    3.0       3.0
       mean     5.0     102.0
       std      3.0       1.0
       min      2.0     101.0
       25%      3.5     101.5
       50%      5.0     102.0
       75%      6.5     102.5
       max      8.0     103.0
******************************
  • 多列数据分组。在groupby函数内给出一个列表,列表里列出多列列名。
import pandas as pd
idx = [101,101,101,102,102,102,103,103,103]
name = ["apple","pearl","orange", "apple","pearl","orange","apple","pearl","orange"]
price = [1.0,2.0,3.0,4.00,5.0,6.0,7.0,8.0,9.0]
df0 = pd.DataFrame({ "fruit": name, "price" : price, "supplier" :idx})
print "*" * 30
print df0
print "*" * 30
dg2 =  df0.groupby(["fruit", "supplier"])
for n, g in dg2:
    print "multiGroup on:", n, "\n|",g ,"|"
print "*" * 30

程序里基于["fruit", "supplier"]两列进行分组。执行结果如下:

******************************
    fruit  price  supplier
0   apple      1       101
1   pearl      2       101
2  orange      3       101
3   apple      4       102
4   pearl      5       102
5  orange      6       102
6   apple      7       103
7   pearl      8       103
8  orange      9       103
******************************
multiGroup on: ('apple', 101) 
|    fruit  price  supplier
0  apple      1       101 |
multiGroup on: ('apple', 102) 
|    fruit  price  supplier
3  apple      4       102 |
multiGroup on: ('apple', 103) 
|    fruit  price  supplier
6  apple      7       103 |
multiGroup on: ('orange', 101) 
|     fruit  price  supplier
2  orange      3       101 |
multiGroup on: ('orange', 102) 
|     fruit  price  supplier
5  orange      6       102 |
multiGroup on: ('orange', 103) 
|     fruit  price  supplier
8  orange      9       103 |
multiGroup on: ('pearl', 101) 
|    fruit  price  supplier
1  pearl      2       101 |
multiGroup on: ('pearl', 102) 
|    fruit  price  supplier
4  pearl      5       102 |
multiGroup on: ('pearl', 103) 
|    fruit  price  supplier
7  pearl      8       103 |
******************************
  • 使用set_index进行分组。
import pandas as pd
idx = [101,101,101,102,102,102,103,103,103]
name = ["apple","pearl","orange", "apple","pearl","orange","apple","pearl","orange"]
price = [1.0,2.0,3.0,4.00,5.0,6.0,7.0,8.0,9.0]
df0 = pd.DataFrame({ "fruit": name, "price" : price, "supplier" :idx})
print "*" * 30
print df0
print "*" * 30
dg2 =  df0.groupby(["fruit", "supplier"])
for n, g in dg2:
    print "multiGroup on:", n, "\n|",g ,"|"
print "*" * 30
si = df0.set_index(["fruit", "supplier"])
print si
print "*" * 30

程序的执行结果:

******************************
    fruit  price  supplier
0   apple      1       101
...
8  orange      9       103
******************************
multiGroup on: ('apple', 101) 
|    fruit  price  supplier
0  apple      1       101 |
...
multiGroup on: ('pearl', 103) 
|    fruit  price  supplier
7  pearl      8       103 |
******************************
                 price
fruit  supplier       
apple  101           1
...
orange 103           9
******************************