29. Pandas的数据分组-groupby函数
在SQL语言里有group by功能,在Pandas里有groupby函数与之功能相对应。DataFrame数据对象经groupby()之后有ngroups和groups等属性,本质是DataFrame类的子类DataFrameGroupBy的实例对象。ngroups反应的是分组的个数,而groups类似dict结构,key是分组的index或label,value则为index或label所对应的分组数据。size函数则是可以返回所有分组的字节大小。count函数可以统计分组后各列数据项个数。get_group函数可以返回指定组的数据信息。而discribe函数可以返回分组后的数据的统计数据。
- 基于单列的分组。
import pandas as pd
idx = [101,101,101,102,102,102,103,103,103]
name = ["apple","pearl","orange", "apple","pearl","orange","apple","pearl","orange"]
price = [1.0,2.0,3.0,4.00,5.0,6.0,7.0,8.0,9.0]
df0 = pd.DataFrame({ "fruit": name, "price" : price, "supplier" :idx})
print "*" * 30
print df0
dg = df0.groupby("fruit")
print "*" * 30
for n, g in dg:
print "group_name:", n, "\n|",g,"|"
print "*" * 30
for n,_ in dg:
print "group_name:", n, "\n|",dg.get_group(n),"|"
print "*" * 30
print dg.describe()
print "*" * 30
程序执行结果:
******************************
fruit price supplier
0 apple 1 101
1 pearl 2 101
2 orange 3 101
3 apple 4 102
4 pearl 5 102
5 orange 6 102
6 apple 7 103
7 pearl 8 103
8 orange 9 103
******************************
group_name: apple
| fruit price supplier
0 apple 1 101
3 apple 4 102
6 apple 7 103 |
group_name: orange
| fruit price supplier
2 orange 3 101
5 orange 6 102
8 orange 9 103 |
group_name: pearl
| fruit price supplier
1 pearl 2 101
4 pearl 5 102
7 pearl 8 103 |
******************************
group_name: apple
| fruit price supplier
0 apple 1 101
3 apple 4 102
6 apple 7 103 |
group_name: orange
| fruit price supplier
2 orange 3 101
5 orange 6 102
8 orange 9 103 |
group_name: pearl
| fruit price supplier
1 pearl 2 101
4 pearl 5 102
7 pearl 8 103 |
******************************
price supplier
fruit
apple count 3.0 3.0
mean 4.0 102.0
std 3.0 1.0
min 1.0 101.0
25% 2.5 101.5
50% 4.0 102.0
75% 5.5 102.5
max 7.0 103.0
orange count 3.0 3.0
mean 6.0 102.0
std 3.0 1.0
min 3.0 101.0
25% 4.5 101.5
50% 6.0 102.0
75% 7.5 102.5
max 9.0 103.0
pearl count 3.0 3.0
mean 5.0 102.0
std 3.0 1.0
min 2.0 101.0
25% 3.5 101.5
50% 5.0 102.0
75% 6.5 102.5
max 8.0 103.0
******************************
- 多列数据分组。在groupby函数内给出一个列表,列表里列出多列列名。
import pandas as pd
idx = [101,101,101,102,102,102,103,103,103]
name = ["apple","pearl","orange", "apple","pearl","orange","apple","pearl","orange"]
price = [1.0,2.0,3.0,4.00,5.0,6.0,7.0,8.0,9.0]
df0 = pd.DataFrame({ "fruit": name, "price" : price, "supplier" :idx})
print "*" * 30
print df0
print "*" * 30
dg2 = df0.groupby(["fruit", "supplier"])
for n, g in dg2:
print "multiGroup on:", n, "\n|",g ,"|"
print "*" * 30
程序里基于["fruit", "supplier"]
两列进行分组。执行结果如下:
******************************
fruit price supplier
0 apple 1 101
1 pearl 2 101
2 orange 3 101
3 apple 4 102
4 pearl 5 102
5 orange 6 102
6 apple 7 103
7 pearl 8 103
8 orange 9 103
******************************
multiGroup on: ('apple', 101)
| fruit price supplier
0 apple 1 101 |
multiGroup on: ('apple', 102)
| fruit price supplier
3 apple 4 102 |
multiGroup on: ('apple', 103)
| fruit price supplier
6 apple 7 103 |
multiGroup on: ('orange', 101)
| fruit price supplier
2 orange 3 101 |
multiGroup on: ('orange', 102)
| fruit price supplier
5 orange 6 102 |
multiGroup on: ('orange', 103)
| fruit price supplier
8 orange 9 103 |
multiGroup on: ('pearl', 101)
| fruit price supplier
1 pearl 2 101 |
multiGroup on: ('pearl', 102)
| fruit price supplier
4 pearl 5 102 |
multiGroup on: ('pearl', 103)
| fruit price supplier
7 pearl 8 103 |
******************************
- 使用set_index进行分组。
import pandas as pd
idx = [101,101,101,102,102,102,103,103,103]
name = ["apple","pearl","orange", "apple","pearl","orange","apple","pearl","orange"]
price = [1.0,2.0,3.0,4.00,5.0,6.0,7.0,8.0,9.0]
df0 = pd.DataFrame({ "fruit": name, "price" : price, "supplier" :idx})
print "*" * 30
print df0
print "*" * 30
dg2 = df0.groupby(["fruit", "supplier"])
for n, g in dg2:
print "multiGroup on:", n, "\n|",g ,"|"
print "*" * 30
si = df0.set_index(["fruit", "supplier"])
print si
print "*" * 30
程序的执行结果:
******************************
fruit price supplier
0 apple 1 101
...
8 orange 9 103
******************************
multiGroup on: ('apple', 101)
| fruit price supplier
0 apple 1 101 |
...
multiGroup on: ('pearl', 103)
| fruit price supplier
7 pearl 8 103 |
******************************
price
fruit supplier
apple 101 1
...
orange 103 9
******************************