Python For Data Analysis-八章第一节
《Python For Data Analysis》的第八章主要讨论的是数据的组合、合并和变形等问题。
17.1 层次化索引
层次化索引(hierarchical indexing)是pandas的一项重要功能, 有点像表格的合并单元格,它使你能在一个轴上拥有多个(两个以上)索引级别。有了层次化索引之后,可以很容易的访问分块数据以及做一些基于组的操作(group-based),比如做一个数据透视表(pivot table)。
以某学生的课表为例,可以很好的解释层次化索引结构的作用。
上图是下面的程序最后导出的csv文件用excel类软件打开的结果。这是一张课程表timetable,有上午、下午和晚上的各个课程信息,完全由程序生成。程序代码如下所示:
import pandas as pd
import numpy as np
print "-" * 40
course_name = ["Math", "English","Music", "Physics", "History",
"Law", "Botany", "Science", "Art", ""]
print course_name, "# course_name"
print "-" * 40
course_number = range(len(course_name))
print course_number, "# course_number"
print "-" * 40
d_map = dict(zip(course_number, course_name))
print d_map, "# dict for map"
# random number for timetable
print "-" * 40
dd = np.random.randint(0,len(course_name), size = 30).reshape((6,5))
df = pd.DataFrame(dd,
index=[['morning', 'morning', 'afternoon', 'afternoon', 'evening', 'evening'],
["1~2", "3~4", "5~6", "7~8", "9~10", "11~12"]],
columns=[["Mon", "Tues", "Wed", "Thur", "Fri"],
[1, 2, 3, 4, 5]])
print df, "# df random timetable in number"
print "-" * 40
for i in df.columns:
df.update(df[i].map(d_map))
print df, "# df timetable in names"
print "-" * 40
print df.loc["morning", "3~4"]
print "-" * 40
df.to_csv("time.csv")
执行结果:
----------------------------------------
['Math', 'English', 'Music', 'Physics', 'History', 'Law', 'Botany', 'Science', 'Art', ''] # course_name
----------------------------------------
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] # course_number
----------------------------------------
{0: 'Math', 1: 'English', 2: 'Music', 3: 'Physics', 4: 'History', 5: 'Law', 6: 'Botany', 7: 'Science', 8: 'Art', 9: ''} # dict for map
----------------------------------------
Mon Tues Wed Thur Fri
1 2 3 4 5
morning 1~2 9 4 6 8 0
3~4 8 2 4 6 1
afternoon 5~6 8 7 9 1 2
7~8 5 6 9 9 3
evening 9~10 7 1 7 2 0
11~12 6 9 8 7 3 # df random timetable in number
----------------------------------------
Mon Tues Wed Thur Fri
1 2 3 4 5
morning 1~2 History Botany Art Math
3~4 Art Music History Botany English
afternoon 5~6 Art Science English Music
7~8 Law Botany Physics
evening 9~10 Science English Science Music Math
11~12 Botany Art Science Physics # df timetable in names
----------------------------------------
Mon 1 Art
Tues 2 Music
Wed 3 History
Thur 4 Botany
Fri 5 English
Name: (morning, 3~4), dtype: object
----------------------------------------
来解释一下程序的各个部分:
1). 首先是列出所有课程信息,这里用course_name来记录所有的课程的名字。
course_name = ["Math", "English","Music", "Physics", "History",
"Law", "Botany", "Science", "Art", ""]
2). 课程编号
course_number = range(len(course_name))
设置课程编号的目的是后续随机构建一个5天6节课的课表,先用数字来填充每节课,之后借助字典
d_map = dict(zip(course_number, course_name))
使用map函数将课程编号换成对应的课程名字。
3). 生成随机课表二维矩阵。
dd = np.random.randint(0,len(course_name), size = 30).reshape((6,5))
语句reshape((6,5)
是5天,每天6节课。
4). 构建多层次索引的dataframe数据df。
df = pd.DataFrame(dd,
index=[['morning', 'morning', 'afternoon', 'afternoon', 'evening', 'evening'],
["1~2", "3~4", "5~6", "7~8", "9~10", "11~12"]],
columns=[["Mon", "Tues", "Wed", "Thur", "Fri"],
[1, 2, 3, 4, 5]])
5). 借助map函数将每列的课程编号换成课程名字。
for i in df.columns:
df.update(df[i].map(d_map))
6). 打印上午三四节的课程信息。
print df.loc["morning", "3~4"]
7). 最后将df数据输出到.csv文件,
df.to_csv("time.csv")
这里需要注意的是
print "-" * 40
for i in df.columns:
df.update(df[i].map(d_map))
print df, "# df timetable in names"
print "-" * 40
的打印结果
----------------------------------------
Mon Tues Wed Thur Fri
1 2 3 4 5
morning 1~2 History Botany Art Math
3~4 Art Music History Botany English
afternoon 5~6 Art Science English Music
7~8 Law Botany Physics
evening 9~10 Science English Science Music Math
11~12 Botany Art Science Physics # df timetable in names
----------------------------------------
和上图不一致,这是pandas输出处理了一下,从图可看出,其实每行都有两个索引,为了美观第一个行索引没有每行都打印出来。
- 层次化索引的dataframe可以为其index和column均设置名字name。
import pandas as pd
import numpy as np
print "-" * 40
course_name = ["Math", "English","Music", "Physics", "History",
"Law", "Botany", "Science", "Art", ""]
print course_name, "# course_name"
print "-" * 40
course_number = range(len(course_name))
print course_number, "# course_number"
print "-" * 40
d_map = dict(zip(course_number, course_name))
print d_map, "# dict for map"
# random number for timetable
print "-" * 40
dd = np.random.randint(0,len(course_name), size = 30).reshape((6,5))
df = pd.DataFrame(dd,
index=[['morning', 'morning', 'afternoon', 'afternoon', 'evening', 'evening'],
["1~2", "3~4", "5~6", "7~8", "9~10", "11~12"]],
columns=[["Mon", "Tues", "Wed", "Thur", "Fri"],
[1, 2, 3, 4, 5]])
print df, "# df random timetable in number"
print "-" * 40
for i in df.columns:
df.update(df[i].map(d_map))
print df, "# df timetable in names"
df.index.names = ["part", "section"]
df.columns.names = ["week", "day"]
print "-" * 40
print df, "# df with names"
print "-" * 40
由于是随机生成的课程矩阵,所以结果有不一致:
----------------------------------------
['Math', 'English', 'Music', 'Physics', 'History', 'Law', 'Botany', 'Science', 'Art', ''] # course_name
----------------------------------------
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] # course_number
----------------------------------------
{0: 'Math', 1: 'English', 2: 'Music', 3: 'Physics', 4: 'History', 5: 'Law', 6: 'Botany', 7: 'Science', 8: 'Art', 9: ''} # dict for map
----------------------------------------
Mon Tues Wed Thur Fri
1 2 3 4 5
morning 1~2 3 4 5 9 1
3~4 3 2 1 6 2
afternoon 5~6 0 1 8 6 8
7~8 2 0 6 5 6
evening 9~10 9 5 4 8 7
11~12 9 5 0 6 7 # df random timetable in number
----------------------------------------
Mon Tues Wed Thur Fri
1 2 3 4 5
morning 1~2 Physics History Law English
3~4 Physics Music English Botany Music
afternoon 5~6 Math English Art Botany Art
7~8 Music Math Botany Law Botany
evening 9~10 Law History Art Science
11~12 Law Math Botany Science # df timetable in names
----------------------------------------
week Mon Tues Wed Thur Fri
day 1 2 3 4 5
part section
morning 1~2 Physics History Law English
3~4 Physics Music English Botany Music
afternoon 5~6 Math English Art Botany Art
7~8 Music Math Botany Law Botany
evening 9~10 Law History Art Science
11~12 Law Math Botany Science # df with names
----------------------------------------
语句
df.index.names = ["part", "section"]
df.columns.names = ["week", "day"]
可以给index和columns设置名字,便于后续函数使用。
----------------------------------------
week Mon Tues Wed Thur Fri
day 1 2 3 4 5
part section
morning 1~2 Physics History Law English
3~4 Physics Music English Botany Music
afternoon 5~6 Math English Art Botany Art
7~8 Music Math Botany Law Botany
evening 9~10 Law History Art Science
11~12 Law Math Botany Science # df with names
----------------------------------------
17.2 调整索引级别
pandas 的swaplevel函数可以调整层次化索引的级别。
import pandas as pd
import numpy as np
course_name = ["Math", "English","Music", "Physics", "History",
"Law", "Botany", "Science", "Art", ""]
course_number = range(len(course_name))
d_map = dict(zip(course_number, course_name))
dd = np.random.randint(0,len(course_name), size = 30).reshape((6,5))
df = pd.DataFrame(dd,
index=[['morning', 'morning', 'afternoon', 'afternoon', 'evening', 'evening'],
["1~2", "3~4", "5~6", "7~8", "9~10", "11~12"]],
columns=[["Mon", "Tues", "Wed", "Thur", "Fri"],
[1, 2, 3, 4, 5]])
for i in df.columns:
df.update(df[i].map(d_map))
df.index.names = ["part", "section"]
df.columns.names = ["week", "day"]
print "-" * 40
print df, "# df with names"
print "-" * 40
print df.swaplevel("part", "section"), "# swaplevel"
print "-" * 40
执行结果:
----------------------------------------
week Mon Tues Wed Thur Fri
day 1 2 3 4 5
part section
morning 1~2 Music Art Physics Art Math
3~4 History Physics Art Art Physics
afternoon 5~6 History Math Science Math Botany
7~8 Art Botany Music Music Math
evening 9~10 Music Botany Law English
11~12 Art History History Botany Music # df with names
----------------------------------------
week Mon Tues Wed Thur Fri
day 1 2 3 4 5
section part
1~2 morning Music Art Physics Art Math
3~4 morning History Physics Art Art Physics
5~6 afternoon History Math Science Math Botany
7~8 afternoon Art Botany Music Music Math
9~10 evening Music Botany Law English
11~12 evening Art History History Botany Music # swaplevel
----------------------------------------
17.3 索引分级重排数据
由于层次化索引,以行为例,数据按行号排序sort_index是用最外层的还是最内层作为排序标准?可以选择不同的级别来进行数据排序,可以通过level来指定,最外层的行索引为0。
import pandas as pd
import numpy as np
course_name = ["Math", "English","Music", "Physics", "History",
"Law", "Botany", "Science", "Art", ""]
course_number = range(len(course_name))
d_map = dict(zip(course_number, course_name))
dd = np.random.randint(0,len(course_name), size = 30).reshape((6,5))
df = pd.DataFrame(dd,
index=[['morning', 'morning', 'afternoon', 'afternoon', 'evening', 'evening'],
["1~2", "3~4", "5~6", "7~8", "9~10", "11~12"]],
columns=[["Mon", "Tues", "Wed", "Thur", "Fri"],
[1, 2, 3, 4, 5]])
for i in df.columns:
df.update(df[i].map(d_map))
df.index.names = ["part", "section"]
df.columns.names = ["week", "day"]
print "-" * 40
print df, "# df with names"
print "-" * 40
print df.sort_index(level = 0), "# level = 0"
print "-" * 40
print df.sort_index(level = 1), "# level = 1"
print "-" * 40
执行结果:
----------------------------------------
week Mon Tues Wed Thur Fri
day 1 2 3 4 5
part section
morning 1~2 History Math Math Law Art
3~4 Botany Art Math Music History
afternoon 5~6 English Art History English Physics
7~8 Music Law Art Physics Music
evening 9~10 History Art Physics History
11~12 History Science Law Law # df with names
----------------------------------------
week Mon Tues Wed Thur Fri
day 1 2 3 4 5
part section
afternoon 5~6 English Art History English Physics
7~8 Music Law Art Physics Music
evening 11~12 History Science Law Law
9~10 History Art Physics History
morning 1~2 History Math Math Law Art
3~4 Botany Art Math Music History # level = 0
----------------------------------------
week Mon Tues Wed Thur Fri
day 1 2 3 4 5
part section
evening 11~12 History Science Law Law
morning 1~2 History Math Math Law Art
3~4 Botany Art Math Music History
afternoon 5~6 English Art History English Physics
7~8 Music Law Art Physics Music
evening 9~10 History Art Physics History # level = 1
----------------------------------------
level=0
意思是按"part"
下的值来排序数据,level=1
意思是按"section"
下的值来排序数据。
* sort_index函数默认(axis=0)是行排序,也可列排序(axis=1)。
import pandas as pd
import numpy as np
course_name = ["Math", "English","Music", "Physics", "History",
"Law", "Botany", "Science", "Art", ""]
course_number = range(len(course_name))
d_map = dict(zip(course_number, course_name))
dd = np.random.randint(0,len(course_name), size = 30).reshape((6,5))
df = pd.DataFrame(dd,
index=[['morning', 'morning', 'afternoon', 'afternoon', 'evening', 'evening'],
["1~2", "3~4", "5~6", "7~8", "9~10", "11~12"]],
columns=[["Mon", "Tues", "Wed", "Thur", "Fri"],
[1, 2, 3, 4, 5]])
for i in df.columns:
df.update(df[i].map(d_map))
df.index.names = ["part", "section"]
df.columns.names = ["week", "day"]
print "-" * 40
print df, "# df with names"
print "-" * 40
print df.sort_index(axis = 0), "# axis = 0"
print "-" * 40
print df.sort_index(axis = 1), "# axis = 1"
print "-" * 40
print df.sort_index(axis = 1, level = 1,ascending=False), "# axis = 1"
print "-" * 40
执行结果:
----------------------------------------
week Mon Tues Wed Thur Fri
day 1 2 3 4 5
part section
morning 1~2 Science History Art Music Math
3~4 Science Math History Music
afternoon 5~6 English Art English Music Law
7~8 History Botany Music Science Music
evening 9~10 English Science Math Physics History
11~12 Physics Music Music Music English # df with names
----------------------------------------
week Mon Tues Wed Thur Fri
day 1 2 3 4 5
part section
afternoon 5~6 English Art English Music Law
7~8 History Botany Music Science Music
evening 11~12 Physics Music Music Music English
9~10 English Science Math Physics History
morning 1~2 Science History Art Music Math
3~4 Science Math History Music # axis = 0
----------------------------------------
week Fri Mon Thur Tues Wed
day 5 1 4 2 3
part section
morning 1~2 Math Science Music History Art
3~4 Music Science History Math
afternoon 5~6 Law English Music Art English
7~8 Music History Science Botany Music
evening 9~10 History English Physics Science Math
11~12 English Physics Music Music Music # axis = 1
----------------------------------------
week Fri Thur Wed Tues Mon
day 5 4 3 2 1
part section
morning 1~2 Math Music Art History Science
3~4 Music History Math Science
afternoon 5~6 Law Music English Art English
7~8 Music Science Music Botany History
evening 9~10 History Physics Math Science English
11~12 English Music Music Music Physics # axis = 1, level = 1
----------------------------------------
17.4 层次化索引的分级统计
由于数据的索引层次化了,在对数据进行统计的时候,例如求和,也可以通过level参数来选择依据内层索引还是外层索引来统计结果。
import pandas as pd
import numpy as np
course_name = ["Math", "English","Music", "Physics", "History",
"Law", "Botany", "Science", "Art", ""]
course_number = range(len(course_name))
d_map = dict(zip(course_number, course_name))
dd = np.random.randint(0,len(course_name), size = 30).reshape((6,5))
df = pd.DataFrame(dd,
index=[['morning', 'morning', 'afternoon', 'afternoon', 'evening', 'evening'],
["1~2", "3~4", "5~6", "7~8", "9~10", "11~12"]],
columns=[["Mon", "Tues", "Wed", "Thur", "Fri"],
[1, 1, 2, 2, 3]])
print "-" * 40
print df
df.index.names = ["part", "section"]
df.columns.names = ["week", "day"]
print "-" * 40
print df.sum(level = 0), "# level = 0"
print "-" * 40
print df.sum(level = 1, axis = 1), "# level = 1, axis = 1"
print "-" * 40
执行结果:
----------------------------------------
Mon Tues Wed Thur Fri
1 1 2 2 3
morning 1~2 2 5 3 0 4
3~4 9 4 5 9 2
afternoon 5~6 8 9 3 4 9
7~8 3 7 3 8 1
evening 9~10 3 0 6 4 7
11~12 5 2 0 3 5 # df
----------------------------------------
week Mon Tues Wed Thur Fri
day 1 1 2 2 3
part
morning 11 9 8 9 6
afternoon 11 16 6 12 10
evening 8 2 6 7 12 # level = 0
----------------------------------------
day 1 2 3
part section
morning 1~2 7 3 4
3~4 13 14 2
afternoon 5~6 17 7 9
7~8 10 11 1
evening 9~10 3 10 7
11~12 7 3 5 # level = 1, axis = 1
----------------------------------------
17.5 去层次化索引
去层次化索引是想把层次化的变为非层次化的,二级将为一及,只有一层索引。还是以iris.csv数据为例,这个数据可以用行号和类型名分别为外层索引和内层索引先做成层次化索引的dataframe,然后去层次化。
import pandas as pd
import numpy as np
df = pd.read_csv("iris.csv", sep = ",", header = None, names = ["sepal_l","sepal_w", "petal_l", "petal_w", "class"])
print "-" * 40
print df[:10]
df.index.names = ["no."]
df["row_no."] = np.arange(len(df["sepal_l"]))
print "-" * 40
print df[:10], "# df"
df2 = df.set_index(["class","row_no."])
print "-" * 40
print df2[:10], "set_index"
print "-" * 40
print df2.index.levels[0], "# row index"
print "-" * 40
print df2.reset_index(level = 0)[:10], "reset level=0"
print "-" * 40
print df2.reset_index(level = 1)[:10], "reset level=1"
print "-" * 40
执行结果:
----------------------------------------
sepal_l sepal_w petal_l petal_w class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
----------------------------------------
sepal_l sepal_w petal_l petal_w class row_no.
no.
0 5.1 3.5 1.4 0.2 Iris-setosa 0
1 4.9 3.0 1.4 0.2 Iris-setosa 1
2 4.7 3.2 1.3 0.2 Iris-setosa 2
3 4.6 3.1 1.5 0.2 Iris-setosa 3
4 5.0 3.6 1.4 0.2 Iris-setosa 4
5 5.4 3.9 1.7 0.4 Iris-setosa 5
6 4.6 3.4 1.4 0.3 Iris-setosa 6
7 5.0 3.4 1.5 0.2 Iris-setosa 7
8 4.4 2.9 1.4 0.2 Iris-setosa 8
9 4.9 3.1 1.5 0.1 Iris-setosa 9 # df
----------------------------------------
sepal_l sepal_w petal_l petal_w
class row_no.
Iris-setosa 0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
5 5.4 3.9 1.7 0.4
6 4.6 3.4 1.4 0.3
7 5.0 3.4 1.5 0.2
8 4.4 2.9 1.4 0.2
9 4.9 3.1 1.5 0.1 set_index
----------------------------------------
Index([u'Iris-setosa', u'Iris-versicolor', u'Iris-virginica'], dtype='object', name=u'class') # row index
----------------------------------------
class sepal_l sepal_w petal_l petal_w
row_no.
0 Iris-setosa 5.1 3.5 1.4 0.2
1 Iris-setosa 4.9 3.0 1.4 0.2
2 Iris-setosa 4.7 3.2 1.3 0.2
3 Iris-setosa 4.6 3.1 1.5 0.2
4 Iris-setosa 5.0 3.6 1.4 0.2
5 Iris-setosa 5.4 3.9 1.7 0.4
6 Iris-setosa 4.6 3.4 1.4 0.3
7 Iris-setosa 5.0 3.4 1.5 0.2
8 Iris-setosa 4.4 2.9 1.4 0.2
9 Iris-setosa 4.9 3.1 1.5 0.1 reset level=0
----------------------------------------
row_no. sepal_l sepal_w petal_l petal_w
class
Iris-setosa 0 5.1 3.5 1.4 0.2
Iris-setosa 1 4.9 3.0 1.4 0.2
Iris-setosa 2 4.7 3.2 1.3 0.2
Iris-setosa 3 4.6 3.1 1.5 0.2
Iris-setosa 4 5.0 3.6 1.4 0.2
Iris-setosa 5 5.4 3.9 1.7 0.4
Iris-setosa 6 4.6 3.4 1.4 0.3
Iris-setosa 7 5.0 3.4 1.5 0.2
Iris-setosa 8 4.4 2.9 1.4 0.2
Iris-setosa 9 4.9 3.1 1.5 0.1 reset level=1
----------------------------------------
语句df2 = df.set_index(["class","row_no."])
将最后两列"class"和"row_no."作为两级索引,得到
----------------------------------------
sepal_l sepal_w petal_l petal_w
class row_no.
Iris-setosa 0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
5 5.4 3.9 1.7 0.4
6 4.6 3.4 1.4 0.3
7 5.0 3.4 1.5 0.2
8 4.4 2.9 1.4 0.2
9 4.9 3.1 1.5 0.1 set_index
----------------------------------------
这样的结果。语句print df2.reset_index(level = 0)
是以内层索引为索引,去掉外层的索引,得到
----------------------------------------
class sepal_l sepal_w petal_l petal_w
row_no.
0 Iris-setosa 5.1 3.5 1.4 0.2
1 Iris-setosa 4.9 3.0 1.4 0.2
2 Iris-setosa 4.7 3.2 1.3 0.2
3 Iris-setosa 4.6 3.1 1.5 0.2
4 Iris-setosa 5.0 3.6 1.4 0.2
5 Iris-setosa 5.4 3.9 1.7 0.4
6 Iris-setosa 4.6 3.4 1.4 0.3
7 Iris-setosa 5.0 3.4 1.5 0.2
8 Iris-setosa 4.4 2.9 1.4 0.2
9 Iris-setosa 4.9 3.1 1.5 0.1 reset level=0
----------------------------------------
而语句df2.reset_index(level = 1)
则是以外层做为索引,去掉内层索引,得到:
----------------------------------------
row_no. sepal_l sepal_w petal_l petal_w
class
Iris-setosa 0 5.1 3.5 1.4 0.2
Iris-setosa 1 4.9 3.0 1.4 0.2
Iris-setosa 2 4.7 3.2 1.3 0.2
Iris-setosa 3 4.6 3.1 1.5 0.2
Iris-setosa 4 5.0 3.6 1.4 0.2
Iris-setosa 5 5.4 3.9 1.7 0.4
Iris-setosa 6 4.6 3.4 1.4 0.3
Iris-setosa 7 5.0 3.4 1.5 0.2
Iris-setosa 8 4.4 2.9 1.4 0.2
Iris-setosa 9 4.9 3.1 1.5 0.1 reset level=1
----------------------------------------