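Python For Data Analysis: Chapter 5, Section 1
The first section of Chapter 5 of Python For Data Analysis introduces the essentials of the pandas module and discusses its two data structures, Series and DataFrame.
7.1 pandas Series
A pandas Series stores a one-dimensional sequence of values. It resembles a Python list or a NumPy array: all three are collections whose items can be accessed by integer position. Unlike the other two, a Series lets you attach a label (its index) to every value when you create it. Labels can be numbers, characters, strings, and so on, and a value can then be looked up by its label. That makes a Series feel a little like a Python dict, but a dict is hash-based and (in the Python 2 used here) unordered, whereas a Series keeps its values in order.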
import pandas as pd
s1 = pd.Series([12.3, 1, -3.5])
print s1, "# s1"
Output:
0 12.3
1 1.0
2 -3.5
dtype: float64 # s1
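The leftmost column of the output is the index.
- If no index is given when a Series is created, integers starting at 0 are used by default. Next, let's check whether items of a Series are accessed the same way as in a list or an array.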
import pandas as pd
s1 = pd.Series([12.3, 1, -3.5])
print s1, "# s1"
print s1[1], "# s1[1]"
s1[0] = 95.27
print s1, "# s1"
Output:
0 12.3
1 1.0
2 -3.5
dtype: float64 # s1
1.0 # s1[1]
0 95.27
1 1.00
2 -3.50
dtype: float64 # s1
A Series can be read and written with square brackets, just like a list or an array. Does slicing work too?
import pandas as pd
s1 = pd.Series([12.3, 1, -3.5, "9527"])
print s1, "# s1"
print s1[1:], "# s1[1:]"
print type(s1[1]), "# type of s1[1]"
print type(s1[3]), "# type of s1[3]"
Output:
0 12.3
1 1
2 -3.5
3 9527
dtype: object # s1
1 1
2 -3.5
3 9527
dtype: object # s1[1:]
<type 'int'> # type of s1[1]
<type 'str'> # type of s1[3]
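So far a Series looks a lot like a list or an array. Note that mixing numbers with a string gives the Series the dtype object, and each item keeps its own Python type.
- Now for something different: give the Series an explicit index.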
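The dtype change seen in the output above can be checked directly. The following is a minimal sketch, assuming Python 3 and a recent pandas (the examples in this section use Python 2); the variable names are illustrative.

```python
import pandas as pd

# All-numeric input: pandas picks one numeric dtype for the whole Series.
numeric = pd.Series([12.3, 1, -3.5])

# Adding a string forces the generic "object" dtype, where every item
# keeps its original Python type.
mixed = pd.Series([12.3, 1, -3.5, "9527"])

print(numeric.dtype)         # float64
print(mixed.dtype)           # object
print(type(mixed.iloc[1]))   # <class 'int'>
print(type(mixed.iloc[3]))   # <class 'str'>
```

Using iloc here makes it explicit that the items are looked up by position.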
import pandas as pd
s1 = pd.Series([12.3, 1, -3.5, "9527"], index = [2,3,4,5])
print s1, "# s1"
s2 = pd.Series([12.3, 1, -3.5, "9527"], index = "a b c d".split())
print s2, "# s2"
Output:
2 12.3
3 1
4 -3.5
5 9527
dtype: object # s1
a 12.3
b 1
c -3.5
d 9527
dtype: object # s2
What happens now when we access a single element, or take a slice?
import pandas as pd
s1 = pd.Series([12.3, 1, -3.5, "9527"], index = [2,3,4,5])
print "-" * 20
print s1, "# s1"
print s1[5], "# s1[5]"
print s1[2:], "# s1[2:]"
s2 = pd.Series([12.3, 1, -3.5, "9527"], index = "c1 b2 a3 d4".split())
print "-" * 20
print s2, "# s2"
print s2['c1'], "# s2['c1']"
print s2["c1" : "a3"], '# s2["c1" : "a3"]'
Output:
--------------------
2 12.3
3 1
4 -3.5
5 9527
dtype: object # s1
9527 # s1[5]
4 -3.5
5 9527
dtype: object # s1[2:]
--------------------
c1 12.3
b2 1
a3 -3.5
d4 9527
dtype: object # s2
12.3 # s2['c1']
c1 12.3
b2 1
a3 -3.5
dtype: object # s2["c1" : "a3"]
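Note that the index of $s1$ is $[2,3,4,5]$; it does not start at 0. In s1[5], the 5 is therefore a label, not a position: 5 exceeds the largest valid position of $s1$, whose length is 4, and with the default index the largest label would be 3.
Also pay attention to the result of $s1[2:]$. It is not $[12.3, 1, -3.5, "9527"]$ but $[-3.5, "9527"]$, because the slice is taken by position rather than by label. So take care when using integers as the index: it is easy to confuse them with the default index that starts at 0.
- The index and values attributes. Since the values and the index are supplied when a Series is created, it is natural to expect the Series to expose them as attributes.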
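One way to avoid the ambiguity between labels and positions is to state the intent explicitly with the loc and iloc accessors. This is a minimal sketch, assuming Python 3 and a recent pandas; it is not part of the original examples.

```python
import pandas as pd

s1 = pd.Series([12.3, 1, -3.5, "9527"], index=[2, 3, 4, 5])

# .loc always interprets its argument as an index label.
by_label = s1.loc[5]          # the item labeled 5
label_slice = s1.loc[2:]      # labels 2, 3, 4, 5 (all four items)

# .iloc always interprets its argument as a 0-based position.
by_position = s1.iloc[3]      # the fourth item
position_slice = s1.iloc[2:]  # the items at positions 2 and 3

print(by_label, by_position)
print(list(label_slice.index))
print(list(position_slice.index))
```

With an integer index, spelling out loc or iloc keeps the code readable regardless of how plain square brackets resolve the key.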
import pandas as pd
s1 = pd.Series([12.3, 1, -3.5, "9527"], index = [2,3,4,5])
print "-" * 20
print s1, "# s1"
print "index:", s1.index
print "value:", s1.values
s2 = pd.Series([12.3, 1, -3.5, "9527"], index = "c1 b2 a3 d4".split())
print "-" * 20
print s2, "# s2"
print "index:", s2.index
print "value:", s2.values
Output:
--------------------
2 12.3
3 1
4 -3.5
5 9527
dtype: object # s1
index: Int64Index([2, 3, 4, 5], dtype='int64')
value: [12.3 1 -3.5 '9527']
--------------------
c1 12.3
b2 1
a3 -3.5
d4 9527
dtype: object # s2
index: Index([u'c1', u'b2', u'a3', u'd4'], dtype='object')
value: [12.3 1 -3.5 '9527']
- A Series is also a bit like a dictionary.
import pandas as pd
import numpy as np
ind = "c1 b2 a3 d4".split()
val = [4, 2, 1, 3]
s = pd.Series(val, index = ind)
d = dict(zip(ind, val))
print d, "# dict of d"
print "4 in d ?", 4 in d
print '"b2" in d?', "b2" in d
print "-" * 20
print s, "# s"
print "4 in s ?", 4 in s
print '"b2" in s?',"b2" in s
Output:
{'c1': 4, 'd4': 3, 'a3': 1, 'b2': 2} # dict of d
4 in d ? False
"b2" in d? True
--------------------
c1 4
b2 2
a3 1
d4 3
dtype: int64 # s
4 in s ? False
"b2" in s? True
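A dict can of course be converted to a Series. By default the dict keys become the index of the Series, but you can also pass an index explicitly. If that index does not exactly match the dict keys, the Series is built from the index: a label that matches a dict key takes that key's value, and a label with no matching key gets a missing value (NaN).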
import pandas as pd
import numpy as np
indd = "c1 b2 a3 e4".split()
val = [4, 2, 1, 3]
d = dict(zip(indd, val))
inds = "c1 b2 a3 d4".split()
s = pd.Series(d, index = inds)
print d, "# dict of d"
print "-" * 20
print s, "# s"
Output:
{'e4': 3, 'c1': 4, 'a3': 1, 'b2': 2} # dict of d
--------------------
c1 4.0
b2 2.0
a3 1.0
d4 NaN
dtype: float64 # s
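Note that the label $d4$ in the Series index is not a key of the dict, so the value of s at $d4$ is missing, that is, NaN. The pandas functions isnull and notnull test whether each item of a Series is missing.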
import pandas as pd
import numpy as np
indd = "c1 b2 a3 e4".split()
val = [4, 2, 1, 3]
d = dict(zip(indd, val))
inds = "c1 b2 a3 d4".split()
s = pd.Series(d, index = inds)
print d, "# dict of d"
print "-" * 20
print s, "# s"
print pd.isnull(s), "# pd.isnull(s)"
print pd.notnull(s), "# pd.notnull(s)"
Output:
{'e4': 3, 'c1': 4, 'a3': 1, 'b2': 2} # dict of d
--------------------
c1 4.0
b2 2.0
a3 1.0
d4 NaN
dtype: float64 # s
c1 False
b2 False
a3 False
d4 True
dtype: bool # pd.isnull(s)
c1 True
b2 True
a3 True
d4 False
dtype: bool # pd.notnull(s)
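Once missing items have been detected, they are usually either dropped or replaced. The sketch below, assuming Python 3 and a recent pandas, uses the Series methods isnull, dropna and fillna on the same data; the replacement value 0 is an arbitrary choice for illustration.

```python
import pandas as pd

d = {"c1": 4, "b2": 2, "a3": 1, "e4": 3}
s = pd.Series(d, index=["c1", "b2", "a3", "d4"])

# isnull() is also available as a Series method.
missing_count = int(s.isnull().sum())

# dropna() returns a new Series without the missing items.
present = s.dropna()

# fillna() returns a new Series with the missing items replaced.
filled = s.fillna(0)

print(missing_count)
print(list(present.index))
print(filled["d4"])
```

Both dropna and fillna return new objects here, so s itself still contains the NaN.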
- NumPy-style boolean selection and functions. Because pandas is built on NumPy, features of NumPy arrays such as arithmetic and boolean-array selection also work on a pandas Series.
import pandas as pd
import numpy as np
s2 = pd.Series([4, 2, 1, 3], index = "c1 b2 a3 d4".split())
print "-" * 20
print s2, "# s2"
print s2[["a3", "b2", "d4"]], '# s2[["a3", "b2", "d4"]]'
print s2[s2 > 2], "# s2[s2 > 2]"
print s2 ** 2, "# s2 ** 2"
print np.sqrt(s2), "# np.sqrt(s2)"
Output:
--------------------
c1 4
b2 2
a3 1
d4 3
dtype: int64 # s2
a3 1
b2 2
d4 3
dtype: int64 # s2[["a3", "b2", "d4"]]
c1 4
d4 3
dtype: int64 # s2[s2 > 2]
c1 16
b2 4
a3 1
d4 9
dtype: int64 # s2 ** 2
c1 2.000000
b2 1.414214
a3 1.000000
d4 1.732051
dtype: float64 # np.sqrt(s2)
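What distinguishes these operations from their plain NumPy counterparts is that the labels travel with the values. A minimal sketch, assuming Python 3 and a recent pandas, that breaks the boolean selection above into its steps:

```python
import numpy as np
import pandas as pd

s2 = pd.Series([4, 2, 1, 3], index=["c1", "b2", "a3", "d4"])

# A comparison yields a boolean Series with the same index ...
mask = s2 > 2

# ... which selects the items where the mask is True.
selected = s2[mask]

# A NumPy ufunc returns a Series, and the labels are preserved.
roots = np.sqrt(s2)

print(mask.tolist())
print(list(selected.index))
print(roots["c1"])
```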
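7.2 pandas DataFrame
A pandas DataFrame is a two-dimensional, tabular data structure with rows and columns, similar in spirit to a two-dimensional NumPy array. The difference is that rows and columns can be labeled not only with integers but also with other data types, for example strings. This makes a DataFrame look much like a spreadsheet in a tool such as Excel, or a table in a database such as MySQL.
- pandas offers many ways to create a DataFrame. For example, a dict whose values are lists of equal length, or a two-dimensional NumPy array, can be passed to the DataFrame constructor. With a dict, each key becomes a column label and its value becomes the data in that column.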
import pandas as pd
d = {"name":"jack tom mike john".split(),
"age":[17,18, 19, 18]}
df = pd.DataFrame(d)
print df, "# dataframe"
Output:
age name
0 17 jack
1 18 tom
2 19 mike
3 18 john # dataframe
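As the result shows, the rows only get the default integer index. A row index can be specified with the index argument of the DataFrame constructor.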
import pandas as pd
d = {"name":"jack tom mike john".split(),
"age":[17,18, 19, 18]}
df = pd.DataFrame(d)
print df, "# dataframe"
ind = [9521, 9526, 9527, 9530]
df1 = pd.DataFrame(d, index = ind)
print df1, "# df1 dataframe"
Output:
age name
0 17 jack
1 18 tom
2 19 mike
3 18 john # dataframe
age name
9521 17 jack
9526 18 tom
9527 19 mike
9530 18 john # df1 dataframe
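The examples above build a DataFrame from a dict; a two-dimensional NumPy array works as well. The following is a minimal sketch, assuming Python 3 and a recent pandas. The array contents and the column labels "x" and "y" are placeholders chosen for illustration only.

```python
import numpy as np
import pandas as pd

# A 4x2 array of placeholder values: 4 rows, 2 columns.
data = np.arange(8).reshape(4, 2)

# An array carries no labels, so they are supplied through
# the index and columns arguments.
df = pd.DataFrame(data,
                  index=[9521, 9526, 9527, 9530],
                  columns=["x", "y"])

print(df)
print(df.shape)
```

If index or columns is omitted, pandas falls back to default integer labels starting at 0.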
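- Creating a DataFrame with explicit column and row labels. The index argument of the DataFrame constructor sets the row labels, and the columns argument sets the column labels.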
import pandas as pd
cols = [ "age", "name"]
ind = [9521, 9526, 9527, 9530]
d = {"name":"jack tom mike john".split(),
"age":[17,18, 19, 18]}
df = pd.DataFrame(d, index = ind, columns = cols)
print "-" * 20
print df, "# df"
df.index.set_names("id", inplace=True)
print "-" * 20
print df, "# df"
print "-" * 20
Output:
--------------------
age name
9521 17 jack
9526 18 tom
9527 19 mike
9530 18 john # df
--------------------
age name
id
9521 17 jack
9526 18 tom
9527 19 mike
9530 18 john # df
--------------------
- An existing column can be accessed by indexing:
dataframe[column_name]
or as an attribute:
dataframe.column_name
For example:
import pandas as pd
d = {"name":"jack tom mike john".split(),
"age":[17,18, 19, 18]}
df = pd.DataFrame(d)
print "-" * 20
print df, "# dataframe"
print "-" * 20
ind = [9521, 9526, 9527, 9530]
df1 = pd.DataFrame(d, index = ind)
print df1, "# df1 dataframe"
print "-" * 20
print df1["name"], '# df1["name"]'
print "-" * 20
print df1.name, '# df1.name'
print "-" * 20
print df1["age"], '# df1["age"]'
print "-" * 20
print df1.age, '# df1.age'
print "-" * 20
Output:
--------------------
age name
0 17 jack
1 18 tom
2 19 mike
3 18 john # dataframe
--------------------
age name
9521 17 jack
9526 18 tom
9527 19 mike
9530 18 john # df1 dataframe
--------------------
9521 jack
9526 tom
9527 mike
9530 john
Name: name, dtype: object # df1["name"]
--------------------
9521 jack
9526 tom
9527 mike
9530 john
Name: name, dtype: object # df1.name
--------------------
9521 17
9526 18
9527 19
9530 18
Name: age, dtype: int64 # df1["age"]
--------------------
9521 17
9526 18
9527 19
9530 18
Name: age, dtype: int64 # df1.age
--------------------
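The two access styles are not always interchangeable: attribute access only works when the column name does not clash with an existing DataFrame attribute or method. A minimal sketch, assuming Python 3 and a recent pandas; the column name "count" is hypothetical and chosen because DataFrame has a count method.

```python
import pandas as pd

# "count" is a hypothetical column name that clashes with DataFrame.count().
df = pd.DataFrame({"name": ["jack", "tom"], "count": [3, 5]})

# Bracket access always returns the column.
col = df["count"]

# Attribute access finds the method first, not the column.
attr = df.count

# "name" does not clash with anything on a DataFrame, so both styles agree.
names = df.name

print(col.tolist())
print(callable(attr))
print(names.tolist())
```

For that reason bracket access is the safer default, and attribute access is best kept for interactive exploration.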
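- Modifying a column. A column can be assigned through indexing, either a single value (a scalar) that fills the whole column or an array of the same length as the column.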
import pandas as pd
import numpy as np
cols = [ "age", "name"]
ind = [9521, 9526, 9527, 9530]
d = {"name":"jack tom mike john".split(),
"age":[17,18, 19, 18]}
df = pd.DataFrame(d, index = ind, columns = cols)
print "-" * 20
print df, "# df"
print "-" * 20
df["age"] = 20
print df, '# df["age"] = 20'
print "-" * 20
df.age = np.arange(len(df.age))
print df, '# df.age = np.arange(len(df.age))'
print "-" * 20
Output:
--------------------
age name
9521 17 jack
9526 18 tom
9527 19 mike
9530 18 john # df
--------------------
age name
9521 20 jack
9526 20 tom
9527 20 mike
9530 20 john # df["age"] = 20
--------------------
age name
9521 0 jack
9526 1 tom
9527 2 mike
9530 3 john # df.age = np.arange(len(df.age))
--------------------
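The requirement that an assigned array has the same length as the column is enforced by pandas. The following sketch, assuming Python 3 and a recent pandas, shows an accepted assignment and a rejected one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [17, 18, 19, 18],
                   "name": ["jack", "tom", "mike", "john"]},
                  index=[9521, 9526, 9527, 9530])

# Same length as the DataFrame: accepted.
df["age"] = np.arange(len(df))

# Wrong length: pandas refuses the assignment with a ValueError.
try:
    df["age"] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = exc

print(df["age"].tolist())
print(type(length_error).__name__)
```

The failed assignment leaves the column as it was after the first, successful one.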
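A Series can also be used to change the values of a column. Unlike an array, a Series has an index, so the assignment is aligned on labels: rows of the DataFrame whose labels are not in the Series are filled with a missing value (NaN).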
import pandas as pd
cols = [ "age", "name"]
ind = [9521, 9526, 9527, 9530]
d = {"name":"jack tom mike john".split(),
"age":[17,18, 19, 18]}
df = pd.DataFrame(d, index = ind, columns = cols)
print "-" * 20
print df, "# df"
print "-" * 20
s = pd.Series([22,23,20], index = [9527, 9588, 9526])
df["age"] = s
print df, '# df["age"] = s'
print "-" * 20
Output:
--------------------
age name
9521 17 jack
9526 18 tom
9527 19 mike
9530 18 john # df
--------------------
age name
9521 NaN jack
9526 20.0 tom
9527 22.0 mike
9530 NaN john # df["age"] = s
--------------------
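Extra items in the Series are not added to the DataFrame; only matching rows are updated, and the remaining rows are filled with NaN.
- Adding a new column. Rows with no data are filled with NaN; the index of the original DataFrame, and hence its number of rows, is not affected. A column can be deleted with the del statement:
del dataframe[column_name]
The following example adds a column.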
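The label alignment performed by the assignment can be made visible by reindexing the Series to the row labels of the DataFrame first. This is a sketch under the assumption of Python 3 and a recent pandas; the variable name aligned is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"age": [17, 18, 19, 18],
                   "name": ["jack", "tom", "mike", "john"]},
                  index=[9521, 9526, 9527, 9530])
s = pd.Series([22, 23, 20], index=[9527, 9588, 9526])

# Reindexing s to the row labels of df shows what the assignment stores:
# matching labels keep their value, the others become NaN, 9588 is dropped.
aligned = s.reindex(df.index)

df["age"] = s

print(aligned)
print(df["age"].isnull().tolist())
```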
import pandas as pd
cols = [ "age", "name"]
ind = [9521, 9526, 9527, 9530]
d = {"name":"jack tom mike john".split(),
"age":[17,18, 19, 18]}
df = pd.DataFrame(d, index = ind, columns = cols)
print "-" * 20
print df, "# df"
s = pd.Series([72,83,90], index = [9527, 9588, 9526])
print "-" * 20
print s, "# s"
df["math"] = s
print "-" * 20
print df, "# df"
Output:
--------------------
age name
9521 17 jack
9526 18 tom
9527 19 mike
9530 18 john # df
--------------------
9527 72
9588 83
9526 90
dtype: int64 # s
--------------------
age name math
9521 17 jack NaN
9526 18 tom 90.0
9527 19 mike 72.0
9530 18 john NaN # df
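Note that the item labeled 9588 in the Series s was not added to df. The next example adds the same column and then deletes it again with del.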
import pandas as pd
cols = [ "age", "name"]
ind = [9521, 9526, 9527, 9530]
d = {"name":"jack tom mike john".split(),
"age":[17,18, 19, 18]}
df = pd.DataFrame(d, index = ind, columns = cols)
print "-" * 20
print df, "# df"
s = pd.Series([72,83,90], index = [9527, 9588, 9526])
print "-" * 20
print s, "# s"
df["math"] = s
print "-" * 20
print df, "# df"
del df['math']
print df, "# df"
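The del statement changes the DataFrame in place. When the original should be kept, the drop method returns a copy without the column instead. A minimal sketch, assuming Python 3 and a recent pandas, with a shortened version of the data used above:

```python
import pandas as pd

df = pd.DataFrame({"age": [17, 18], "math": [72.0, 90.0]},
                  index=[9527, 9526])

# drop() leaves df untouched and returns a copy without the column.
without_math = df.drop(columns=["math"])
columns_before_del = list(df.columns)

# del modifies df itself.
del df["math"]
columns_after_del = list(df.columns)

print(list(without_math.columns))
print(columns_before_del)
print(columns_after_del)
```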
- A row can be accessed through the loc attribute of the DataFrame.
dataframe.loc[row_label]
For example:
import pandas as pd
d = {"name":"jack tom mike john".split(),
"age":[17,18, 19, 18]}
ind = [9521, 9526, 9527, 9530]
df = pd.DataFrame(d, index = ind)
print "-" * 20
print df, "# df"
print "-" * 20
print df.index, "# df.index"
print "-" * 20
print df.loc[9527], "# df.loc[9527]"
print "-" * 20
print df.loc[9527].name, df.loc[9527].age
Output:
--------------------
age name
9521 17 jack
9526 18 tom
9527 19 mike
9530 18 john # df
--------------------
Int64Index([9521, 9526, 9527, 9530], dtype='int64') # df.index
--------------------
age 19
name mike
Name: 9527, dtype: object # df.loc[9527]
--------------------
9527 19
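The last line of the output deserves a comment: df.loc[9527].name prints 9527 rather than mike. The row returned by loc is a Series, and the name attribute of a Series holds its own name (here the row label), which takes precedence over a column that happens to be called "name". The sketch below, assuming Python 3 and a recent pandas, shows ways to read the cell value.

```python
import pandas as pd

df = pd.DataFrame({"name": ["jack", "tom", "mike", "john"],
                   "age": [17, 18, 19, 18]},
                  index=[9521, 9526, 9527, 9530])

row = df.loc[9527]

# .name is the name of the row Series (its label in df), not the "name" column.
print(row.name)        # 9527

# Index by the column label to get the cell value.
print(row["name"])     # mike

# loc also accepts a row label and a column label together.
print(df.loc[9527, "name"], df.loc[9527, "age"])
```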
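7.3 Index objects
The index of a Series or DataFrame is an immutable object. pandas provides Index methods that compute set results such as the intersection or union of the indexes of two Series or DataFrames. Methods such as insert, delete, and drop create a new Index and leave the original Series or DataFrame unaffected.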
import pandas as pd
cols = [ "age", "name"]
ind = pd.Index([9521, 9526, 9527, 9530])
print "-" * 20
print ind, "# index obj"
print type(ind), "# index obj"
d = {"name":"jack tom mike john".split(),
"age":[17,18, 19, 18]}
df = pd.DataFrame(d, index = ind, columns = cols)
print "-" * 20
print df, "# df"
s = pd.Series([72,83,90], index = [9527, 9588, 9526])
print "-" * 20
print s, "# s"
print "-" * 20
print pd.Index.intersection(ind, s.index), "# intersection"
print pd.Index.difference(ind, s.index), "# difference"
print pd.Index.difference(s.index, ind), "# difference"
t = s.index.drop(9588)
print "-" * 20
print t, "# t = s.index.drop(9588)"
print s.index, "# s.index"
Output:
--------------------
Int64Index([9521, 9526, 9527, 9530], dtype='int64') # index obj
<class 'pandas.core.indexes.numeric.Int64Index'> # index obj
--------------------
age name
9521 17 jack
9526 18 tom
9527 19 mike
9530 18 john # df
--------------------
9527 72
9588 83
9526 90
dtype: int64 # s
--------------------
Int64Index([9526, 9527], dtype='int64') # intersection
Int64Index([9521, 9530], dtype='int64') # difference
Int64Index([9588], dtype='int64') # difference
--------------------
Int64Index([9527, 9526], dtype='int64') # t = s.index.drop(9588)
Int64Index([9527, 9588, 9526], dtype='int64') # s.index
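The immutability of an Index, and the fact that the set operations return new objects, can be checked with a short sketch. It assumes Python 3 and a recent pandas, and calls the set operations as methods on the Index rather than through pd.Index as above; the results are sorted before printing so that the display does not depend on the order pandas returns.

```python
import pandas as pd

ind = pd.Index([9521, 9526, 9527, 9530])
other = pd.Index([9527, 9588, 9526])

# Assigning to an element of an Index is rejected with a TypeError.
try:
    ind[0] = 1
    mutable = True
except TypeError:
    mutable = False

# Set operations are available as methods and return new Index objects.
common = ind.intersection(other)
only_in_ind = ind.difference(other)
combined = ind.union(other)

print(mutable)
print(sorted(common), sorted(only_in_ind), sorted(combined))
print(list(ind))   # the original Index is unchanged
```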
- The reindex method builds new data from an existing Series or DataFrame using a new index, so the original data is not affected.
import pandas as pd
cols = [ "age", "name"]
ind_old = pd.Index([9521, 9526, 9527, 9530])
ind_new = pd.Index([9521, 9526, 9527, 9522])
d = {"name":"jack tom mike john".split(),
"age":[17,18, 19, 18]}
dfo = pd.DataFrame(d, index = ind_old, columns = cols)
print "-" * 20
print dfo, "# df ind_old"
dfn = dfo.reindex(ind_new)
print "-" * 20
print dfo, "# df ind_old"
print "-" * 20
print dfn, "# df ind_new"
Output:
--------------------
age name
9521 17 jack
9526 18 tom
9527 19 mike
9530 18 john # df ind_old
--------------------
age name
9521 17 jack
9526 18 tom
9527 19 mike
9530 18 john # df ind_old
--------------------
age name
9521 17.0 jack
9526 18.0 tom
9527 19.0 mike
9522 NaN NaN # df ind_new
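In the result above the new label 9522 is filled with NaN, which also turns the age column into floats. reindex accepts a fill_value argument to supply a different replacement. A minimal sketch on a Series, assuming Python 3 and a recent pandas; the fill value 0 is an arbitrary choice for illustration.

```python
import pandas as pd

ages = pd.Series([17, 18, 19, 18], index=[9521, 9526, 9527, 9530])
new_labels = [9521, 9526, 9527, 9522]

# Default: the new label 9522 gets NaN.
with_nan = ages.reindex(new_labels)

# fill_value supplies a replacement for labels missing from the original.
with_zero = ages.reindex(new_labels, fill_value=0)

print(with_nan.isnull().tolist())
print(with_zero.tolist())
print(list(ages.index))   # the original Series is unchanged
```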