Python For Data Analysis: Chapter 5, Section 1

Section 1 of Chapter 5 of "Python For Data Analysis" introduces the pandas module and discusses its two data structures for storing data, Series and DataFrame.

7.1 The pandas Series

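A pandas Series stores a one-dimensional sequence of values. It resembles a Python list or a NumPy array: all three are collections whose items can be accessed by integer position. Unlike those two, a Series lets you attach a label (the index) to each value when you create it. Labels can be numbers, characters, strings, and so on, and a value can then be looked up by its label. That makes a Series feel a bit like a Python dict, but a dict is hash-based and (in the Python 2 used here) unordered, whereas a Series keeps its values in order.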

import pandas as pd
s1 = pd.Series([12.3, 1, -3.5])
print s1, "# s1"

Output:

0    12.3
1     1.0
2    -3.5
dtype: float64 # s1

The leftmost column of the output is the index.

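  • If no index is given when the Series is created, integers starting at 0 are used as the index by default. Next, check whether items of a Series are accessed the same way as items of a list or array.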
import pandas as pd
s1 = pd.Series([12.3, 1, -3.5])
print s1, "# s1"
print s1[1], "# s1[1]"
s1[0] = 95.27
print s1, "# s1"

Output:

0    12.3
1     1.0
2    -3.5
dtype: float64 # s1
1.0 # s1[1]
0    95.27
1     1.00
2    -3.50
dtype: float64 # s1

Items of a Series are read and written with square brackets, just as with a list or an array. Does slicing work too?

import pandas as pd
s1 = pd.Series([12.3, 1, -3.5, "9527"])
print s1, "# s1"
print s1[1:], "# s1[1:]"
print type(s1[1]), "# type of s1[1]"
print type(s1[3]), "# type of s1[3]"

Output:

0    12.3
1       1
2    -3.5
3    9527
dtype: object # s1
1       1
2    -3.5
3    9527
dtype: object # s1[1:]
<type 'int'> # type of s1[1]
<type 'str'> # type of s1[3]

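So far a Series looks a lot like a list or an array. Note, though, that mixing numbers with a string turns the dtype into object, and each item keeps its original Python type.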
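To make the effect of mixed element types concrete, here is a minimal sketch. It assumes Python 3 and a reasonably recent pandas (the listings above use Python 2), and the variable names are made up for illustration.

```python
import pandas as pd

# All-numeric input: pandas picks one numeric dtype for the whole Series.
numeric = pd.Series([12.3, 1, -3.5])

# Numbers mixed with a string: the Series falls back to the object dtype
# and every item stays the Python object it was.
mixed = pd.Series([12.3, 1, -3.5, "9527"])

print(numeric.dtype)
print(mixed.dtype)
print(type(mixed.iloc[1]), type(mixed.iloc[3]))
```

The positional accessor iloc is used here only so that the lookup does not depend on how the index is labelled.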

  • Now for something different: give the Series an explicit index.
import pandas as pd
s1 = pd.Series([12.3, 1, -3.5, "9527"], index = [2,3,4,5])
print s1, "# s1"
s2 = pd.Series([12.3, 1, -3.5, "9527"], index = "a b c d".split())
print s2, "# s2"

Output:

2    12.3
3       1
4    -3.5
5    9527
dtype: object # s1
a    12.3
b       1
c    -3.5
d    9527
dtype: object # s2

What happens now when accessing a single element? And when slicing?

import pandas as pd
s1 = pd.Series([12.3, 1, -3.5, "9527"], index = [2,3,4,5])
print "-" * 20
print s1, "# s1"
print s1[5], "# s1[5]"
print s1[2:], "# s1[2:]"
s2 = pd.Series([12.3, 1, -3.5, "9527"], index = "c1 b2 a3 d4".split())
print "-" * 20
print s2, "# s2"
print s2['c1'], "# s2['c1']"
print s2["c1" : "a3"], '# s2["c1" : "a3"]'

Output:

--------------------
2    12.3
3       1
4    -3.5
5    9527
dtype: object # s1
9527 # s1[5]
4    -3.5
5    9527
dtype: object # s1[2:]
--------------------
c1    12.3
b2       1
a3    -3.5
d4    9527
dtype: object # s2
12.3 # s2['c1']
c1    12.3
b2       1
a3    -3.5
dtype: object # s2["c1" : "a3"]

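Note that the index of $s1$ is $[2,3,4,5]$; it does not start at 0. In s1[5], the 5 is a label, not a position: 5 is beyond the last valid position of a 4-element Series (with the default index the largest label would be 3), yet the lookup succeeds and returns "9527". Also look closely at the result of $s1[2:]$: the slice is taken by position, so it yields $[-3.5, "9527"]$ rather than the full $[12.3, 1, -3.5, "9527"]$ that a label-based reading would suggest. With an integer index, take care not to confuse the labels with the default 0-based positions.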
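One way to avoid this ambiguity is to say explicitly whether a label or a position is meant, using the loc and iloc accessors. The sketch below assumes Python 3 syntax and reuses the data from the listing above; the variable names are illustrative.

```python
import pandas as pd

s1 = pd.Series([12.3, 1, -3.5, "9527"], index=[2, 3, 4, 5])

by_label = s1.loc[5]        # label 5 -> "9527"
by_position = s1.iloc[0]    # first item -> 12.3
label_slice = s1.loc[2:4]   # label slice, both ends included: labels 2, 3, 4
pos_slice = s1.iloc[2:]     # positions 2 onward: labels 4, 5

print(by_label, by_position)
print(list(label_slice.index), list(pos_slice.index))
```

Unlike a positional slice, a label slice with loc includes its end label.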

  • The index and values attributes. Since the values and the index are supplied when a Series is created, it is natural to expect the Series to expose them as attributes.
import pandas as pd
s1 = pd.Series([12.3, 1, -3.5, "9527"], index = [2,3,4,5])
print "-" * 20
print s1, "# s1"
print "index:", s1.index
print "value:", s1.values
s2 = pd.Series([12.3, 1, -3.5, "9527"], index = "c1 b2 a3 d4".split())
print "-" * 20
print s2, "# s2"
print "index:", s2.index
print "value:", s2.values

Output:

--------------------
2    12.3
3       1
4    -3.5
5    9527
dtype: object # s1
index: Int64Index([2, 3, 4, 5], dtype='int64')
value: [12.3 1 -3.5 '9527']
--------------------
c1    12.3
b2       1
a3    -3.5
d4    9527
dtype: object # s2
index: Index([u'c1', u'b2', u'a3', u'd4'], dtype='object')
value: [12.3 1 -3.5 '9527']
  • A Series also behaves somewhat like a dict.
import pandas as pd
import numpy as np
ind = "c1 b2 a3 d4".split()
val = [4, 2, 1, 3]
s = pd.Series(val, index = ind)
d = dict(zip(ind, val))
print d, "# dict of d"
print "4 in d   ?", 4 in d
print '"b2" in d?', "b2" in d
print "-" * 20
print s, "# s"
print "4 in s   ?", 4 in s
print '"b2" in s?',"b2" in s

Output:

{'c1': 4, 'd4': 3, 'a3': 1, 'b2': 2} # dict of d
4 in d   ? False
"b2" in d? True
--------------------
c1    4
b2    2
a3    1
d4    3
dtype: int64 # s
4 in s   ? False
"b2" in s? True

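A dict can of course be converted to a Series. By default the dict's keys become the Series index, but an index can also be passed explicitly. If that index does not match the dict's keys exactly, the Series is built from the index: labels that match a key take that key's value, and labels with no matching key get a missing value.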

import pandas as pd
import numpy as np
indd = "c1 b2 a3 e4".split()
val = [4, 2, 1, 3]
d = dict(zip(indd, val))
inds = "c1 b2 a3 d4".split()
s = pd.Series(d, index = inds)
print d, "# dict of d"
print "-" * 20
print s, "# s"

Output:

{'e4': 3, 'c1': 4, 'a3': 1, 'b2': 2} # dict of d
--------------------
c1    4.0
b2    2.0
a3    1.0
d4    NaN
dtype: float64 # s

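Note that the label $d4$ in the Series index is not among the dict's keys, so the value at $d4$ in s is missing, i.e. NaN. The pandas functions isnull and notnull test whether each item of a Series is missing.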

import pandas as pd
import numpy as np

indd = "c1 b2 a3 e4".split()
val = [4, 2, 1, 3]
d = dict(zip(indd, val))
inds = "c1 b2 a3 d4".split()
s = pd.Series(d, index = inds)
print d, "# dict of d"
print "-" * 20
print s, "# s"
print pd.isnull(s), "# pd.isnull(s)"
print pd.notnull(s), "# pd.notnull(s)"

Output:

{'e4': 3, 'c1': 4, 'a3': 1, 'b2': 2} # dict of d
--------------------
c1    4.0
b2    2.0
a3    1.0
d4    NaN
dtype: float64 # s
c1    False
b2    False
a3    False
d4     True
dtype: bool # pd.isnull(s)
c1     True
b2     True
a3     True
d4    False
dtype: bool # pd.notnull(s)
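Building on the missing-value checks above, the following sketch shows a few common follow-ups: counting the missing entries, filling them, and dropping them. It assumes Python 3 syntax; isnull, fillna and dropna are regular Series methods, while the variable names are made up for illustration.

```python
import pandas as pd

d = {"c1": 4, "b2": 2, "a3": 1, "e4": 3}
s = pd.Series(d, index=["c1", "b2", "a3", "d4"])

missing_count = int(s.isnull().sum())  # True counts as 1, so this counts NaNs
filled = s.fillna(0)                   # new Series with NaN replaced by 0
dropped = s.dropna()                   # new Series without the NaN entries

print(missing_count)
print(filled)
print(dropped)
```

Both fillna and dropna return new Series here; s itself still holds the NaN.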
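  • Like a NumPy array, a Series supports selection by a list of labels, boolean filtering, and element-wise arithmetic; the index stays attached to the values.
import pandas as pd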
import numpy as np
s2 = pd.Series([4, 2, 1, 3], index = "c1 b2 a3 d4".split())
print "-" * 20
print s2, "# s2"
print s2[["a3", "b2", "d4"]], '# s2[["a3", "b2", "d4"]]'
print s2[s2 > 2], "# s2[s2 > 2]"
print s2 ** 2, "# s2 ** 2"
print np.sqrt(s2), "# np.sqrt(s2)"

Output:

--------------------
c1    4
b2    2
a3    1
d4    3
dtype: int64 # s2
a3    1
b2    2
d4    3
dtype: int64 # s2[["a3", "b2", "d4"]]
c1    4
d4    3
dtype: int64 # s2[s2 > 2]
c1    16
b2     4
a3     1
d4     9
dtype: int64 # s2 ** 2
c1    2.000000
b2    1.414214
a3    1.000000
d4    1.732051
dtype: float64 # np.sqrt(s2)

7.2 The pandas DataFrame

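A pandas DataFrame is a two-dimensional, table-like structure with rows and columns, much like a two-dimensional NumPy array. The difference is that rows and columns can be addressed not only by integer position but also by labels of other types, such as strings. This makes a DataFrame look essentially like a spreadsheet made in Excel or a table in a database such as MySQL.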

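  • pandas offers many ways to create a DataFrame. For example, a dict whose values are lists of equal length, or a two-dimensional NumPy array, both work. With a dict, each key becomes a column label and its value supplies the data for that column.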
import pandas as pd
d = {"name":"jack tom mike john".split(),
     "age":[17,18, 19, 18]}
df = pd.DataFrame(d)
print df, "# dataframe"

Output:

   age  name
0   17  jack
1   18   tom
2   19  mike
3   18  john # dataframe
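The text also mentions building a DataFrame from a two-dimensional NumPy array, which the listing above does not show. A minimal sketch, assuming Python 3 syntax; the height column and its numbers are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: one row per person, columns are age and height in cm.
arr = np.array([[17, 170], [18, 175], [19, 168]])

# An array carries no labels of its own, so they are passed explicitly.
df = pd.DataFrame(arr, index=["jack", "tom", "mike"], columns=["age", "height"])
print(df)
```

If index and columns are omitted, both default to integers starting at 0.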

As the output shows, the rows have only the default integer index. A row index can be specified in the DataFrame constructor.

import pandas as pd
d = {"name":"jack tom mike john".split(),
     "age":[17,18, 19, 18]}
df = pd.DataFrame(d)
print df, "# dataframe"
ind = [9521, 9526, 9527, 9530]
df1 = pd.DataFrame(d, index = ind)
print df1, "# df1 dataframe"

Output:

   age  name
0   17  jack
1   18   tom
2   19  mike
3   18  john # dataframe
      age  name
9521   17  jack
9526   18   tom
9527   19  mike
9530   18  john # df1 dataframe
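  • Creating a DataFrame with explicit column and row labels: the index parameter of the DataFrame constructor sets the row labels, and the columns parameter sets the column labels.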
import pandas as pd
cols = [ "age", "name"]
ind = [9521, 9526, 9527, 9530]
d = {"name":"jack tom mike john".split(),
     "age":[17,18, 19, 18]}
df = pd.DataFrame(d, index = ind, columns = cols)
print "-" * 20
print df, "# df"
df.index.set_names("id", inplace=True)
print "-" * 20
print df, "# df"
print "-" * 20

Output:

--------------------
      age  name
9521   17  jack
9526   18   tom
9527   19  mike
9530   18  john # df
--------------------
      age  name
id             
9521   17  jack
9526   18   tom
9527   19  mike
9530   18  john # df
--------------------
  • Existing columns can be accessed by indexing:
dataframe[column_name]

or as an attribute:

dataframe.column_name

For example:

import pandas as pd
d = {"name":"jack tom mike john".split(),
     "age":[17,18, 19, 18]}
df = pd.DataFrame(d)
print "-" * 20
print df, "# dataframe"
print "-" * 20
ind = [9521, 9526, 9527, 9530]
df1 = pd.DataFrame(d, index = ind)
print df1, "# df1 dataframe"
print "-" * 20
print df1["name"], '# df1["name"]'
print "-" * 20
print df1.name, '# df1.name'
print "-" * 20
print df1["age"], '# df1["age"]'
print "-" * 20
print df1.age, '# df1.age'
print "-" * 20

Output:

--------------------
   age  name
0   17  jack
1   18   tom
2   19  mike
3   18  john # dataframe
--------------------
      age  name
9521   17  jack
9526   18   tom
9527   19  mike
9530   18  john # df1 dataframe
--------------------
9521    jack
9526     tom
9527    mike
9530    john
Name: name, dtype: object # df1["name"]
--------------------
9521    jack
9526     tom
9527    mike
9530    john
Name: name, dtype: object # df1.name
--------------------
9521    17
9526    18
9527    19
9530    18
Name: age, dtype: int64 # df1["age"]
--------------------
9521    17
9526    18
9527    19
9530    18
Name: age, dtype: int64 # df1.age
--------------------
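The two access styles are not fully interchangeable: attribute access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The sketch below assumes Python 3 syntax; the columns "count" and "home town" are hypothetical, chosen only to trigger the two cases.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["jack", "tom"],
    "count": [3, 5],                 # clashes with the DataFrame.count method
    "home town": ["east", "west"],   # contains a space, not a valid identifier
})

names = df.name                      # fine: no clash, valid identifier
count_attr_is_method = callable(df.count)   # the method wins over the column
counts = df["count"]                 # bracket indexing always reaches the column
towns = df["home town"]

print(count_attr_is_method)
print(counts.tolist(), towns.tolist())
```

Bracket indexing is therefore the safer default, with attribute access as a shorthand for simple names.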
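  • Modifying a column: assign through indexing either a single value (a scalar), which is applied to the whole column, or an array of the same length as the column.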
import pandas as pd
import numpy as np
cols = [ "age", "name"]
ind = [9521, 9526, 9527, 9530]
d = {"name":"jack tom mike john".split(),
     "age":[17,18, 19, 18]}
df = pd.DataFrame(d, index = ind, columns = cols)
print "-" * 20
print df, "# df"
print "-" * 20
df["age"] = 20
print df, '# df["age"] = 20'
print "-" * 20
df.age = np.arange(len(df.age))
print df, '# df.age = np.arange(len(df.age))'
print "-" * 20

Output:

--------------------
      age  name
9521   17  jack
9526   18   tom
9527   19  mike
9530   18  john # df
--------------------
      age  name
9521   20  jack
9526   20   tom
9527   20  mike
9530   20  john # df["age"] = 20
--------------------
      age  name
9521    0  jack
9526    1   tom
9527    2  mike
9530    3  john # df.age = np.arange(len(df.age))
--------------------
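The equal-length requirement is enforced: assigning a list or array whose length differs from the number of rows raises an error. A minimal sketch, assuming Python 3 syntax and illustrative variable names:

```python
import pandas as pd

d = {"name": "jack tom mike john".split(), "age": [17, 18, 19, 18]}
df = pd.DataFrame(d, index=[9521, 9526, 9527, 9530])

try:
    df["age"] = [1, 2]          # 2 values for 4 rows
    length_error = None
except ValueError as exc:
    length_error = str(exc)     # pandas reports the length mismatch

print(length_error)
```

A scalar never triggers this error, because it is repeated for every row.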

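A column can also be assigned from a Series. Unlike an array, a Series carries its own index, so values are matched by label. Rows of the DataFrame whose label is absent from the Series are filled with NaN.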

import pandas as pd
cols = [ "age", "name"]
ind = [9521, 9526, 9527, 9530]
d = {"name":"jack tom mike john".split(),
     "age":[17,18, 19, 18]}
df = pd.DataFrame(d, index = ind, columns = cols)
print "-" * 20
print df, "# df"
print "-" * 20
s = pd.Series([22,23,20], index = [9527, 9588, 9526])
df["age"] = s
print df, '# df["age"] = s'
print "-" * 20

Output:

--------------------
      age  name
9521   17  jack
9526   18   tom
9527   19  mike
9530   18  john # df
--------------------
       age  name
9521   NaN  jack
9526  20.0   tom
9527  22.0  mike
9530   NaN  john # df["age"] = s
--------------------

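Extra entries in the Series are not added to the DataFrame. Only rows with matching labels are updated, and the remaining rows are filled with NaN.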

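  • Adding a new column: rows with no matching data are filled with NaN, and the DataFrame's index (that is, its number of rows) is unchanged. A column can be deleted with the del statement:
del dataframe[column_name]

The following example adds a column.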

import pandas as pd
cols = [ "age", "name"]
ind = [9521, 9526, 9527, 9530]
d = {"name":"jack tom mike john".split(),
     "age":[17,18, 19, 18]}
df = pd.DataFrame(d, index = ind, columns = cols)
print "-" * 20
print df, "# df"
s = pd.Series([72,83,90], index = [9527, 9588, 9526])
print "-" * 20
print s, "$ s"
df["math"] = s
print "-" * 20
print df, "# df"

Output:

--------------------
      age  name
9521   17  jack
9526   18   tom
9527   19  mike
9530   18  john # df
--------------------
9527    72
9588    83
9526    90
dtype: int64 $ s
--------------------
      age  name  math
9521   17  jack   NaN
9526   18   tom  90.0
9527   19  mike  72.0
9530   18  john   NaN # df

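Note that the entry labelled 9588 in the Series s was not added to df.

The next listing repeats the example and then deletes the new column with del.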

import pandas as pd
cols = [ "age", "name"]
ind = [9521, 9526, 9527, 9530]
d = {"name":"jack tom mike john".split(),
     "age":[17,18, 19, 18]}
df = pd.DataFrame(d, index = ind, columns = cols)
print "-" * 20
print df, "# df"
s = pd.Series([72,83,90], index = [9527, 9588, 9526])
print "-" * 20
print s, "$ s"
df["math"] = s
print "-" * 20
print df, "# df"
del df['math']
print df, "# df"
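del modifies the DataFrame in place. When the original should be kept, the drop method is an alternative that returns a new DataFrame. The sketch below assumes Python 3 syntax and a pandas version that accepts the columns keyword of drop; the variable names are illustrative.

```python
import pandas as pd

d = {"name": "jack tom mike john".split(), "age": [17, 18, 19, 18]}
df = pd.DataFrame(d, index=[9521, 9526, 9527, 9530])
df["math"] = pd.Series([72, 83, 90], index=[9527, 9588, 9526])

without_math = df.drop(columns=["math"])     # new frame, df is untouched
still_in_df = "math" in df.columns

del df["math"]                               # removes the column from df itself
gone_from_df = "math" not in df.columns

print(list(without_math.columns), still_in_df, gone_from_df)
```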
  • Rows can be accessed through the DataFrame's loc attribute.
dataframe.loc[row_label]

For example:

import pandas as pd
d = {"name":"jack tom mike john".split(),
     "age":[17,18, 19, 18]}
ind = [9521, 9526, 9527, 9530]
df = pd.DataFrame(d, index = ind)
print "-" * 20
print df, "# df"
print "-" * 20
print df.index, "# df.index"
print "-" * 20
print df.loc[9527], "# df.loc[9527]"
print "-" * 20
print df.loc[9527].name, df.loc[9527].age

Output:

--------------------
      age  name
9521   17  jack
9526   18   tom
9527   19  mike
9530   18  john # df
--------------------
Int64Index([9521, 9526, 9527, 9530], dtype='int64') # df.index
--------------------
age       19
name    mike
Name: 9527, dtype: object # df.loc[9527]
--------------------
9527 19
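The last line of the output prints 9527 rather than mike. A row returned by loc is itself a Series, and its name attribute holds the row label, so it shadows the column called "name". A minimal sketch of the difference, assuming Python 3 syntax and illustrative variable names:

```python
import pandas as pd

d = {"name": "jack tom mike john".split(), "age": [17, 18, 19, 18]}
df = pd.DataFrame(d, index=[9521, 9526, 9527, 9530])

row = df.loc[9527]        # row selected by label
row_label = row.name      # the Series' own name: the row label 9527
person = row["name"]      # bracket indexing reaches the "name" column
third = df.iloc[2]        # the same row, selected by position

print(row_label, person, third["age"])
```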

7.3 The Index object

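The index of a Series or DataFrame is immutable: its elements cannot be modified in place. pandas provides Index methods that compute the intersection, union, and other set operations on the indexes of two Series or DataFrames. Methods such as insert, delete, and drop return a new Index, so the original Series or DataFrame is not affected.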

import pandas as pd
cols = [ "age", "name"]
ind = pd.Index([9521, 9526, 9527, 9530])
print "-" * 20
print ind, "# index obj"
print type(ind), "# index obj"
d = {"name":"jack tom mike john".split(),
     "age":[17,18, 19, 18]}
df = pd.DataFrame(d, index = ind, columns = cols)
print "-" * 20
print df, "# df"
s = pd.Series([72,83,90], index = [9527, 9588, 9526])
print "-" * 20
print s, "$ s"
print "-" * 20
print pd.Index.intersection(ind, s.index), "# intersection"
print pd.Index.difference(ind, s.index), "# difference"
print pd.Index.difference(s.index, ind), "# difference"
t = s.index.drop(9588)
print "-" * 20
print t, "# t = s.index.drop(9588)"
print s.index,  "# s.index"

Output:

--------------------
Int64Index([9521, 9526, 9527, 9530], dtype='int64') # index obj
<class 'pandas.core.indexes.numeric.Int64Index'> # index obj
--------------------
      age  name
9521   17  jack
9526   18   tom
9527   19  mike
9530   18  john # df
--------------------
9527    72
9588    83
9526    90
dtype: int64 $ s
--------------------
Int64Index([9526, 9527], dtype='int64') # intersection
Int64Index([9521, 9530], dtype='int64') # difference
Int64Index([9588], dtype='int64') # difference
--------------------
Int64Index([9527, 9526], dtype='int64') # t = s.index.drop(9588)
Int64Index([9527, 9588, 9526], dtype='int64') # s.index
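The listing above covers intersection, difference and drop. The sketch below adds the remaining operations mentioned in the text (union, insert, delete) and shows what immutability means in practice. It assumes Python 3 syntax; the label 9522 and the variable names are made up for illustration.

```python
import pandas as pd

ind = pd.Index([9521, 9526, 9527, 9530])
other = pd.Index([9527, 9588, 9526])

try:
    ind[0] = 1                   # item assignment is not supported
    assignment_worked = True
except TypeError:
    assignment_worked = False

union = ind.union(other)         # labels present in either index
inserted = ind.insert(1, 9522)   # new Index with 9522 at position 1
deleted = ind.delete(0)          # new Index without the item at position 0

print(assignment_worked)
print(union, inserted, deleted, ind)
```

Each call returns a new Index, and ind still holds its original four labels afterwards.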
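  • The reindex method builds new data from an existing Series or DataFrame using a new index, which means the original data is not affected. Labels in the new index that are absent from the original get NaN.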
import pandas as pd
cols = [ "age", "name"]
ind_old = pd.Index([9521, 9526, 9527, 9530])
ind_new = pd.Index([9521, 9526, 9527, 9522])
d = {"name":"jack tom mike john".split(),
     "age":[17,18, 19, 18]}
dfo = pd.DataFrame(d, index = ind_old, columns = cols)
print "-" * 20
print dfo, "# df ind_old"
dfn = dfo.reindex(ind_new)
print "-" * 20
print dfo, "# df ind_old"
print "-" * 20
print dfn, "# df ind_new"

Output:

--------------------
      age  name
9521   17  jack
9526   18   tom
9527   19  mike
9530   18  john # df ind_old
--------------------
      age  name
9521   17  jack
9526   18   tom
9527   19  mike
9530   18  john # df ind_old
--------------------
       age  name
9521  17.0  jack
9526  18.0   tom
9527  19.0  mike
9522   NaN   NaN # df ind_new
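In the output above the age column became float because NaN had to be stored for the new label 9522. If a default value makes sense, the fill_value parameter of reindex can supply it instead. The sketch below assumes Python 3 syntax and applies the idea to a single column; the default of 0 is an arbitrary choice for illustration.

```python
import pandas as pd

age = pd.Series([17, 18, 19, 18], index=[9521, 9526, 9527, 9530])
new_index = [9521, 9526, 9527, 9522]

with_nan = age.reindex(new_index)                  # 9522 gets NaN, dtype turns float
with_zero = age.reindex(new_index, fill_value=0)   # 9522 gets 0 instead

print(with_nan)
print(with_zero)
print(age)   # the original Series is unchanged
```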