Python For Data Analysis: Chapter 5, Section 2
Section 2 of Chapter 5 of Python For Data Analysis is too long for one page, so this page covers the remainder of that section.
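9.1 Arithmetic operations on data
pandas mainly works with two data types, Series and DataFrame, so arithmetic in pandas comes in three forms: Series with Series, DataFrame with DataFrame, and Series with DataFrame.
- Same-type operands, i.e. arithmetic between two Series or between two DataFrames. Two objects of the same type can be combined with the arithmetic operators (+, -, *, /). pandas automatically aligns them on their labels (both row and column labels for a DataFrame) and operates on the matching entries. Wherever the labels of the two operands do not overlap, the result is filled with NaN. If you call the corresponding arithmetic method (add, sub, and so on) instead of the operator, you can pass fill_value to substitute a given value for the missing entries before the operation is carried out. The example below uses addition to illustrate arithmetic on pandas objects.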
import pandas as pd
import numpy as np
# df1 and df2 share only rows s, t and columns b, c, d
df1 = pd.DataFrame(np.arange(24).reshape((6,4)), columns = list("abcd"), index=list("opqrst"))
df2 = pd.DataFrame(np.arange(25).reshape((5,5)), columns = list("bxced"), index=list("ijkst"))
print("-" * 40)
print(df1, "# df1")
print("-" * 40)
print(df2, "# df2")
print("-" * 40)
# the + operator and add() without fill_value give the same result
print(df1 + df2, "# df1 + df2")
print("-" * 40)
print(df1.add(df2), "# df1.add")
print("-" * 40)
# fill_value=0 stands in for entries that only one frame has
print(df1.add(df2, fill_value=0), "# df1 add fill_value")
print("-" * 40)
Output:
----------------------------------------
a b c d
o 0 1 2 3
p 4 5 6 7
q 8 9 10 11
r 12 13 14 15
s 16 17 18 19
t 20 21 22 23 # df1
----------------------------------------
b x c e d
i 0 1 2 3 4
j 5 6 7 8 9
k 10 11 12 13 14
s 15 16 17 18 19
t 20 21 22 23 24 # df2
----------------------------------------
a b c d e x
i NaN NaN NaN NaN NaN NaN
j NaN NaN NaN NaN NaN NaN
k NaN NaN NaN NaN NaN NaN
o NaN NaN NaN NaN NaN NaN
p NaN NaN NaN NaN NaN NaN
q NaN NaN NaN NaN NaN NaN
r NaN NaN NaN NaN NaN NaN
s NaN 32.0 35.0 38.0 NaN NaN
t NaN 41.0 44.0 47.0 NaN NaN # df1 + df2
----------------------------------------
a b c d e x
i NaN NaN NaN NaN NaN NaN
j NaN NaN NaN NaN NaN NaN
k NaN NaN NaN NaN NaN NaN
o NaN NaN NaN NaN NaN NaN
p NaN NaN NaN NaN NaN NaN
q NaN NaN NaN NaN NaN NaN
r NaN NaN NaN NaN NaN NaN
s NaN 32.0 35.0 38.0 NaN NaN
t NaN 41.0 44.0 47.0 NaN NaN # df1.add
----------------------------------------
a b c d e x
i NaN 0.0 2.0 4.0 3.0 1.0
j NaN 5.0 7.0 9.0 8.0 6.0
k NaN 10.0 12.0 14.0 13.0 11.0
o 0.0 1.0 2.0 3.0 NaN NaN
p 4.0 5.0 6.0 7.0 NaN NaN
q 8.0 9.0 10.0 11.0 NaN NaN
r 12.0 13.0 14.0 15.0 NaN NaN
s 16.0 32.0 35.0 38.0 18.0 16.0
t 20.0 41.0 44.0 47.0 23.0 21.0 # df1 add fill_value
----------------------------------------
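The statement df1.add(df2, fill_value=0) fills first and then adds. The output shows that the resulting DataFrame is larger than either operand: its row and column labels are the union of the two DataFrames' labels, much like an outer join.
In this example df1 and df2 share rows 's' and 't' and columns 'b', 'c' and 'd'.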
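Rules for addition, in summary:
1). The result's labels are the union of the two DataFrames' row labels and column labels.
2). Where both DataFrames have a value at a position (same row label and same column label), the result is the sum of the two values (in the example, rows s and t, columns b, c and d).
3). Where only one DataFrame has a value at a position, the missing operand is replaced by fill_value before the operation (in the example, rows i, j and k of df2, and columns x and e of rows s and t).
4). Positions where neither DataFrame has a value stay NaN (in the example, column a of rows i, j and k, and columns e and x of rows o, p, q and r).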
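The four rules can be checked on a much smaller pair of frames. The sketch below is illustrative only: the names left and right and their values are made up for this note, and the frames overlap in just one cell so each rule is easy to spot.

```python
import numpy as np
import pandas as pd

# Hypothetical frames that share only row "s" and column "b".
left = pd.DataFrame([[1, 2], [3, 4]], index=["r", "s"], columns=["a", "b"])
right = pd.DataFrame([[10, 20], [30, 40]], index=["s", "t"], columns=["b", "c"])

plain = left + right                    # no filling: only the overlapping cell survives
filled = left.add(right, fill_value=0)  # one-sided gaps are treated as 0 before adding

print(plain)
print(filled)
```

In plain, only cell (s, b) holds a number. In filled, the two cells that neither frame covers, (r, c) and (t, a), are still NaN, which is rule 4.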
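- Mixed-type operands, i.e. arithmetic between a Series and a DataFrame. Because the two types have different dimensions, the arithmetic methods such as add and sub take an axis parameter that specifies whether the Series is matched against the DataFrame's rows or its columns. As with operations between one-dimensional and two-dimensional arrays in NumPy, the lower-dimensional operand is broadcast.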
import numpy as np
arr1 = np.ones(5)       # one-dimensional, shape (5,)
arr5 = np.ones([5,5])   # two-dimensional, shape (5, 5)
print(arr1)
print(arr5)
# arr1 is broadcast across every row of arr5
print(arr5 - arr1)
Output:
[1. 1. 1. 1. 1.]# arr1
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]# arr5
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]# arr5 -arr1
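Because every value above is 1, the direction of the broadcast is hard to see. The following sketch uses distinct, made-up values (the names grid, row and col are hypothetical) to show a one-dimensional array being applied to each row, and a column-shaped array being applied to each column.

```python
import numpy as np

grid = np.arange(6).reshape((2, 3))  # [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])         # shape (3,): lined up with each row of grid
col = np.array([[100], [200]])       # shape (2, 1): lined up with each column of grid

by_row = grid + row   # [[10, 21, 32], [13, 24, 35]]
by_col = grid + col   # [[100, 101, 102], [203, 204, 205]]

print(by_row)
print(by_col)
```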
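The same applies to a Series and a DataFrame: one is one-dimensional and the other two-dimensional, and the computation works much like mixed one-dimensional and two-dimensional array arithmetic.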
import pandas as pd
import numpy as np
ind = list("abcd")
col = list("efg")
#print(ind, "# ind")
#print(col, "# col")
# s has one value per column label of d
s = pd.Series(np.ones(3), index = col)
print("-" * 40)
print(s, "# s(series)")
d = pd.DataFrame(np.ones((4,3)) * 2, index = ind, columns = col)
print("-" * 40)
print(d, "# d(dataframe)")
print("-" * 40)
print(d - s, "# d - s")
print("-" * 40)
print(d.sub(s), "# d.sub(s)")
print("-" * 40)
print(d.sub(s, axis = 'columns'), "# d.sub(s, axis = 'columns')")
print("-" * 40)
# s1 has one value per row label of d
s1 = pd.Series(np.ones(4), index = ind)
print(s1, "# series")
print("-" * 40)
print(d.sub(s1, axis = 'index'), "# d.sub(s1, axis = 'index')")
print("-" * 40)
Output:
----------------------------------------
e 1.0
f 1.0
g 1.0
dtype: float64 # s(series)
----------------------------------------
e f g
a 2.0 2.0 2.0
b 2.0 2.0 2.0
c 2.0 2.0 2.0
d 2.0 2.0 2.0 # d(dataframe)
----------------------------------------
e f g
a 1.0 1.0 1.0
b 1.0 1.0 1.0
c 1.0 1.0 1.0
d 1.0 1.0 1.0 # d - s
----------------------------------------
e f g
a 1.0 1.0 1.0
b 1.0 1.0 1.0
c 1.0 1.0 1.0
d 1.0 1.0 1.0 # d.sub(s)
----------------------------------------
e f g
a 1.0 1.0 1.0
b 1.0 1.0 1.0
c 1.0 1.0 1.0
d 1.0 1.0 1.0 # d.sub(s, axis = 'columns')
----------------------------------------
a 1.0
b 1.0
c 1.0
d 1.0
dtype: float64 # series
----------------------------------------
e f g
a 1.0 1.0 1.0
b 1.0 1.0 1.0
c 1.0 1.0 1.0
d 1.0 1.0 1.0 # d.sub(s1, axis = 'index')
----------------------------------------
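By default, arithmetic between a Series and a DataFrame works row by row: the Series index is matched against the DataFrame's column labels and the operation is applied to every row. The statement d.sub(s, axis = 'columns') therefore means the same as the call without an axis argument. To match the Series against the row labels and operate on every column instead, specify axis = 'index'.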
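The example above uses constant values, so both axis settings print identical tables. A small sketch with distinct, made-up numbers (frame, per_column and per_row are hypothetical names) makes the difference between the two settings visible.

```python
import pandas as pd

frame = pd.DataFrame([[10, 20, 30], [40, 50, 60]],
                     index=["a", "b"], columns=["e", "f", "g"])
per_column = pd.Series([1, 2, 3], index=["e", "f", "g"])  # one value per column label
per_row = pd.Series([10, 40], index=["a", "b"])           # one value per row label

# Default (axis="columns"): the Series is subtracted from every row.
by_columns = frame.sub(per_column)   # [[9, 18, 27], [39, 48, 57]]
# axis="index": the Series is subtracted from every column.
by_index = frame.sub(per_row, axis="index")   # [[0, 10, 20], [0, 10, 20]]

print(by_columns)
print(by_index)
```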
9.2 The apply function
The apply function can be applied to a one-dimensional Series, or to the rows or columns of a DataFrame.
import pandas as pd
import numpy as np
import random
ind = list("ijklmn")
col = list("abcd")
# shuffle needs a list, so materialize the range first
t = list(range(24))
random.shuffle(t)
val = np.array(t).reshape((6,4))
#print(t, val, "# val")
print("-" * 40)
d = pd.DataFrame(val, index = ind, columns = col)
print(d, "# d")
print("-" * 40)
f = lambda x : max(x)
#print(f([2,1,4,3]))
# axis = "columns" passes each row to f; axis = "index" passes each column
print(d.apply(f, axis = "columns"), "# applied on rows")
print("-" * 40)
print(d.apply(f, axis = "index"), "# applied on cols")
print("-" * 40)
Output:
----------------------------------------
a b c d
i 11 19 1 16
j 0 7 10 3
k 14 9 20 8
l 17 22 18 13
m 12 23 4 2
n 21 6 5 15 # d
----------------------------------------
i 19
j 10
k 20
l 22
m 23
n 21
dtype: int64 # applied on rows
----------------------------------------
a 21
b 23
c 20
d 16
dtype: int64 # applied on cols
----------------------------------------
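The function f = lambda x : max(x) returns the largest value in the collection x. The statement d.apply(f, axis = "columns") finds the maximum of each row of the DataFrame d, and the statement d.apply(f, axis = "index") finds the maximum of each column of d.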
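Since the data above is shuffled randomly, the results change on every run. Here is a minimal deterministic sketch, with made-up values and a hypothetical helper named spread, that applies a function to rows, to columns, and to a single Series.

```python
import pandas as pd

frame = pd.DataFrame([[3, 8, 1], [7, 2, 9]], index=["i", "j"], columns=["a", "b", "c"])

def spread(values):
    """Difference between the largest and smallest value of one row or column."""
    return values.max() - values.min()

per_row = frame.apply(spread, axis="columns")  # i: 8 - 1 = 7, j: 9 - 2 = 7
per_col = frame.apply(spread, axis="index")    # a: 4, b: 6, c: 8
squared = frame["a"].apply(lambda v: v * v)    # Series.apply works value by value: 9, 49

print(per_row)
print(per_col)
print(squared)
```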
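9.3 The applymap function
Like NumPy arrays, DataFrames operate element-wise: adding two DataFrames adds the values at corresponding positions. A function can likewise be applied to every individual element of a DataFrame, for example with applymap.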
import pandas as pd
import numpy as np
import random
ind = list("ijklmn")
col = list("abcd")
# shuffle needs a list, so materialize the range first
t = list(range(24))
random.shuffle(t)
val = np.array(t).reshape((6,4))
#print(t, val, "# val")
print("-" * 40)
d = pd.DataFrame(val, index = ind, columns = col)
print(d, "# d")
print("-" * 40)
# the lambda is called once for every element of d
print(d.applymap(lambda x: x + 10), "# d.applymap")
print("-" * 40)
Output:
----------------------------------------
a b c d
i 21 22 19 12
j 18 17 20 9
k 16 13 2 8
l 3 11 7 23
m 10 5 6 4
n 15 0 1 14 # d
----------------------------------------
a b c d
i 31 32 29 22
j 28 27 30 19
k 26 23 12 18
l 13 21 17 33
m 20 15 16 14
n 25 10 11 24 # d.applymap
----------------------------------------
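Adding 10 could also be written as d + 10, so an element-wise function is more useful when no vectorized operator exists, such as formatting each number as a string. The sketch below uses made-up values. It also assumes, to the best of my knowledge, that newer pandas releases (2.1 and later) provide the same operation under the name DataFrame.map and deprecate applymap, so it picks whichever method is present.

```python
import pandas as pd

frame = pd.DataFrame([[1.5, 2.25], [3.0, 4.5]], index=["i", "j"], columns=["a", "b"])

# Assumption: DataFrame.map exists on newer pandas; fall back to applymap otherwise.
elementwise = frame.map if hasattr(frame, "map") else frame.applymap

# Format every element with two decimal places.
labels = elementwise(lambda v: "%.2f" % v)

print(labels)
```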
9.4 Sorting
- sort_index sorts a DataFrame by its row labels (axis = 0) or by its column labels (axis = 1) and returns a new DataFrame.
import pandas as pd
import numpy as np
import random
ind = list("nimljk")
col = list("dbac")
# shuffle needs a list, so materialize the range first
t = list(range(24))
random.shuffle(t)
val = np.array(t).reshape((6,4))
#print(t, val, "# val")
print("-" * 40)
d = pd.DataFrame(val, index = ind, columns = col)
print(d, "# d")
print("-" * 40)
# axis = 0 sorts by row labels, axis = 1 by column labels
print(d.sort_index(axis = 0), "# d.sort_index rows")
print("-" * 40)
print(d.sort_index(axis = 1), "# d.sort_index cols")
print("-" * 40)
print(d.sort_index(axis = 0).sort_index(axis = 1), "# d.sort_index all")
print("-" * 40)
Output:
----------------------------------------
d b a c
n 0 9 22 13
i 23 3 10 7
m 17 21 15 1
l 4 18 8 14
j 19 2 20 5
k 11 6 12 16 # d
----------------------------------------
d b a c
i 23 3 10 7
j 19 2 20 5
k 11 6 12 16
l 4 18 8 14
m 17 21 15 1
n 0 9 22 13 # d.sort_index rows
----------------------------------------
a b c d
n 22 9 13 0
i 10 3 7 23
m 15 21 1 17
l 8 18 14 4
j 20 2 5 19
k 12 6 16 11 # d.sort_index cols
----------------------------------------
a b c d
i 10 3 7 23
j 20 2 5 19
k 12 6 16 11
l 8 18 14 4
m 15 21 1 17
n 22 9 13 0 # d.sort_index all
----------------------------------------
The ascending parameter of sort_index selects ascending or descending order.
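As a brief illustration of ascending, the sketch below sorts a small made-up frame in descending label order along each axis.

```python
import pandas as pd

frame = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                     index=list("bca"), columns=list("yzx"))

rows_desc = frame.sort_index(axis=0, ascending=False)  # row labels: c, b, a
cols_desc = frame.sort_index(axis=1, ascending=False)  # column labels: z, y, x

print(rows_desc)
print(cols_desc)
```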
- sort_values sorts by the data values. The by parameter names one or more columns to use as sort keys, and the rows are reordered accordingly.
import pandas as pd
import numpy as np
import random
ind = list("nimljk")
col = list("dbac")
# values 0-3 repeated six times, so the sort keys contain ties
t = list(range(4)) * 6
random.shuffle(t)
val = np.array(t).reshape((6,4))
#print(t, val, "# val")
print("-" * 40)
d = pd.DataFrame(val, index = ind, columns = col)
print(d, "# d")
print("-" * 40)
print(d.sort_values(by = "b"), "# d.sort_values a col:b")
print("-" * 40)
# ties in column b are broken by column c
print(d.sort_values(by = ['b', 'c']), "# d.sort_values cols:b,c")
print("-" * 40)
Output:
----------------------------------------
d b a c
n 1 0 1 2
i 1 1 3 0
m 2 2 0 3
l 0 3 0 3
j 3 0 1 2
k 1 3 2 2 # d
----------------------------------------
d b a c
n 1 0 1 2
j 3 0 1 2
i 1 1 3 0
m 2 2 0 3
l 0 3 0 3
k 1 3 2 2 # d.sort_values a col:b
----------------------------------------
d b a c
n 1 0 1 2
j 3 0 1 2
i 1 1 3 0
m 2 2 0 3
k 1 3 2 2
l 0 3 0 3 # d.sort_values cols:b,c
----------------------------------------
When rows have equal values in column b, column c decides their final order. Note these rows in the result:
k 1 3 2 2
l 0 3 0 3 # d.sort_values cols:b,c
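The tie-breaking can be reproduced without random data. The sketch below uses a small made-up frame, and additionally passes a list to ascending to give each key its own direction, which sort_values accepts when by is a list.

```python
import pandas as pd

frame = pd.DataFrame({"b": [3, 0, 3, 0], "c": [3, 2, 2, 1]}, index=list("lnkj"))

# Sort by b first, then by c within equal b values.
by_b_then_c = frame.sort_values(by=["b", "c"])                       # j, n, k, l
# Same keys, but c is sorted in descending order within equal b values.
mixed = frame.sort_values(by=["b", "c"], ascending=[True, False])    # n, j, l, k

print(by_b_then_c)
print(mixed)
```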
- sort_values can also reorder the columns horizontally when axis=1 is given. In that case by names one or more row labels.
import pandas as pd
import numpy as np
import random
ind = list("nimljk")
col = list("dbac")
# values 0-3 repeated six times, so the sort keys contain ties
t = list(range(4)) * 6
random.shuffle(t)
val = np.array(t).reshape((6,4))
#print(t, val, "# val")
print("-" * 40)
d = pd.DataFrame(val, index = ind, columns = col)
print(d, "# d")
print("-" * 40)
# with axis = 1, "by" names row labels and the columns are reordered
print(d.sort_values(by = "n", axis = 1), "# d.sort_values a row:n")
print("-" * 40)
print(d.sort_values(by = ['n', 'm'], axis = 1), "# d.sort_values rows:n,m")
print("-" * 40)
Output:
----------------------------------------
d b a c
n 3 0 3 3
i 2 1 0 3
m 3 3 2 0
l 2 2 1 0
j 0 1 0 1
k 2 1 1 2 # d
----------------------------------------
b d a c
n 0 3 3 3
i 1 2 0 3
m 3 3 2 0
l 2 2 1 0
j 1 0 0 1
k 1 2 1 2 # d.sort_values a row:n
----------------------------------------
b c a d
n 0 3 3 3
i 1 3 0 2
m 3 0 2 3
l 2 0 1 2
j 1 1 0 0
k 1 2 1 2 # d.sort_values rows:n,m
----------------------------------------
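The statement d.sort_values(by = ['n', 'm'], axis = 1) orders the columns by the values in row n first. Where row n has equal values, row m is used to break the tie, which gives the final order: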
n 0 3 3 3
i 1 3 0 2
m 3 0 2 3
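To make this reproducible, the sketch below rebuilds rows n, m and i from the output above as a fixed frame (the name frame is just for illustration) and repeats the horizontal sort.

```python
import pandas as pd

# Rows n, m and i copied from the printed DataFrame d above.
frame = pd.DataFrame([[3, 0, 3, 3], [3, 3, 2, 0], [2, 1, 0, 3]],
                     index=list("nmi"), columns=list("dbac"))

# Order columns by row n only; columns d, a, c tie at 3, so their relative order is not asserted here.
by_n = frame.sort_values(by="n", axis=1)
# Order columns by row n, breaking ties with row m: b, c, a, d.
by_n_then_m = frame.sort_values(by=["n", "m"], axis=1)

print(by_n)
print(by_n_then_m)
```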
9.5 Axis Indexes with Duplicate Labels
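Axis indexes with duplicate labels, that is, several rows sharing the same label, were already mentioned and demonstrated in the discussion of the append function, so they are not repeated in detail here.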
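For readers who have not seen that earlier discussion, here is a minimal reminder with a made-up Series named ser. It shows how to test whether an index has repeated labels, and how selection behaves differently for a repeated label and a unique one.

```python
import pandas as pd

ser = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

has_unique_labels = ser.index.is_unique  # False, because "a" and "b" are repeated
repeated = ser["a"]   # a repeated label selects a Series (values 0 and 1)
single = ser["c"]     # a unique label selects a single value (4)

print(has_unique_labels)
print(repeated)
print(single)
```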