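Python For Data Analysis: Chapter 7, Section 1
Chapter 7 of "Python For Data Analysis" mainly covers handling missing data, duplicate data, and strings, along with the preprocessing needed before data analysis. Section 1 of the chapter discusses how to handle missing data.
14.1 Missing Data
pandas marks missing values in a field uniformly as NaN (Not a Number). Use isnull to test whether the data in a Series or DataFrame contains NaN, and dropna to remove it.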
import pandas as pd
import numpy as np
s = pd.Series([1,3, np.nan, 5])
print(s, "# s")
print(s.isnull(), "# s.isnull()")
Output:
0 1.0
1 3.0
2 NaN
3 5.0
dtype: float64 # s
0 False
1 False
2 True
3 False
dtype: bool # s.isnull()
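Because isnull returns a boolean Series, it can also be summed to count the missing entries. The following is a minimal sketch of that idea; the names n_missing and has_missing are illustrative and do not come from the book.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, np.nan, 5])

# True counts as 1 when summed, so this is the number of NaN entries.
n_missing = int(s.isnull().sum())

# any() reports whether at least one entry is NaN.
has_missing = bool(s.isnull().any())

print(n_missing, has_missing)
```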
- The dropna function removes NaN entries from the data. It returns a new object by default; to modify the data itself, pass inplace=True.
import pandas as pd
import numpy as np
s = pd.Series([1,3, np.nan, 5])
print("-" * 40)
print(s, "# s")
print("-" * 40)
print(s.isnull(), "# s.isnull()")
print("-" * 40)
print(s.dropna(), "# s.dropna()")
print("-" * 40)
print(s, "# s")
print("-" * 40)
s.dropna(inplace=True)
print(s, "# s.dropna(inplace=True)")
print("-" * 40)
Output:
----------------------------------------
0 1.0
1 3.0
2 NaN
3 5.0
dtype: float64 # s
----------------------------------------
0 False
1 False
2 True
3 False
dtype: bool # s.isnull()
----------------------------------------
0 1.0
1 3.0
3 5.0
dtype: float64 # s.dropna()
----------------------------------------
0 1.0
1 3.0
2 NaN
3 5.0
dtype: float64 # s
----------------------------------------
0 1.0
1 3.0
3 5.0
dtype: float64 # s.dropna(inplace=True)
----------------------------------------
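The same applies to a DataFrame: if a row or column contains NaN, the whole row (axis=0, the default) or the whole column (axis=1) is removed. Without inplace=True, the result is a copy of the data with those rows or columns removed, and the original data is unaffected.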
import pandas as pd
import numpy as np
val = np.arange(24).reshape((6,4))
ind = list("opqrst")
col = list("abcd")
df1 = pd.DataFrame(val, columns = col, index = ind)
df1.loc['q','a'] = np.nan  # .loc instead of chained indexing df1['a']['q'], which may not modify df1
df1.loc['r','c'] = np.nan
df1.loc['s','b'] = np.nan
print("-" * 40)
print(df1, "# df1")
print("-" * 40)
print(df1.dropna(), "# axis = 0")
print("-" * 40)
print(df1.dropna(axis = 1), "# axis = 1")
df1.dropna(inplace= True)
print("-" * 40)
print(df1, "# inplace = True")
print("-" * 40)
Output:
----------------------------------------
a b c d
o 0.0 1.0 2.0 3
p 4.0 5.0 6.0 7
q NaN 9.0 10.0 11
r 12.0 13.0 NaN 15
s 16.0 NaN 18.0 19
t 20.0 21.0 22.0 23 # df1
----------------------------------------
a b c d
o 0.0 1.0 2.0 3
p 4.0 5.0 6.0 7
t 20.0 21.0 22.0 23 # axis = 0
----------------------------------------
d
o 3
p 7
q 11
r 15
s 19
t 23 # axis = 1
----------------------------------------
a b c d
o 0.0 1.0 2.0 3
p 4.0 5.0 6.0 7
t 20.0 21.0 22.0 23 # inplace = True
----------------------------------------
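As a small sketch of the copy semantics described above, the code below checks that dropna leaves the original frame alone and that reassigning the result is an alternative to inplace=True. The float dtype in np.arange is my own choice, made so that NaN can be assigned without changing the column type; it is not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)),
                  index=list("opqrst"), columns=list("abcd"))
df.loc["q", "a"] = np.nan
df.loc["r", "c"] = np.nan
df.loc["s", "b"] = np.nan

rows_kept = df.dropna()         # axis=0: drop rows that contain NaN
cols_kept = df.dropna(axis=1)   # axis=1: drop columns that contain NaN

# Both calls returned copies, so df still holds its three NaN entries.
n_nan_after = int(df.isnull().sum().sum())

# Reassigning the result has the same effect as inplace=True.
df = df.dropna()
print(rows_kept.index.tolist(), cols_kept.columns.tolist(), n_nan_after)
```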
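14.2 Filtering NaN Values
When the NaN entries in the data are of no use, there are several ways to obtain a copy of the original data without them. For example, the dropna function shown earlier filters NaN out of a DataFrame, and for a Series the boolean array returned by isnull can be negated and used as a mask to filter NaN out. The dropna call in the previous section removes an entire row or column as soon as it contains a NaN, which is heavy-handed and removes too much. The how parameter accepts another value, 'all' (the default is 'any', meaning a row or column is dropped if it has one or more NaN), with which a row or column is dropped only when every value in it is NaN.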
import pandas as pd
import numpy as np
val = np.arange(24).reshape((6,4))
ind = list("opqrst")
col = list("abcd")
df1 = pd.DataFrame(val, columns = col, index = ind)
df1.loc['q','a'] = np.nan  # .loc instead of chained indexing df1['a']['q']
df1.loc['r','c'] = np.nan
df1.loc['s','b'] = np.nan
df1.loc['o','b'] = np.nan
df1.loc['t',:] = np.nan
print("-" * 40)
print(df1, "# df1")
print("-" * 40)
print(df1.dropna(), "# how = 'any' by default")
print("-" * 40)
print(df1.dropna(how = 'all'), "# how = 'all' axis = 0")
df1.loc[:,'e'] = np.nan
print("-" * 40)
print(df1, "# df1 add col 'e' all NaN")
print("-" * 40)
print(df1.dropna(how = 'all', axis = 1), "# how = 'all' axis = 1")
print("-" * 40)
Output:
----------------------------------------
a b c d
o 0.0 NaN 2.0 3.0
p 4.0 5.0 6.0 7.0
q NaN 9.0 10.0 11.0
r 12.0 13.0 NaN 15.0
s 16.0 NaN 18.0 19.0
t NaN NaN NaN NaN # df1
----------------------------------------
a b c d
p 4.0 5.0 6.0 7.0 # how = 'any' by default
----------------------------------------
a b c d
o 0.0 NaN 2.0 3.0
p 4.0 5.0 6.0 7.0
q NaN 9.0 10.0 11.0
r 12.0 13.0 NaN 15.0
s 16.0 NaN 18.0 19.0 # how = 'all' axis = 0
----------------------------------------
a b c d e
o 0.0 NaN 2.0 3.0 NaN
p 4.0 5.0 6.0 7.0 NaN
q NaN 9.0 10.0 11.0 NaN
r 12.0 13.0 NaN 15.0 NaN
s 16.0 NaN 18.0 19.0 NaN
t NaN NaN NaN NaN NaN # df1 add col 'e' all NaN
----------------------------------------
a b c d
o 0.0 NaN 2.0 3.0
p 4.0 5.0 6.0 7.0
q NaN 9.0 10.0 11.0
r 12.0 13.0 NaN 15.0
s 16.0 NaN 18.0 19.0
t NaN NaN NaN NaN # how = 'all' axis = 1
----------------------------------------
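The boolean-mask approach for a Series mentioned at the start of this section has no example of its own, so here is a minimal sketch. The variable names filtered and same are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, np.nan, 5])

# Keep the entries where isnull() is False; ~ negates the boolean mask.
filtered = s[~s.isnull()]

# notnull() is the direct inverse of isnull(), so no negation is needed.
same = s[s.notnull()]

# Both give the same result as dropna().
print(filtered.equals(s.dropna()), same.equals(filtered))
```

For a Series the mask and dropna are interchangeable; the mask form is mainly useful when the same condition has to be combined with other filters.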
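14.3 Handling NaN Data
The fillna function fills NaN entries. It takes parameters such as value, method, and axis (value and method cannot be used together). Their meaning was covered in earlier chapters; the official fillna reference also describes them.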
import pandas as pd
import numpy as np
val = np.arange(24).reshape((6,4))
ind = list("opqrst")
col = list("abcd")
df1 = pd.DataFrame(val, columns = col, index = ind)
df1.loc['q','a'] = np.nan  # .loc instead of chained indexing df1['a']['q']
df1.loc['r','c'] = np.nan
df1.loc['s','b'] = np.nan
df1.loc['o','b'] = np.nan
df1.loc['t','d'] = np.nan
print("-" * 40)
print(df1, "# df1")
print("-" * 40)
print(df1.fillna(value = 100), "# value")
print("-" * 40)
print(df1.fillna(method = 'bfill'), "# method = 'bfill'")
print("-" * 40)
df1.fillna(method = 'ffill', inplace = True)
print(df1, "# method = 'ffill', inplace = True")
print("-" * 40)
Output:
----------------------------------------
a b c d
o 0.0 NaN 2.0 3.0
p 4.0 5.0 6.0 7.0
q NaN 9.0 10.0 11.0
r 12.0 13.0 NaN 15.0
s 16.0 NaN 18.0 19.0
t 20.0 21.0 22.0 NaN # df1
----------------------------------------
a b c d
o 0.0 100.0 2.0 3.0
p 4.0 5.0 6.0 7.0
q 100.0 9.0 10.0 11.0
r 12.0 13.0 100.0 15.0
s 16.0 100.0 18.0 19.0
t 20.0 21.0 22.0 100.0 # value
----------------------------------------
a b c d
o 0.0 5.0 2.0 3.0
p 4.0 5.0 6.0 7.0
q 12.0 9.0 10.0 11.0
r 12.0 13.0 18.0 15.0
s 16.0 21.0 18.0 19.0
t 20.0 21.0 22.0 NaN # method = 'bfill'
----------------------------------------
a b c d
o 0.0 NaN 2.0 3.0
p 4.0 5.0 6.0 7.0
q 4.0 9.0 10.0 11.0
r 12.0 13.0 10.0 15.0
s 16.0 13.0 18.0 19.0
t 20.0 21.0 22.0 19.0 # method = 'ffill', inplace = True
----------------------------------------
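DataFrame also provides ffill() and bfill() methods that correspond to fillna(method = 'ffill') and fillna(method = 'bfill'). To my knowledge, recent pandas releases deprecate the method argument in favor of them, so treat that version detail as an assumption to check against your installed version. The small frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0.0, np.nan, 12.0],
                   "b": [np.nan, 5.0, np.nan]},
                  index=list("opq"))

forward = df.ffill()    # carry the last valid value downwards
backward = df.bfill()   # pull the next valid value upwards

print(forward)
print(backward)
```

A leading NaN has nothing above it to copy under forward fill, and a trailing NaN has nothing below it under backward fill, which is why some NaN entries survive in the outputs above.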
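The fillna(value = 100) call in the example above fills every NaN in the DataFrame with one specified value, wherever it occurs. Real applications cannot treat the data so indiscriminately: if a column holds gender data, filling it with 100 is hardly appropriate. Instead, a dictionary can supply a fill value per column (only column names are supported as dictionary keys, and axis cannot be 1).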
import pandas as pd
import numpy as np
val = np.arange(24).reshape((6,4))
ind = list("opqrst")
col = list("abcd")
df1 = pd.DataFrame(val, columns = col, index = ind)
df1.loc['q','a'] = np.nan  # .loc instead of chained indexing df1['a']['q']
df1.loc['r','c'] = np.nan
df1.loc['s','b'] = np.nan
df1.loc['o','b'] = np.nan
df1.loc['t',:] = np.nan
d = {"a" : 100, "b":"f", "c": 95.27, "d" : -1000}
print("-" * 40)
print(df1, "# df1")
print("-" * 40)
print(df1.fillna(d), "# using dict for each col")
print("-" * 40)
df1.loc[:,'e'] = np.nan
d.update({"e" : "hello world"})
print(df1, "# df1 add col 'e' all NaN")
print("-" * 40)
print(df1.fillna(d), "# df1 all filled")
print("-" * 40)
print(df1.fillna(d, limit = 2), "# limit = 2")
print("-" * 40)
Output:
----------------------------------------
a b c d
o 0.0 NaN 2.0 3.0
p 4.0 5.0 6.0 7.0
q NaN 9.0 10.0 11.0
r 12.0 13.0 NaN 15.0
s 16.0 NaN 18.0 19.0
t NaN NaN NaN NaN # df1
----------------------------------------
a b c d
o 0.0 f 2.00 3.0
p 4.0 5 6.00 7.0
q 100.0 9 10.00 11.0
r 12.0 13 95.27 15.0
s 16.0 f 18.00 19.0
t 100.0 f 95.27 -1000.0 # using dict for each col
----------------------------------------
a b c d e
o 0.0 NaN 2.0 3.0 NaN
p 4.0 5.0 6.0 7.0 NaN
q NaN 9.0 10.0 11.0 NaN
r 12.0 13.0 NaN 15.0 NaN
s 16.0 NaN 18.0 19.0 NaN
t NaN NaN NaN NaN NaN # df1 add col 'e' all NaN
----------------------------------------
a b c d e
o 0.0 f 2.00 3.0 hello world
p 4.0 5 6.00 7.0 hello world
q 100.0 9 10.00 11.0 hello world
r 12.0 13 95.27 15.0 hello world
s 16.0 f 18.00 19.0 hello world
t 100.0 f 95.27 -1000.0 hello world # df1 all filled
----------------------------------------
a b c d e
o 0.0 f 2.00 3.0 hello world
p 4.0 5 6.00 7.0 hello world
q 100.0 9 10.00 11.0 NaN
r 12.0 13 95.27 15.0 NaN
s 16.0 f 18.00 19.0 NaN
t 100.0 NaN 95.27 -1000.0 NaN # limit = 2
----------------------------------------
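The limit parameter caps how many NaN entries in each column are filled with the given value (here at most 2 per column, counted from the top), instead of filling the whole column.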
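The effect of limit with a fill value can be seen on a single Series. This is a minimal sketch with made-up data; it relies on the behavior visible in column b of the output above, where the first two NaN entries are filled even though they are not adjacent.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# With a fill value (no method), limit caps the total number of NaN filled.
filled = s.fillna(-1.0, limit=2)

print(filled)
```

The NaN at positions 0 and 2 are filled, and the one at position 3 is left as NaN because the limit of 2 has been reached.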