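Python For Data Analysis: Chapter 7, Section 1

Chapter 7 of Python For Data Analysis mainly covers handling missing data, duplicate data, and strings, along with the other preprocessing needed before data analysis. The first section of the chapter discusses how to handle missing data.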

14.1 Missing Data

pandas marks missing values in a field uniformly as NaN (Not a Number). Use isnull to check whether a Series or DataFrame contains NaN, and dropna to remove them.

import pandas as pd
import numpy as np
s = pd.Series([1, 3, np.nan, 5])
print(s, "# s")
# isnull returns a boolean Series that is True where the value is NaN
print(s.isnull(), "# s.isnull()")

Output:

0    1.0
1    3.0
2    NaN
3    5.0
dtype: float64 # s
0    False
1    False
2     True
3    False
dtype: bool # s.isnull()
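
Because isnull returns a boolean Series, it can also answer "how many values are missing?" without printing the whole mask. The following is a minimal sketch on the same data; the names mask, n_missing, and has_missing are chosen here for illustration and do not come from the book.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, np.nan, 5])

mask = s.isnull()               # boolean Series, True where the value is NaN
n_missing = int(mask.sum())     # True counts as 1, so the sum is the NaN count
has_missing = bool(mask.any())  # True if at least one value is NaN

print(n_missing, has_missing)
```

Summing the mask is usually enough to decide whether dropping or filling is worth the effort.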

The dropna function removes NaN values from the data, but it returns a new object; to change the data itself, pass inplace=True.

import pandas as pd
import numpy as np
s = pd.Series([1, 3, np.nan, 5])
print("-" * 40)
print(s, "# s")
print("-" * 40)
print(s.isnull(), "# s.isnull()")
print("-" * 40)
# dropna returns a new Series; s is left untouched
print(s.dropna(), "# s.dropna()")
print("-" * 40)
print(s, "# s")
print("-" * 40)
# inplace=True modifies s itself and returns None
s.dropna(inplace=True)
print(s, "# s.dropna(inplace=True)")
print("-" * 40)

Output:

----------------------------------------
0    1.0
1    3.0
2    NaN
3    5.0
dtype: float64 # s
----------------------------------------
0    False
1    False
2     True
3    False
dtype: bool # s.isnull()
----------------------------------------
0    1.0
1    3.0
3    5.0
dtype: float64 # s.dropna()
----------------------------------------
0    1.0
1    3.0
2    NaN
3    5.0
dtype: float64 # s
----------------------------------------
0    1.0
1    3.0
3    5.0
dtype: float64 # s.dropna(inplace=True)
----------------------------------------
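
As an alternative to inplace=True, the result of dropna can simply be assigned back to the same name. The sketch below illustrates that idiom; the variable names cleaned and renumbered are made up for this example, and the reset_index step is optional.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, np.nan, 5])

cleaned = s.dropna()            # new Series; s still contains the NaN
still_has_nan = bool(s.isnull().any())

# The surviving rows keep their original labels (0, 1, 3).
# reset_index(drop=True) renumbers them from 0 if that is preferred.
renumbered = cleaned.reset_index(drop=True)

s = s.dropna()                  # rebinding the name instead of inplace=True
print(s.index.tolist(), renumbered.index.tolist())
```

Rebinding keeps each step an ordinary expression, which tends to be easier to chain with other calls.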

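The same holds for a DataFrame: by default (axis=0) every row that contains a NaN is dropped, and with axis=1 every column that contains a NaN is dropped. Without inplace=True the result is a copy with those rows or columns removed, and the original data is unaffected.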

import pandas as pd
import numpy as np
val = np.arange(24).reshape((6, 4))
ind = list("opqrst")
col = list("abcd")
df1 = pd.DataFrame(val, columns=col, index=ind)
# Assign through .loc; chained indexing such as df1['a']['q'] = ...
# may write to a temporary object instead of df1.
df1.loc['q', 'a'] = np.nan
df1.loc['r', 'c'] = np.nan
df1.loc['s', 'b'] = np.nan
print("-" * 40)
print(df1, "# df1")
print("-" * 40)
print(df1.dropna(), "# axis = 0")        # drop rows that contain NaN
print("-" * 40)
print(df1.dropna(axis=1), "# axis = 1")  # drop columns that contain NaN
df1.dropna(inplace=True)
print("-" * 40)
print(df1, "# inplace = True")
print("-" * 40)

Output:

----------------------------------------
      a     b     c   d
o   0.0   1.0   2.0   3
p   4.0   5.0   6.0   7
q   NaN   9.0  10.0  11
r  12.0  13.0   NaN  15
s  16.0   NaN  18.0  19
t  20.0  21.0  22.0  23 # df1
----------------------------------------
      a     b     c   d
o   0.0   1.0   2.0   3
p   4.0   5.0   6.0   7
t  20.0  21.0  22.0  23 # axis = 0
----------------------------------------
    d
o   3
p   7
q  11
r  15
s  19
t  23 # axis = 1
----------------------------------------
      a     b     c   d
o   0.0   1.0   2.0   3
p   4.0   5.0   6.0   7
t  20.0  21.0  22.0  23 # inplace = True
----------------------------------------
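
The effect of the axis argument can also be checked programmatically rather than by reading the printed tables. This is a small sketch under the assumption that the frame is built from floats from the start (so no dtype change is needed when NaN is assigned); the names rows_kept and cols_kept are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)),
                  index=list("opqrst"), columns=list("abcd"))
df.loc["q", "a"] = np.nan
df.loc["r", "c"] = np.nan
df.loc["s", "b"] = np.nan

rows_kept = df.dropna()        # axis=0 (default): rows q, r, s are dropped
cols_kept = df.dropna(axis=1)  # axis=1: columns a, b, c are dropped

print(rows_kept.index.tolist())
print(cols_kept.columns.tolist())
print(int(df.isnull().sum().sum()))  # df itself still holds its 3 NaN values
```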

14.2 Filtering NaN Values

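When the NaN values in the data are of no use, there are several ways to get a copy of the original data without them. For example, the dropna function from the previous section filters NaN out of a DataFrame, and indexing a Series with the negated boolean array returned by isnull filters NaN out of a Series. The dropna call in the previous section removes an entire row or column as soon as it contains a single NaN, which is rather heavy-handed and deletes too much. The how parameter accepts another value, 'all' (the default is 'any', meaning a row or column is dropped if it contains one or more NaN); with how='all', a row or column is dropped only when every value in it is NaN.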

import pandas as pd
import numpy as np
val = np.arange(24).reshape((6, 4))
ind = list("opqrst")
col = list("abcd")
df1 = pd.DataFrame(val, columns=col, index=ind)
# Assign through .loc rather than chained indexing (df1['a']['q'] = ...)
df1.loc['q', 'a'] = np.nan
df1.loc['r', 'c'] = np.nan
df1.loc['s', 'b'] = np.nan
df1.loc['o', 'b'] = np.nan
df1.loc['t', :] = np.nan   # row t is entirely NaN
print("-" * 40)
print(df1, "# df1")
print("-" * 40)
print(df1.dropna(), "# how = 'any' by default")
print("-" * 40)
print(df1.dropna(how='all'), "# how = 'all' axis = 0")
df1.loc[:, 'e'] = np.nan   # add a column e that is entirely NaN
print("-" * 40)
print(df1, "# df1 add col 'e' all NaN")
print("-" * 40)
print(df1.dropna(how='all', axis=1), "# how = 'all' axis = 1")
print("-" * 40)

Output:

----------------------------------------
      a     b     c     d
o   0.0   NaN   2.0   3.0
p   4.0   5.0   6.0   7.0
q   NaN   9.0  10.0  11.0
r  12.0  13.0   NaN  15.0
s  16.0   NaN  18.0  19.0
t   NaN   NaN   NaN   NaN # df1
----------------------------------------
     a    b    c    d
p  4.0  5.0  6.0  7.0 # how = 'any' by default
----------------------------------------
      a     b     c     d
o   0.0   NaN   2.0   3.0
p   4.0   5.0   6.0   7.0
q   NaN   9.0  10.0  11.0
r  12.0  13.0   NaN  15.0
s  16.0   NaN  18.0  19.0 # how = 'all' axis = 0
----------------------------------------
      a     b     c     d   e
o   0.0   NaN   2.0   3.0 NaN
p   4.0   5.0   6.0   7.0 NaN
q   NaN   9.0  10.0  11.0 NaN
r  12.0  13.0   NaN  15.0 NaN
s  16.0   NaN  18.0  19.0 NaN
t   NaN   NaN   NaN   NaN NaN # df1 add col 'e' all NaN
----------------------------------------
      a     b     c     d
o   0.0   NaN   2.0   3.0
p   4.0   5.0   6.0   7.0
q   NaN   9.0  10.0  11.0
r  12.0  13.0   NaN  15.0
s  16.0   NaN  18.0  19.0
t   NaN   NaN   NaN   NaN # how = 'all' axis = 1
----------------------------------------
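
The Series filtering by boolean array mentioned at the start of this section can be sketched as follows. The names by_mask and by_dropna are made up for this illustration; notnull is the built-in complement of isnull.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, np.nan, 5])

# Keep only the positions where isnull is False.
by_mask = s[~s.isnull()]   # equivalent to s[s.notnull()]
by_dropna = s.dropna()

print(by_mask.equals(by_dropna))
```

Both expressions return a copy, so s keeps its NaN either way; the mask form is handy when the condition needs to be combined with other filters.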

14.3 Handling NaN Data

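The fillna function fills in NaN values. It takes parameters such as value, method, and axis (value and method cannot be used together). Their meaning was covered in earlier chapters; see also the official fillna reference.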

import pandas as pd
import numpy as np
val = np.arange(24).reshape((6, 4))
ind = list("opqrst")
col = list("abcd")
df1 = pd.DataFrame(val, columns=col, index=ind)
# Assign through .loc rather than chained indexing (df1['a']['q'] = ...)
df1.loc['q', 'a'] = np.nan
df1.loc['r', 'c'] = np.nan
df1.loc['s', 'b'] = np.nan
df1.loc['o', 'b'] = np.nan
df1.loc['t', 'd'] = np.nan
print("-" * 40)
print(df1, "# df1")
print("-" * 40)
print(df1.fillna(value=100), "# value")               # one constant for every NaN
print("-" * 40)
print(df1.fillna(method='bfill'), "# method = 'bfill'")  # copy the next valid value upward
print("-" * 40)
df1.fillna(method='ffill', inplace=True)              # copy the previous valid value downward
print(df1, "# method = 'ffill', inplace = True")
print("-" * 40)

Output:

----------------------------------------
      a     b     c     d
o   0.0   NaN   2.0   3.0
p   4.0   5.0   6.0   7.0
q   NaN   9.0  10.0  11.0
r  12.0  13.0   NaN  15.0
s  16.0   NaN  18.0  19.0
t  20.0  21.0  22.0   NaN # df1
----------------------------------------
       a      b      c      d
o    0.0  100.0    2.0    3.0
p    4.0    5.0    6.0    7.0
q  100.0    9.0   10.0   11.0
r   12.0   13.0  100.0   15.0
s   16.0  100.0   18.0   19.0
t   20.0   21.0   22.0  100.0 # value
----------------------------------------
      a     b     c     d
o   0.0   5.0   2.0   3.0
p   4.0   5.0   6.0   7.0
q  12.0   9.0  10.0  11.0
r  12.0  13.0  18.0  15.0
s  16.0  21.0  18.0  19.0
t  20.0  21.0  22.0   NaN # method = 'bfill'
----------------------------------------
      a     b     c     d
o   0.0   NaN   2.0   3.0
p   4.0   5.0   6.0   7.0
q   4.0   9.0  10.0  11.0
r  12.0  13.0  10.0  15.0
s  16.0  13.0  18.0  19.0
t  20.0  21.0  22.0  19.0 # method = 'ffill', inplace = True
----------------------------------------
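
Forward and backward filling are also available as the dedicated methods ffill and bfill, and to my knowledge recent pandas releases (2.1 and later) deprecate the method argument of fillna in their favor. The sketch below uses a small made-up frame, not the df1 from the example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0.0, 4.0, np.nan, 12.0],
                   "b": [np.nan, 5.0, 9.0, np.nan]},
                  index=list("opqr"))

forward = df.ffill()   # each NaN takes the last valid value above it
backward = df.bfill()  # each NaN takes the next valid value below it

print(forward)
print(backward)
```

A leading NaN has nothing above it, so ffill leaves it in place, and a trailing NaN likewise survives bfill. This matches the NaN left in column b and column d in the results above.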

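The example above fills every NaN in the DataFrame with the same value. Real applications cannot treat the data so uniformly: if a column holds gender data, for instance, filling it with 100 makes little sense. Instead, pass a dict that gives the fill value for each column (the dict keys must be column names, and axis cannot be 1).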

import pandas as pd
import numpy as np
val = np.arange(24).reshape((6, 4))
ind = list("opqrst")
col = list("abcd")
df1 = pd.DataFrame(val, columns=col, index=ind)
# Assign through .loc rather than chained indexing (df1['a']['q'] = ...)
df1.loc['q', 'a'] = np.nan
df1.loc['r', 'c'] = np.nan
df1.loc['s', 'b'] = np.nan
df1.loc['o', 'b'] = np.nan
df1.loc['t', :] = np.nan
# One fill value per column, keyed by column name
d = {"a": 100, "b": "f", "c": 95.27, "d": -1000}
print("-" * 40)
print(df1, "# df1")
print("-" * 40)
print(df1.fillna(d), "# using dict for each col")
print("-" * 40)
df1.loc[:, 'e'] = np.nan
d.update({"e": "hello world"})
print(df1, "# df1 add col 'e' all NaN")
print("-" * 40)
print(df1.fillna(d), "# df1 all filled")
print("-" * 40)
print(df1.fillna(d, limit=2), "# limit = 2")  # fill at most 2 NaN per column
print("-" * 40)

Output:

----------------------------------------
      a     b     c     d
o   0.0   NaN   2.0   3.0
p   4.0   5.0   6.0   7.0
q   NaN   9.0  10.0  11.0
r  12.0  13.0   NaN  15.0
s  16.0   NaN  18.0  19.0
t   NaN   NaN   NaN   NaN # df1
----------------------------------------
       a   b      c       d
o    0.0   f   2.00     3.0
p    4.0   5   6.00     7.0
q  100.0   9  10.00    11.0
r   12.0  13  95.27    15.0
s   16.0   f  18.00    19.0
t  100.0   f  95.27 -1000.0 # using dict for each col
----------------------------------------
      a     b     c     d   e
o   0.0   NaN   2.0   3.0 NaN
p   4.0   5.0   6.0   7.0 NaN
q   NaN   9.0  10.0  11.0 NaN
r  12.0  13.0   NaN  15.0 NaN
s  16.0   NaN  18.0  19.0 NaN
t   NaN   NaN   NaN   NaN NaN # df1 add col 'e' all NaN
----------------------------------------
       a   b      c       d            e
o    0.0   f   2.00     3.0  hello world
p    4.0   5   6.00     7.0  hello world
q  100.0   9  10.00    11.0  hello world
r   12.0  13  95.27    15.0  hello world
s   16.0   f  18.00    19.0  hello world
t  100.0   f  95.27 -1000.0  hello world # df1 all filled
----------------------------------------
       a    b      c       d            e
o    0.0    f   2.00     3.0  hello world
p    4.0    5   6.00     7.0  hello world
q  100.0    9  10.00    11.0          NaN
r   12.0   13  95.27    15.0          NaN
s   16.0    f  18.00    19.0          NaN
t  100.0  NaN  95.27 -1000.0          NaN # limit = 2
----------------------------------------

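The limit parameter caps how many NaN values are filled in each column instead of filling the whole column. When filling with a value, as here, the cap applies to the total number of NaN filled per column, so with limit = 2 the third NaN in column b and all but the first two in column e stay NaN. When filling with a method such as ffill, limit is instead the maximum number of consecutive NaN values filled.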
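
To see how limit behaves when filling with a value, here is a minimal sketch on a small made-up frame with numeric fill values only. The column names and fill values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 1.0, np.nan, np.nan],
                   "e": [np.nan, np.nan, np.nan, np.nan]})

# At most 2 NaN are filled in each column, counted from the top.
filled = df.fillna({"a": 100.0, "e": -1.0}, limit=2)

print(filled)
print(filled.isnull().sum().tolist())  # NaN remaining per column
```

In column a the two filled positions are not adjacent, which shows that the cap counts filled entries per column rather than consecutive runs.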