Python For Data Analysis-八章第二节

《Python For Data Analysis》的第八章主要讨论的是数据的组合、合并和变形等问题。而第八章的第二节主要讨论的是合并数据问题, 这个有点儿像数据库的表的连接问题: 内连接、外连接、左连接、右连接、全连接等。主要涉及两个函数:merge和concat, merge函数依据key(两个dataframe想重合的列或关联的列)实现类似数据库的join操作;concat函数类似于简单的拼接操作。 为了更好的理解合并数据,这里先设计两个DataFrame:假设一个卖水果的店主手上有两张表,一张表记录供应商和联系电话的供应商信息表supplierInfo, 另一张表记录正在卖的水果和该水果的供应商的热卖水果表saleInfo。如何构建两个dataframe来存储这样的两张表呢?

import pandas as pd
import numpy as np

print "1rd DataFrame : Supplier Info 600 records"
print "-" * 40
sid = pd.Series(np.arange(200, 800))
tel = pd.Series(np.random.randint(13300000000, 13700000000, size = 600))
val = {"supplier" : sid, "Telephone" : tel}

supplierInfo = pd.DataFrame(val)
print supplierInfo[:10], "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
print "\n\n\n"
print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,watermelon,banana,orange,apple,lemon,cherry,pear,coconut,peach,strawberry".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 3,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = sid.sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame({"fruit":fid,"supplier":sup})
print saleInfo[:10], "# saleInfo"
saleInfo.to_csv("saleInfo.csv")

执行结果:

1rd DataFrame : Supplier Info 600 records
----------------------------------------
     Telephone  supplier
0  13421972855       200
1  13427373491       201
2  13639747726       202
3  13563018819       203
4  13463388117       204
5  13397690432       205
6  13428386545       206
7  13516691855       207
8  13514679589       208
9  13583503843       209 # supplierInfo




2rd DataFrame : Sales price info
----------------------------------------
        fruit  supplier
0     coconut       396
1       lemon       718
2       apple       268
3       lemon       703
4       peach       275
5       lemon       327
6       peach       448
7  strawberry       373
8       apple       574
9      banana       329 # saleInfo
----------------------------------------

由于页面限制,每张表只打印前10条数据。第一张表supplierInfo记录者600个供应商的电话和id编号,共600条记录。

第二张表saleInfo记录正在热卖的水果和供应商信息,共33条记录。

18.1 merge实现内连接inner

merge函数合并两个dataframe的时候默认是依据两个dataframe的共有的列,或者指定每个dataframe关联的列,默认是内连接两个dataframe即求两个dataframe的交集(intersection)。利用上边的两张表,提出一个问题,如何得到一个新的dataframe,即含热卖水果、该水果的供应商还有供应的电话呢?可以直接使用默认的merge的功能,即依据共同列和内连接。供应商表supplierInfo的供应商有600个,而热卖水果表saleInfo里只有33条记录,且有可能一个供应商提供多个水果都热卖,利用merge可以基于两表共有列supplier来提取供应商表里供应商的电话信息。

import pandas as pd
import numpy as np

print "1rd DataFrame : Supplier Info 600 records"
print "-" * 40
sid = pd.Series(np.arange(200, 800))
tel = pd.Series(np.random.randint(13300000000, 13700000000, size = 600))
val = {"supplier" : sid, "Telephone" : tel}

supplierInfo = pd.DataFrame(val)
print supplierInfo[:10], "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
print "\n\n\n"
print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,watermelon,banana,orange,apple,lemon,cherry,pear,coconut,peach,strawberry".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 3,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = sid.sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame({"fruit":fid,"supplier":sup})
print saleInfo[:10], "# saleInfo"
saleInfo.to_csv("saleInfo.csv")
print "-" * 40
inner = pd.merge(saleInfo,supplierInfo)
print inner, "# merge by inner default"
print "-" * 40

执行结果:

1rd DataFrame : Supplier Info 600 records
----------------------------------------
     Telephone  supplier
0  13397314405       200
1  13640069730       201
2  13452263985       202
3  13365776382       203
4  13372081583       204
5  13362805386       205
6  13691396902       206
7  13472919613       207
8  13630580967       208
9  13572586629       209 # supplierInfo




2rd DataFrame : Sales price info
----------------------------------------
        fruit  supplier
0      cherry       744
1       peach       530
2  watermelon       720
3      banana       753
4  watermelon       362
5       lemon       236
6        pear       536
7  watermelon       682
8      cherry       539
9      cherry       378 # saleInfo
----------------------------------------
         fruit  supplier    Telephone
0       cherry       744  13512653004
1        peach       530  13451525038
2   watermelon       720  13506123412
3       banana       753  13585478535
4   watermelon       362  13602478941
5        lemon       236  13448737006
6         pear       536  13615857657
7   watermelon       682  13305228132
8       cherry       539  13320374321
9       cherry       378  13617634073
10       peach       696  13555357057
11        pear       230  13547184365
12  watermelon       365  13439731283
13        pear       533  13640788128
14   pineapple       791  13609052279
15       peach       520  13449112307
16       lemon       437  13366315828
17      cherry       524  13465646212
18   pineapple       466  13370749700
19       apple       380  13687984821
20        pear       434  13334299009
21  strawberry       367  13489740422
22  watermelon       644  13614544558
23   pineapple       491  13419549293
24   pineapple       675  13381339073
25       apple       366  13408400428
26        pear       444  13551163610
27        pear       507  13588231571
28  watermelon       702  13427105077
29       apple       505  13560167229
30      banana       725  13622088077
31  strawberry       785  13582343547
32      cherry       582  13574155943 # merge by inner default
----------------------------------------

程序运行无误的原因是两个表都有supplier共有列且名字一致,但是如果两个表的关联列的名字不一致,可以使用merge函数的left_onright_on参数来指定关联列。

import pandas as pd
import numpy as np

print "1rd DataFrame : Supplier Info 600 records"
print "-" * 40
sid = pd.Series(np.arange(200, 800))
tel = pd.Series(np.random.randint(13300000000, 13700000000, size = 600))
val = {"supplier" : sid, "Telephone" : tel}

supplierInfo = pd.DataFrame(val)
print supplierInfo[:10], "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
print "\n\n\n"
print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,watermelon,banana,orange,apple,lemon,cherry,pear,coconut,peach,strawberry".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 3,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = sid.sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame({"fruit":fid,"sup":sup})
print saleInfo[:10], "# saleInfo"
saleInfo.to_csv("saleInfo.csv")
print "-" * 40
inner = pd.merge(saleInfo,supplierInfo, left_on = "sup", right_on = "supplier")
print inner, "# merge by inner default"
print "-" * 40

执行结果:

1rd DataFrame : Supplier Info 600 records
----------------------------------------
     Telephone  supplier
0  13419790746       200
1  13551084291       201
2  13463359987       202
3  13655480653       203
4  13457048691       204
5  13521227468       205
6  13452934190       206
7  13678163518       207
8  13587003496       208
9  13363122472       209 # supplierInfo




2rd DataFrame : Sales price info
----------------------------------------
        fruit  sup
0   pineapple  376
1      cherry  672
2       peach  384
3   pineapple  247
4     coconut  592
5      cherry  448
6       lemon  529
7      banana  615
8       lemon  374
9  strawberry  741 # saleInfo
----------------------------------------
         fruit  sup    Telephone  supplier
0    pineapple  376  13619661261       376
1       cherry  672  13659543874       672
2        peach  384  13586651620       384
3    pineapple  247  13490523210       247
4      coconut  592  13417280009       592
5       cherry  448  13652830401       448
6        lemon  529  13662448776       529
7       banana  615  13679672834       615
8        lemon  374  13349513320       374
9   strawberry  741  13458734929       741
10       lemon  249  13410741189       249
11      cherry  264  13359359933       264
12       lemon  787  13469956160       787
13  strawberry  727  13644866542       727
14     coconut  426  13522184540       426
15      orange  659  13589132728       659
16       peach  402  13538576857       402
17     coconut  235  13374021445       235
18      cherry  202  13463359987       202
19      cherry  764  13544761364       764
20      banana  287  13610190258       287
21      banana  355  13503122927       355
22      orange  455  13586231983       455
23     coconut  319  13463046713       319
24     coconut  796  13501732894       796
25        pear  350  13405649819       350
26      orange  431  13693748090       431
27        pear  250  13318097140       250
28   pineapple  712  13658847104       712
29        pear  224  13636139762       224
30       apple  327  13573669416       327
31      cherry  513  13651095441       513
32      orange  580  13534756074       580 # merge by inner default
----------------------------------------

程序变化之处是两个dataframe的列名字有所不同:

val = {"supplier" : sid, "Telephone" : tel}

supplierInfo = pd.DataFrame(val)

saleInfo = pd.DataFrame({"fruit":fid,"sup":sup})

那么在用merge时由于两个dataframe没有同名的列,需要用left_onright_on参数来指定关联列,即:

inner = pd.merge(saleInfo,supplierInfo, left_on = "sup", right_on = "supplier")

18.2 merge实现外连接outer

merge函数默认是内连接,可以使用how形参来指定非内连接,例如外连接outer,外连接实际类似于数学上的集合的并集。

import pandas as pd
import numpy as np

print "1rd DataFrame : Supplier Info 600 records"
print "-" * 40
sid = pd.Series(np.arange(200, 800))
tel = pd.Series(np.random.randint(13300000000, 13700000000, size = 600))
val = {"supplier" : sid, "Telephone" : tel}

supplierInfo = pd.DataFrame(val)
print supplierInfo[:10], "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
print "\n\n\n"
print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,watermelon,banana,orange,apple,lemon,cherry,pear,coconut,peach,strawberry".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 3,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = sid.sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame({"fruit":fid,"supplier":sup})
print saleInfo[:10], "# saleInfo"
saleInfo.to_csv("saleInfo.csv")
print "-" * 40
outer = pd.merge(saleInfo,supplierInfo, how = "outer")
print outer, "# merge by outer"
print "-" * 40

执行结果:

1rd DataFrame : Supplier Info 600 records
----------------------------------------
     Telephone  supplier
0  13543440652       200
1  13433706661       201
2  13549572735       202
3  13619468030       203
4  13573591940       204
5  13606770156       205
6  13360419930       206
7  13665511872       207
8  13628056017       208
9  13637414407       209 # supplierInfo




2rd DataFrame : Sales price info
----------------------------------------
        fruit  supplier
0      orange       507
1      cherry       372
2  strawberry       514
3       lemon       663
4      orange       727
5   pineapple       479
6  strawberry       315
7       peach       439
8  strawberry       454
9        pear       240 # saleInfo
----------------------------------------
          fruit  supplier    Telephone
0        orange       507  13439505961
1        cherry       372  13303155133
2    strawberry       514  13525833383
3         lemon       663  13590968869
4        orange       727  13417794792
....
28        apple       485  13614700517
29         pear       519  13601530338
..          ...       ...          ...
570         NaN       770  13641022823
571         NaN       771  13602055526
...
598         NaN       798  13610828633
599         NaN       799  13699702573

[600 rows x 3 columns] # merge by outer
----------------------------------------

18.3 merge实现左右连接

为了更好的展示用merge实现左右连接,将两张表的数据缩减一些。

import pandas as pd
import numpy as np

print "1rd DataFrame : Supplier Info 10 records"
print "-" * 40
sid = pd.Series(np.arange(200, 210))
tel = pd.Series(np.random.randint(13300000000, 13700000000, size = 10))
val = {"supplier" : sid, "Telephone" : tel}
supplierInfo = pd.DataFrame(val)
print supplierInfo, "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")

print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,banana,orange".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 2,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = sid.sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame({"fruit":fid,"supplier":sup})
print saleInfo, "# saleInfo"
print "-" * 40
saleInfo.to_csv("saleInfo.csv")

print "\n\n\n"
print "*" * 40
print "saleInfo is left table, supplierInfo is right table"
print "-" * 40
left = pd.merge(saleInfo,supplierInfo, how = "left")
print left, "# merge by left"
print "-" * 40
print "\n"
print "*" * 40
print "saleInfo is left table, supplierInfo is right table"
print "-" * 40
right = pd.merge(saleInfo,supplierInfo, how = "right")
print right, "# merge by right"
print "-" * 40
print "\n"
print "*" * 40
print "supplierInfo is left table, saleInfo is right table"
print "-" * 40
left = pd.merge(supplierInfo,saleInfo, how = "left")
print left, "# merge by left"
print "-" * 40
print "\n"
print "*" * 40
print "supplierInfo is left table, saleInfo is right table"
print "-" * 40
right = pd.merge(supplierInfo,saleInfo, how = "right")
print right, "# merge by right"
print "-" * 40

执行结果:

1rd DataFrame : Supplier Info 10 records
----------------------------------------
     Telephone  supplier
0  13600789375       200
1  13660182483       201
2  13685123194       202
3  13654834211       203
4  13639572499       204
5  13329426145       205
6  13558906512       206
7  13457094656       207
8  13484579614       208
9  13314452594       209 # supplierInfo
2rd DataFrame : Sales price info
----------------------------------------
       fruit  supplier
0     banana       201
1     orange       204
2     banana       203
3     orange       206
4  pineapple       205
5     orange       202 # saleInfo
----------------------------------------




****************************************
saleInfo is left table, supplierInfo is right table
----------------------------------------
       fruit  supplier    Telephone
0     banana       201  13660182483
1     orange       204  13639572499
2     banana       203  13654834211
3     orange       206  13558906512
4  pineapple       205  13329426145
5     orange       202  13685123194 # merge by left
----------------------------------------


****************************************
saleInfo is left table, supplierInfo is right table
----------------------------------------
       fruit  supplier    Telephone
0     banana       201  13660182483
1     orange       204  13639572499
2     banana       203  13654834211
3     orange       206  13558906512
4  pineapple       205  13329426145
5     orange       202  13685123194
6        NaN       200  13600789375
7        NaN       207  13457094656
8        NaN       208  13484579614
9        NaN       209  13314452594 # merge by right
----------------------------------------


****************************************
supplierInfo is left table, saleInfo is right table
----------------------------------------
     Telephone  supplier      fruit
0  13600789375       200        NaN
1  13660182483       201     banana
2  13685123194       202     orange
3  13654834211       203     banana
4  13639572499       204     orange
5  13329426145       205  pineapple
6  13558906512       206     orange
7  13457094656       207        NaN
8  13484579614       208        NaN
9  13314452594       209        NaN # merge by left
----------------------------------------


****************************************
supplierInfo is left table, saleInfo is right table
----------------------------------------
     Telephone  supplier      fruit
0  13660182483       201     banana
1  13685123194       202     orange
2  13654834211       203     banana
3  13639572499       204     orange
4  13329426145       205  pineapple
5  13558906512       206     orange # merge by right
----------------------------------------

1).左连接的意思是以merge函数里的第一个dataframe为基准(saleInfo),如果第二个dataframe和第一个dataframe有共同列(关联列)的数据被合并成一个新dataframe,没有NaN; 语句left = pd.merge(saleInfo,supplierInfo, how = "left")的左表为saleInfo,右表为supplierInfo,函数merge的how指定了左连接,那么左表全输出,右表其他非共同或关联列如果有数据则输出数据,没有用NaN填充。

----------------------------------------
       fruit  supplier    Telephone
0     banana       201  13660182483
1     orange       204  13639572499
2     banana       203  13654834211
3     orange       206  13558906512
4  pineapple       205  13329426145
5     orange       202  13685123194 # merge by left
----------------------------------------

2). 右连接则是以第二个dataframe为基准,列出第一个dataframe里有与第二个dataframe能关联(共同)上的列的数据。 语句right = pd.merge(saleInfo,supplierInfo, how = "right")的左表为saleInfo,右表为supplierInfo,函数merge的how指定了右连接, 那么以’supplierInfo‘为基准并全部输出,左表saleInfo如果有和右表supplierInfo共同列数据则输出其他非共同列数据,没有用NaN填充。

----------------------------------------
       fruit  supplier    Telephone
0     banana       201  13660182483
1     orange       204  13639572499
2     banana       203  13654834211
3     orange       206  13558906512
4  pineapple       205  13329426145
5     orange       202  13685123194
6        NaN       200  13600789375
7        NaN       207  13457094656
8        NaN       208  13484579614
9        NaN       209  13314452594 # merge by right
----------------------------------------

18.4 Merging on Index

"Merging on Index"的意思是可以一个表的某列和另一个表的行索引index作为两个表的关联连接的依据,也可两个表都用index来关联连接。

  • 先演示一个表用index和另一个表用列来关联连接,为此需要对上边的两个表做一下简单的调整,对于记录供应商电话信息的supplierInfo表调整成这样,用供应商的id作为行索引,电话作为值。

import pandas as pd
import numpy as np

print "1rd DataFrame : Supplier Info 10 records"
print "-" * 40
sid = np.arange(200, 210)
tel = np.random.randint(13300000000, 13700000000, size = 10)
supplierInfo = pd.DataFrame(tel, index = sid, columns = ["telephone"])
print supplierInfo, "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
print ""
print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,banana,orange".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 2,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = pd.Series(sid).sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame({"fruit":fid,"supplier":sup})
print saleInfo, "# saleInfo"
print "-" * 40
saleInfo.to_csv("saleInfo.csv")

print "\n\n\n"
print "*" * 40
print "saleInfo is left table, supplierInfo is right table"
print "-" * 40
right = pd.merge(saleInfo,supplierInfo, left_on="supplier", right_index = True, how = "right")
print right, "# merge by key and index , right join"
print "-" * 40
left = pd.merge(saleInfo,supplierInfo, left_on="supplier", right_index = True, how = "left")
print left, "# merge by key and index , left join"
print "-" * 40

执行结果:

1rd DataFrame : Supplier Info 10 records
----------------------------------------
       telephone
200  13414429305
201  13324093133
202  13603734461
203  13691986220
204  13518758252
205  13631127768
206  13535125903
207  13544359176
208  13316553234
209  13455143637 # supplierInfo

2rd DataFrame : Sales price info
----------------------------------------
    fruit  supplier
0  orange       203
1  orange       202
2  banana       200
3  orange       208
4  orange       207
5  orange       209 # saleInfo
----------------------------------------




****************************************
saleInfo is left table, supplierInfo is right table
----------------------------------------
    fruit  supplier    telephone
0  orange       203  13691986220
1  orange       202  13603734461
2  banana       200  13414429305
3  orange       208  13316553234
4  orange       207  13544359176
5  orange       209  13455143637
5     NaN       201  13324093133
5     NaN       204  13518758252
5     NaN       205  13631127768
5     NaN       206  13535125903 # merge by key and index , right join
----------------------------------------
    fruit  supplier    telephone
0  orange       203  13691986220
1  orange       202  13603734461
2  banana       200  13414429305
3  orange       208  13316553234
4  orange       207  13544359176
5  orange       209  13455143637 # merge by key and index , left join
----------------------------------------

由于supplierInfo用供应商的id作为行索引是Index,而saleInfo里的supplier列也是供应商的id,那么如果想匹配出(连接)热卖水果供应商的电话,需要用supplierInfo的index和saleInfo的列进行连接操作即语句right = pd.merge(saleInfo,supplierInfo, left_on="supplier", right_index = True, how = "right")原因是saleInfo是左表,用了supplier列需用形参left_on="supplier"和右表supplierInfo的index进行连接,需要用形参right_index = True来完成右连接或左连接。总之,这条语句的意思就是左表用列、右表用index行索引。

  • 左右表均用index来关联连接。需要使用left_index=True, right_index = True形参。
import pandas as pd
import numpy as np

print "1rd DataFrame : Supplier Info 10 records"
print "-" * 40
sid = np.arange(200, 210)
tel = np.random.randint(13300000000, 13700000000, size = 10)
supplierInfo = pd.DataFrame(tel, index = sid, columns = ["telephone"])
print supplierInfo, "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
print ""
print "2rd DataFrame : Sales price info"
print "-" * 40
fruit = "pineapple,banana,orange".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 2,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = pd.Series(sid).sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame(fid.values, index = sup)
print saleInfo, "# saleInfo"
print "-" * 40
saleInfo.to_csv("saleInfo.csv")

print "\n\n\n"
print "*" * 40
print "saleInfo is left table, supplierInfo is right table"
print "-" * 40
right = pd.merge(saleInfo,supplierInfo, left_index=True, right_index = True, how = "right")
print right, "# merge by both index , right join"
print "-" * 40
left = pd.merge(saleInfo,supplierInfo, left_index=True, right_index = True, how = "left")
print left, "# merge by both index , left join"
print "-" * 40

执行结果:

1rd DataFrame : Supplier Info 10 records
----------------------------------------
       telephone
200  13453805505
201  13461207097
202  13619022642
203  13658147818
204  13611310274
205  13545485123
206  13685924267
207  13562572868
208  13636868229
209  13385445647 # supplierInfo

2rd DataFrame : Sales price info
----------------------------------------
             0
202  pineapple
209  pineapple
200  pineapple
208     banana
201  pineapple
204     orange # saleInfo
----------------------------------------




****************************************
saleInfo is left table, supplierInfo is right table
----------------------------------------
             0    telephone
200  pineapple  13453805505
201  pineapple  13461207097
202  pineapple  13619022642
203        NaN  13658147818
204     orange  13611310274
205        NaN  13545485123
206        NaN  13685924267
207        NaN  13562572868
208     banana  13636868229
209  pineapple  13385445647 # merge by both index , right join
----------------------------------------
             0    telephone
202  pineapple  13619022642
209  pineapple  13385445647
200  pineapple  13453805505
208     banana  13636868229
201  pineapple  13461207097
204     orange  13611310274 # merge by both index , left join
----------------------------------------
  • 这种左右均用index的merge函数使用方式可以用join函数来简化。
import pandas as pd
import numpy as np

#print "1rd DataFrame : Supplier Info 10 records"
#print "-" * 40
sid = np.arange(200, 210)
tel = np.random.randint(13300000000, 13700000000, size = 10)
supplierInfo = pd.DataFrame(tel, index = sid, columns = ["telephone"])
#print supplierInfo, "# supplierInfo"
supplierInfo.to_csv("supplierInfo.csv")
#print ""
#print "2rd DataFrame : Sales price info"
#print "-" * 40
fruit = "pineapple,banana,orange".split(",")
fruit_category = pd.Series(fruit, name = "Fruit")
fid = fruit_category.sample(len(fruit_category) * 2,replace=True)
fid.reset_index(inplace=True, drop = True)
sup = pd.Series(sid).sample(len(fid))
sup.reset_index(inplace=True, drop = True)
saleInfo = pd.DataFrame(fid.values, index = sup)
#print saleInfo, "# saleInfo"
#print "-" * 40
saleInfo.to_csv("saleInfo.csv")

print "\n\n\n"
print "*" * 40
print "saleInfo is left table, supplierInfo is right table"
print "-" * 40
right = pd.merge(saleInfo,supplierInfo, left_index=True, right_index = True, how = "right")
print right, "# merge by both index , right join"
print "-" * 40
left = pd.merge(saleInfo,supplierInfo, left_index=True, right_index = True, how = "left")
print left, "# merge by both index , left join"
print "-" * 40

print saleInfo.join(supplierInfo), "# saleInfo.join(supplierInfo)"
print "-" * 40
print supplierInfo.join(saleInfo), "# supplierInfo.join(saleInfo)"
print "-" * 40

执行结果:

****************************************
saleInfo is left table, supplierInfo is right table
----------------------------------------
             0    telephone
200        NaN  13499398039
201        NaN  13472434376
202        NaN  13469093351
203     banana  13474856911
204     orange  13329906610
205     orange  13349143149
206  pineapple  13353763211
207     banana  13333693459
208        NaN  13305384269
209     banana  13531020571 # merge by both index , right join
----------------------------------------
             0    telephone
204     orange  13329906610
206  pineapple  13353763211
205     orange  13349143149
203     banana  13474856911
207     banana  13333693459
209     banana  13531020571 # merge by both index , left join
----------------------------------------
             0    telephone
204     orange  13329906610
206  pineapple  13353763211
205     orange  13349143149
203     banana  13474856911
207     banana  13333693459
209     banana  13531020571 # saleInfo.join(supplierInfo)
----------------------------------------
       telephone          0
200  13499398039        NaN
201  13472434376        NaN
202  13469093351        NaN
203  13474856911     banana
204  13329906610     orange
205  13349143149     orange
206  13353763211  pineapple
207  13333693459     banana
208  13305384269        NaN
209  13531020571     banana # supplierInfo.join(saleInfo)
----------------------------------------

18.5 concat拼接数据

第八章的第二节除了介绍了可以实现数据库连接的merge函数外,还介绍了另一个函数就是concat函数,concat函数可以拼接数据,例如将series拼接成dataframe,将dataframe拼接成更长、更宽的dataframe数据。

18.5.1 concat函数对Series的拼接

  • 将Series拼接成DataFrame,这个比较有用。concat函数在没有指定轴的情况下默认是0轴,对于series而言竖向连接。
import pandas as pd
import numpy as np

ind = [chr(x) for x in range(ord('a'), ord('g') + 1)]
val = range(len(ind))
#print ind
ss = pd.Series(val, index = ind)
#print ss
ind_random = np.random.permutation(ind)
s0 = pd.Series(val, index = ind_random).sample(6)
print "-" * 40
print s0, "# s0"
ind_random = np.random.permutation(ind)
s1 = pd.Series(val, index = ind_random).sample(4)
print "-" * 40
print s1, "# s1"
ind_random = np.random.permutation(ind)
s2 = pd.Series(val, index = ind_random).sample(5)
print "-" * 40
print s2, "# s2"
print "-" * 40
sc = pd.concat([s0, s1, s2])
print sc, "# concat axis = 0"
print "-" * 40

执行结果:

----------------------------------------
a    2
e    3
g    4
b    5
f    0
c    6
dtype: int64 # s0
----------------------------------------
g    5
d    2
b    1
e    4
dtype: int64 # s1
----------------------------------------
g    6
d    5
a    4
e    3
c    2
dtype: int64 # s2
----------------------------------------
a    2
e    3
g    4
b    5
f    0
c    6
g    5
d    2
b    1
e    4
g    6
d    5
a    4
e    3
c    2
dtype: int64 # concat axis = 0
----------------------------------------

解释一下程序的各个部分:

1). 创建一个a~g的Series。

ind = [chr(x) for x in range(ord('a'), ord('g') + 1)]
val = range(len(ind))
#print ind
ss = pd.Series(val, index = ind)
#print ss

2). 先将作为index的索引乱序,通过permutation函数,然后通过sample函数随机选取若干个样本构建s0、s1、s2三个Series。

ind_random = np.random.permutation(ind)
s0 = pd.Series(val, index = ind_random).sample(6)
print "-" * 40
print s0, "# s0"
ind_random = np.random.permutation(ind)
s1 = pd.Series(val, index = ind_random).sample(4)
print "-" * 40
print s1, "# s1"
ind_random = np.random.permutation(ind)
s2 = pd.Series(val, index = ind_random).sample(5)
print "-" * 40
print s2, "# s2"
print "-" * 40

3). 使用concat连接这三个Series,得到一个更长的Series。

sc = pd.concat([s0, s1, s2])
print sc, "# concat axis = 0"
print "-" * 40
  • 横向连接Series可以组成DataFrame数据类型,需要注意的是DataFrame是有列索引和行索引的,在concat函数里需要使用axis = 1来告诉concat函数横向连接Series,使用keys参数来指定各个列的名字。
import pandas as pd
import numpy as np

ind = [chr(x) for x in range(ord('a'), ord('g') + 1)]
val = range(len(ind))
#print ind
ss = pd.Series(val, index = ind)
#print ss
ind_random = np.random.permutation(ind)
s0 = pd.Series(val, index = ind_random).sample(6)
print "-" * 40
print s0, "# s0"
ind_random = np.random.permutation(ind)
s1 = pd.Series(val, index = ind_random).sample(4)
print "-" * 40
print s1, "# s1"
ind_random = np.random.permutation(ind)
s2 = pd.Series(val, index = ind_random).sample(5)
print "-" * 40
print s2, "# s2"
print "-" * 40

sc = pd.concat([s0, s1, s2])
print sc, "# concat axis = 0"
print "-" * 40

df = pd.concat([s0, s1, s2], axis = 1, keys = ["col_a", "col_b", "col_c"])
print df, "# concat axis = 1"
print "-" * 40

执行结果:

----------------------------------------
d    4
f    3
e    5
g    0
c    6
b    2
dtype: int64 # s0
----------------------------------------
a    1
g    2
c    6
b    3
dtype: int64 # s1
----------------------------------------
d    1
f    2
b    0
g    6
c    5
dtype: int64 # s2
----------------------------------------
d    4
f    3
e    5
g    0
c    6
b    2
a    1
g    2
c    6
b    3
d    1
f    2
b    0
g    6
c    5
dtype: int64 # concat axis = 0
----------------------------------------
   col_a  col_b  col_c
a    NaN    1.0    NaN
b    2.0    3.0    0.0
c    6.0    6.0    5.0
d    4.0    NaN    1.0
e    5.0    NaN    NaN
f    3.0    NaN    2.0
g    0.0    2.0    6.0 # concat axis = 1
----------------------------------------

从执行结果可以看出用Series使用concat横向连接可以组成(创建)DataFrame数据类型,结果有点儿像数据库里的outer连接方式,列会出现NaN数据,很自然想到有inner内连接么?

  • concat函数通过join形参给出连接的方式,默认是外连接outer。
import pandas as pd
import numpy as np

ind = [chr(x) for x in range(ord('a'), ord('g') + 1)]
val = range(len(ind))
#print ind
ss = pd.Series(val, index = ind)
#print ss
ind_random = np.random.permutation(ind)
s0 = pd.Series(val, index = ind_random).sample(6)
print "-" * 40
print s0, "# s0"
ind_random = np.random.permutation(ind)
s1 = pd.Series(val, index = ind_random).sample(4)
print "-" * 40
print s1, "# s1"
ind_random = np.random.permutation(ind)
s2 = pd.Series(val, index = ind_random).sample(5)
print "-" * 40
print s2, "# s2"
print "-" * 40

sc = pd.concat([s0, s1, s2])
print sc, "# concat axis = 0"
print "-" * 40

df = pd.concat([s0, s1, s2], axis = 1, keys = ["col_a", "col_b", "col_c"])
print df, "# concat axis = 1"
print "-" * 40

df = pd.concat([s0, s1, s2], axis = 1, keys = ["col_a", "col_b", "col_c"], join = "inner")
print df, "# concat axis = 1, inner join"
print "-" * 40

执行结果:

----------------------------------------
g    6
d    5
c    0
b    3
a    1
e    4
dtype: int64 # s0
----------------------------------------
c    3
d    2
g    6
e    4
dtype: int64 # s1
----------------------------------------
a    1
b    6
d    5
g    2
e    3
dtype: int64 # s2
----------------------------------------
g    6
d    5
c    0
b    3
a    1
e    4
c    3
d    2
g    6
e    4
a    1
b    6
d    5
g    2
e    3
dtype: int64 # concat axis = 0
----------------------------------------
    col_a  col_b  col_c
a      1    NaN    1.0
b      3    NaN    6.0
c      0    3.0    NaN
d      5    2.0    5.0
e      4    4.0    3.0
g      6    6.0    2.0 # concat axis = 1
----------------------------------------
   col_a  col_b  col_c
g      6      6      2
d      5      2      5
e      4      4      3 # concat axis = 1, inner join
----------------------------------------

18.5.2 concat函数拼接DataFrame

同样concat函数也可以作用于dataframe数据类型。如果不指定concat函数的axis,axis默认为0,即竖向连接得到更长的dataframe,以outer方式连接,即没有的列上数据填充NaN。

import pandas as pd
import numpy as np
val1 = np.arange(24).reshape((6,4))
col1 = list("abcd")
ind1 = list("opqrst")
df1 = pd.DataFrame(val1, columns = col1, index = ind1)
val2 = np.arange(25).reshape((5,5))
col2 = list("bxced")
ind2 = list("ijkst")
df2 = pd.DataFrame(val2, columns = col2, index = ind2)
print "-" * 40
print df1, "# df1"
print "-" * 40
print df2, "# df2"
print "-" * 40
print pd.concat([df1, df2]), "# concat axis = 0, outer"
print "-" * 40

执行结果:

----------------------------------------
    a   b   c   d
o   0   1   2   3
p   4   5   6   7
q   8   9  10  11
r  12  13  14  15
s  16  17  18  19
t  20  21  22  23 # df1
----------------------------------------
    b   x   c   e   d
i   0   1   2   3   4
j   5   6   7   8   9
k  10  11  12  13  14
s  15  16  17  18  19
t  20  21  22  23  24 # df2
----------------------------------------
      a   b   c   d     e     x
o   0.0   1   2   3   NaN   NaN
p   4.0   5   6   7   NaN   NaN
q   8.0   9  10  11   NaN   NaN
r  12.0  13  14  15   NaN   NaN
s  16.0  17  18  19   NaN   NaN
t  20.0  21  22  23   NaN   NaN
i   NaN   0   2   4   3.0   1.0
j   NaN   5   7   9   8.0   6.0
k   NaN  10  12  14  13.0  11.0
s   NaN  15  17  19  18.0  16.0
t   NaN  20  22  24  23.0  21.0 # concat axis = 0, outer
----------------------------------------

可以修改这些参数,例如横向:

import pandas as pd
import numpy as np
val1 = np.arange(24).reshape((6,4))
col1 = list("abcd")
ind1 = list("opqrst")
df1 = pd.DataFrame(val1, columns = col1, index = ind1)
val2 = np.arange(25).reshape((5,5))
col2 = list("bxced")
ind2 = list("ijkst")
df2 = pd.DataFrame(val2, columns = col2, index = ind2)
print "-" * 40
print df1, "# df1"
print "-" * 40
print df2, "# df2"
print "-" * 40
print pd.concat([df1, df2]), "# concat axis = 0, outer"
print "-" * 40
print pd.concat([df1, df2], axis = 1), "# concat axis = 1, outer"
print "-" * 40
print pd.concat([df1, df2], axis = 1, join = "inner"), "# concat axis = 1, inner"
print "-" * 40

执行结果:

----------------------------------------
    a   b   c   d
o   0   1   2   3
p   4   5   6   7
q   8   9  10  11
r  12  13  14  15
s  16  17  18  19
t  20  21  22  23 # df1
----------------------------------------
    b   x   c   e   d
i   0   1   2   3   4
j   5   6   7   8   9
k  10  11  12  13  14
s  15  16  17  18  19
t  20  21  22  23  24 # df2
----------------------------------------
      a   b   c   d     e     x
o   0.0   1   2   3   NaN   NaN
p   4.0   5   6   7   NaN   NaN
q   8.0   9  10  11   NaN   NaN
r  12.0  13  14  15   NaN   NaN
s  16.0  17  18  19   NaN   NaN
t  20.0  21  22  23   NaN   NaN
i   NaN   0   2   4   3.0   1.0
j   NaN   5   7   9   8.0   6.0
k   NaN  10  12  14  13.0  11.0
s   NaN  15  17  19  18.0  16.0
t   NaN  20  22  24  23.0  21.0 # concat axis = 0, outer
----------------------------------------
      a     b     c     d     b     x     c     e     d
i   NaN   NaN   NaN   NaN   0.0   1.0   2.0   3.0   4.0
j   NaN   NaN   NaN   NaN   5.0   6.0   7.0   8.0   9.0
k   NaN   NaN   NaN   NaN  10.0  11.0  12.0  13.0  14.0
o   0.0   1.0   2.0   3.0   NaN   NaN   NaN   NaN   NaN
p   4.0   5.0   6.0   7.0   NaN   NaN   NaN   NaN   NaN
q   8.0   9.0  10.0  11.0   NaN   NaN   NaN   NaN   NaN
r  12.0  13.0  14.0  15.0   NaN   NaN   NaN   NaN   NaN
s  16.0  17.0  18.0  19.0  15.0  16.0  17.0  18.0  19.0
t  20.0  21.0  22.0  23.0  20.0  21.0  22.0  23.0  24.0 # concat axis = 1, outer
----------------------------------------
    a   b   c   d   b   x   c   e   d
s  16  17  18  19  15  16  17  18  19
t  20  21  22  23  20  21  22  23  24 # concat axis = 1, inner
----------------------------------------

对于两个dataframe均有的列,均保留,合并后原来的dataframe行数没数据填充NaN。

18.6 合并重叠数据

本节第一个例子介绍了numpy的where函数,这个之前已经讲解过了,可以查看Numpy的where函数,本节介绍的另一个函数是pandas的combine_frist函数,比较有用:

a.combine_first(b)

作用是用b的数据填补a的缺失值。

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})
print "-" * 40
print df1, "# df1"
print "-" * 40
print df2, "# df2"
print "-" * 40
print df1.combine_first(df2), "# df1.combine_first(df2)"
print "-" * 40
print df2.combine_first(df1), "# df2.combine_first(df1)"
print "-" * 40

执行结果:

----------------------------------------
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14 # df1
----------------------------------------
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0 # df2
----------------------------------------
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN # df1.combine_first(df2)
----------------------------------------
     a    b     c
0  5.0  NaN   2.0
1  4.0  3.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN # df2.combine_first(df1)
----------------------------------------

语句df1.combine_first(df2)的执行结果:

----------------------------------------
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN # df1.combine_first(df2)
----------------------------------------

df1的‘a’列1行是NaN,df2的‘a’列1行是4,combine_first函数用df2的‘a’列1行是数据4填充df1的‘a’列1行,所以结果里的‘a’列1行是4。