27. Joining data in Pandas: the merge function

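The concat function can perform inner and outer joins, but the pandas merge function implements true database-style inner and outer joins, and its outer joins also come in left and right variants.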
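To make the difference concrete, here is a minimal sketch, assuming two small hypothetical frames (`left` and `right` are illustrative names, not from the examples that follow). With `axis=1`, `pd.concat` pairs rows by index label, whereas `merge` pairs rows by the values in the shared column.

```python
import pandas as pd

# Hypothetical frames: the same class_id values appear in a different row order.
left = pd.DataFrame({"class_id": [100, 101], "class_name": ["IT", "CS"]})
right = pd.DataFrame({"class_id": [101, 100], "stu_id": [20181117, 20181115]})

# concat with join="inner" keeps the index labels common to both frames
# (0 and 1 here), so rows are paired by position, not by class_id.
by_index = pd.concat([left, right], axis=1, join="inner")

# merge pairs rows whose values in the shared column class_id are equal.
by_key = left.merge(right)

print(by_index)
print(by_key)
```

In `by_index` the class_id column appears twice and the two copies disagree row by row, while `by_key` has a single class_id column with each student attached to the right course.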

  • By default, merge combines data with an inner join. The example below uses student course selection: two DataFrames are built and then joined with merge.
import pandas as pd

col1 = "class_name class_id class_lecturer".split()
col2 = "class_id stu_id".split()
val1 = [["IT", 100, "Wangli"], ["CS", 101, "WangMa"], ["CAD", 102, "Liping"]]
val2 = [[100, 20181115], [100, 20181116], [101, 20181117]]
course = pd.DataFrame(val1, columns=col1)  # courses on offer
print("***course", "*" * 38)
print(course)
choose = pd.DataFrame(val2, columns=col2)  # which student chose which course
print("***choose", "*" * 38)
print(choose)
# No "how" argument: inner join on the shared column class_id
print("***course merge choose", "*" * 25)
print(course.merge(choose))
print("***choose merge course", "*" * 25)
print(choose.merge(course))

Output of the program:

***course **************************************
  class_name  class_id class_lecturer
0         IT       100         Wangli
1         CS       101         WangMa
2        CAD       102         Liping
***choose **************************************
   class_id    stu_id
0       100  20181115
1       100  20181116
2       101  20181117
***course merge choose *************************
  class_name  class_id class_lecturer    stu_id
0         IT       100         Wangli  20181115
1         IT       100         Wangli  20181116
2         CS       101         WangMa  20181117
***choose merge course *************************
   class_id    stu_id class_name class_lecturer
0       100  20181115         IT         Wangli
1       100  20181116         IT         Wangli
2       101  20181117         CS         WangMa
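
The call above relies on two defaults: `how="inner"`, and joining on the columns the two frames have in common (here only class_id). The sketch below spells both out with the `how` and `on` parameters; the variable names `implicit`, `explicit` and `func_form` are illustrative only.

```python
import pandas as pd

course = pd.DataFrame(
    [["IT", 100, "Wangli"], ["CS", 101, "WangMa"], ["CAD", 102, "Liping"]],
    columns=["class_name", "class_id", "class_lecturer"],
)
choose = pd.DataFrame(
    [[100, 20181115], [100, 20181116], [101, 20181117]],
    columns=["class_id", "stu_id"],
)

# Relying on the defaults ...
implicit = course.merge(choose)
# ... versus naming the join type and the key column explicitly.
explicit = course.merge(choose, how="inner", on="class_id")
# The same join written with the top-level function.
func_form = pd.merge(course, choose, how="inner", on="class_id")

print(implicit.equals(explicit))
print(explicit)
```

Naming the key with `on` is a reasonable habit once the frames share more than one column name, because the default would then join on all of them.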
  • The outer join with merge. Every row of both DataFrames appears in the result, and fields with no match are filled with NaN.
import pandas as pd

col1 = "class_name class_id class_lecturer".split()
col2 = "class_id stu_id".split()
val1 = [["IT", 100, "Wangli"], ["CS", 101, "WangMa"], ["CAD", 102, "Liping"],
        ["ME", 103, "Wufang"], ["IT", 104, "Xiaomin"]]
val2 = [[100, 20181115], [100, 20181116], [101, 20181117]]
course = pd.DataFrame(val1, columns=col1)
print("***course", "*" * 38)
print(course)
choose = pd.DataFrame(val2, columns=col2)
print("***choose", "*" * 38)
print(choose)
# how="inner" keeps only the class_id values present in both frames
print("***course merge choose in inner", "*" * 25)
print(course.merge(choose, how="inner"))
# how="outer" keeps every class_id from either frame
print("***course merge choose in outer", "*" * 25)
print(course.merge(choose, how="outer"))
print("***choose merge course in inner", "*" * 25)
print(choose.merge(course, how="inner"))
print("***choose merge course in outer", "*" * 25)
print(choose.merge(course, how="outer"))

Output of the program:

***course **************************************
  class_name  class_id class_lecturer
0         IT       100         Wangli
1         CS       101         WangMa
2        CAD       102         Liping
3         ME       103         Wufang
4         IT       104        Xiaomin
***choose **************************************
   class_id    stu_id
0       100  20181115
1       100  20181116
2       101  20181117
***course merge choose in inner *************************
  class_name  class_id class_lecturer    stu_id
0         IT       100         Wangli  20181115
1         IT       100         Wangli  20181116
2         CS       101         WangMa  20181117
***course merge choose in outer *************************
  class_name  class_id class_lecturer    stu_id
0         IT       100         Wangli  20181115
1         IT       100         Wangli  20181116
2         CS       101         WangMa  20181117
3        CAD       102         Liping       NaN
4         ME       103         Wufang       NaN
5         IT       104        Xiaomin       NaN
***choose merge course in inner *************************
   class_id    stu_id class_name class_lecturer
0       100  20181115         IT         Wangli
1       100  20181116         IT         Wangli
2       101  20181117         CS         WangMa
***choose merge course in outer *************************
   class_id    stu_id class_name class_lecturer
0       100  20181115         IT         Wangli
1       100  20181116         IT         Wangli
2       101  20181117         CS         WangMa
3       102       NaN        CAD         Liping
4       103       NaN         ME         Wufang
5       104       NaN         IT        Xiaomin
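
When reading an outer join it often helps to know which side each row came from. The following sketch uses merge's `indicator=True` option, which adds a `_merge` column with the values `left_only`, `right_only` or `both`; the variable names `merged` and `unchosen` are illustrative. It also prints the dtype of stu_id, which becomes float64 because NaN is a floating-point value.

```python
import pandas as pd

course = pd.DataFrame(
    [["IT", 100, "Wangli"], ["CS", 101, "WangMa"], ["CAD", 102, "Liping"],
     ["ME", 103, "Wufang"], ["IT", 104, "Xiaomin"]],
    columns=["class_name", "class_id", "class_lecturer"],
)
choose = pd.DataFrame(
    [[100, 20181115], [100, 20181116], [101, 20181117]],
    columns=["class_id", "stu_id"],
)

# indicator=True records the origin of every row in a "_merge" column.
merged = course.merge(choose, how="outer", indicator=True)
print(merged[["class_id", "stu_id", "_merge"]])

# Courses that exist only in the left frame, i.e. that nobody chose.
unchosen = merged.loc[merged["_merge"] == "left_only", "class_name"].tolist()
print(unchosen)

# The integer column was upcast so that it can hold NaN.
print(merged["stu_id"].dtype)
```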
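  • Left and right joins with merge. The DataFrame on which merge is called is the "left table", and the DataFrame passed in as the argument is the "right table". A left join outputs every row of the left table together with the matching rows of the right table, filling in NaN where nothing matches. Likewise, a right join outputs every row of the right table together with the matching rows of the left table, again filling in NaN where nothing matches.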
import pandas as pd

col1 = "class_name class_id class_lecturer".split()
col2 = "class_id stu_id".split()
val1 = [["IT", 100, "Wangli"], ["CS", 101, "WangMa"], ["CAD", 102, "Liping"],
        ["ME", 103, "Wufang"], ["IT", 104, "Xiaomin"]]
val2 = [[100, 20181115], [100, 20181116], [101, 20181117],
        [100, 20181118], [101, 20181119], [200, 20181120]]
course = pd.DataFrame(val1, columns=col1)
print("***course", "*" * 38)
print(course)
choose = pd.DataFrame(val2, columns=col2)
print("***choose", "*" * 38)
print(choose)
# course is the left table, choose is the right table
print("***course merge choose in left", "*" * 25)
print(course.merge(choose, how="left"))
print("***course merge choose in right", "*" * 25)
print(course.merge(choose, how="right"))
# choose is the left table, course is the right table
print("***choose merge course in left", "*" * 25)
print(choose.merge(course, how="left"))
print("***choose merge course in right", "*" * 25)
print(choose.merge(course, how="right"))

Output of the program:

***course **************************************
  class_name  class_id class_lecturer
0         IT       100         Wangli
1         CS       101         WangMa
2        CAD       102         Liping
3         ME       103         Wufang
4         IT       104        Xiaomin
***choose **************************************
   class_id    stu_id
0       100  20181115
1       100  20181116
2       101  20181117
3       100  20181118
4       101  20181119
5       200  20181120
***course merge choose in left *************************
  class_name  class_id class_lecturer    stu_id
0         IT       100         Wangli  20181115
1         IT       100         Wangli  20181116
2         IT       100         Wangli  20181118
3         CS       101         WangMa  20181117
4         CS       101         WangMa  20181119
5        CAD       102         Liping       NaN
6         ME       103         Wufang       NaN
7         IT       104        Xiaomin       NaN
***course merge choose in right *************************
  class_name  class_id class_lecturer    stu_id
0         IT       100         Wangli  20181115
1         IT       100         Wangli  20181116
2         IT       100         Wangli  20181118
3         CS       101         WangMa  20181117
4         CS       101         WangMa  20181119
5        NaN       200            NaN  20181120
***choose merge course in left *************************
   class_id    stu_id class_name class_lecturer
0       100  20181115         IT         Wangli
1       100  20181116         IT         Wangli
2       101  20181117         CS         WangMa
3       100  20181118         IT         Wangli
4       101  20181119         CS         WangMa
5       200  20181120        NaN            NaN
***choose merge course in right *************************
   class_id    stu_id class_name class_lecturer
0       100  20181115         IT         Wangli
1       100  20181116         IT         Wangli
2       100  20181118         IT         Wangli
3       101  20181117         CS         WangMa
4       101  20181119         CS         WangMa
5       102       NaN        CAD         Liping
6       103       NaN         ME         Wufang
7       104       NaN         IT        Xiaomin

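Note the selection record [200, 20181120]: course id 200 does not exist in course. Conversely, the three courses ["CAD", 102, "Liping"], ["ME", 103, "Wufang"] and ["IT", 104, "Xiaomin"] have not been chosen by any student. As the results show, the left join and right join of merge correspond directly to the left join and right join of database tables.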
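
One consequence of this correspondence is that a right join is a left join with the two tables swapped. The sketch below checks this on the same data: after aligning column order and sorting rows (the row order of a right join can differ between pandas versions), the two results contain the same records. The helper `normalize` and the other variable names are illustrative, not part of pandas.

```python
import pandas as pd

course = pd.DataFrame(
    [["IT", 100, "Wangli"], ["CS", 101, "WangMa"], ["CAD", 102, "Liping"],
     ["ME", 103, "Wufang"], ["IT", 104, "Xiaomin"]],
    columns=["class_name", "class_id", "class_lecturer"],
)
choose = pd.DataFrame(
    [[100, 20181115], [100, 20181116], [101, 20181117],
     [100, 20181118], [101, 20181119], [200, 20181120]],
    columns=["class_id", "stu_id"],
)

right_join = course.merge(choose, how="right")  # every row of choose is kept
left_join = choose.merge(course, how="left")    # every row of choose is kept


def normalize(df):
    """Put columns and rows in a fixed order so two results can be compared."""
    cols = ["class_id", "stu_id", "class_name", "class_lecturer"]
    return df[cols].sort_values(["class_id", "stu_id"]).reset_index(drop=True)


same_records = normalize(right_join).equals(normalize(left_join))
print(same_records)

# Selection records whose course id has no match in course.
orphans = left_join.loc[left_join["class_name"].isna(), "class_id"].tolist()
print(orphans)
```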