Pandas
具有全功能、高性能的内存中连接操作,在惯用上与 SQL 等关系数据库非常相似。
Pandas
提供了一个单独的 merge()
函数,作为DataFrame对象之间所有标准数据库连接操作的入口。
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True)
在这里,可以使用以下几个参数:
left
- 一个DataFrame对象。right
- 另一个DataFrame对象。on
- 列(名称)连接,必须在左右DataFrame对象中存在(找到)。left_on
- 左侧DataFrame中用作键的列。可以是列名或长度等于DataFrame长度的数组。right_on
- 来自右的DataFrame中用作键的列。可以是列名或长度等于DataFrame长度的数组。left_index
- 如果为True
,则使用左侧DataFrame中的索引(行标签)作为其连接键。 在具有MultiIndex(分层)的DataFrame的情况下,级别的数量必须与来自右DataFrame的连接键的数量相匹配。right_index
- 与右侧 DataFrame的left_index
具有相同的用法。how
- 它是left
,right
,outer
以及inner
之中的一个, 默认为内inner
。下面描述了每种方法。sort
- 按照字典顺序通过连接键对结果DataFrame进行排序。 默认为True
,设置为False
时,在很多情况下大大提高性能。
合并操作
现在创建两个不同的DataFrame并对其执行合并操作。
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print (left)
id Name subject_id 0 1 Alex sub1 1 2 Amy sub2 2 3 Allen sub4 3 4 Alice sub6 4 5 Ayoung sub5
print("========================================")
========================================
print (right)
id Name subject_id 0 1 Billy sub2 1 2 Brian sub4 2 3 Bran sub3 3 4 Bryce sub6 4 5 Betty sub5
在一个键上合并两个数据帧
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
rs = pd.merge(left,right,on='id')
print(rs)
id Name_x subject_id_x Name_y subject_id_y 0 1 Alex sub1 Billy sub2 1 2 Amy sub2 Brian sub4 2 3 Allen sub4 Bran sub3 3 4 Alice sub6 Bryce sub6 4 5 Ayoung sub5 Betty sub5
合并多个键上的两个数据帧
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
rs = pd.merge(left,right,on=['id','subject_id'])
print(rs)
id Name_x subject_id Name_y 0 4 Alice sub6 Bryce 1 5 Ayoung sub5 Betty
合并方法 | SQL等效 | 描述 |
---|---|---|
left |
LEFT OUTER JOIN |
使用左侧对象的键 |
right |
RIGHT OUTER JOIN |
使用右侧对象的键 |
outer |
FULL OUTER JOIN |
使用键的联合 |
inner |
INNER JOIN |
使用键的交集 |
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
rs = pd.merge(left, right, on='subject_id', how='left')
print (rs)
id_x Name_x subject_id id_y Name_y 0 1 Alex sub1 NaN NaN 1 2 Amy sub2 1.0 Billy 2 3 Allen sub4 2.0 Brian 3 4 Alice sub6 4.0 Bryce 4 5 Ayoung sub5 5.0 Betty
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
rs = pd.merge(left, right, on='subject_id', how='right')
print (rs)
id_x Name_x subject_id id_y Name_y 0 2.0 Amy sub2 1 Billy 1 3.0 Allen sub4 2 Brian 2 NaN NaN sub3 3 Bran 3 4.0 Alice sub6 4 Bryce 4 5.0 Ayoung sub5 5 Betty
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
rs = pd.merge(left, right, how='outer', on='subject_id')
print (rs)
id_x Name_x subject_id id_y Name_y 0 1.0 Alex sub1 NaN NaN 1 2.0 Amy sub2 1.0 Billy 2 NaN NaN sub3 3.0 Bran 3 3.0 Allen sub4 2.0 Brian 4 5.0 Ayoung sub5 5.0 Betty 5 4.0 Alice sub6 4.0 Bryce
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
rs = pd.merge(left, right, on='subject_id', how='inner')
print (rs)
id_x Name_x subject_id id_y Name_y 0 2 Amy sub2 1 Billy 1 3 Allen sub4 2 Brian 2 4 Alice sub6 4 Bryce 3 5 Ayoung sub5 5 Betty