title: 13 Pandas怎样实现DataFrame的Merge
tags: 数据分析,pandas,小书匠
grammar_cjkRuby: true
[toc]
13 Pandas怎样实现DataFrame的Merge
Pandas的Merge,相当于Sql的Join,将不同的表按key关联到一个表
merge的语法:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
- left,right:要merge的dataframe或者有name的Series
- how:join类型,‘left’, ‘right’, ‘outer’, ‘inner’
- on:join的key,left和right都需要有这个key
- left_on:left的df或者series的key
- right_on:right的df或者seires的key
- left_index,right_index:使用index而不是普通的column做join
- suffixes:两个元素的后缀,如果列有重名,自动添加后缀,默认是(’_x’, ‘_y’)
文档地址:https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
本次讲解提纲:
- 电影数据集的join实例
- 理解merge时一对一、一对多、多对多的数量对齐关系
- 理解left join、right join、inner join、outer join的区别
- 如果出现非Key的字段重名怎么办
1、电影数据集的join实例
电影评分数据集
是推荐系统研究的很好的数据集
位于本代码目录:./datas/movielens-1m
包含三个文件:
- 用户对电影的评分数据 ratings.dat
- 用户本身的信息数据 users.dat
- 电影本身的数据 movies.dat
可以关联三个表,得到一个完整的大表
数据集官方地址:https://grouplens.org/datasets/movielens/
import pandas as pd
df_ratings = pd.read_csv(
"./datas/movielens-1m/ratings.dat",
sep="::",
engine='python',
names="UserID::MovieID::Rating::Timestamp".split("::")
)
df_ratings.head()
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
UserID |
MovieID |
Rating |
Timestamp |
0 |
1 |
1193 |
5 |
978300760 |
1 |
1 |
661 |
3 |
978302109 |
2 |
1 |
914 |
3 |
978301968 |
3 |
1 |
3408 |
4 |
978300275 |
4 |
1 |
2355 |
5 |
978824291 |
df_users = pd.read_csv(
"./datas/movielens-1m/users.dat",
sep="::",
engine='python',
names="UserID::Gender::Age::Occupation::Zip-code".split("::")
)
df_users.head()
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
UserID |
Gender |
Age |
Occupation |
Zip-code |
0 |
1 |
F |
1 |
10 |
48067 |
1 |
2 |
M |
56 |
16 |
70072 |
2 |
3 |
M |
25 |
15 |
55117 |
3 |
4 |
M |
45 |
7 |
02460 |
4 |
5 |
M |
25 |
20 |
55455 |
df_movies = pd.read_csv(
"./datas/movielens-1m/movies.dat",
sep="::",
engine='python',
names="MovieID::Title::Genres".split("::")
)
df_movies.head()
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
MovieID |
Title |
Genres |
0 |
1 |
Toy Story (1995) |
Animation|Children's|Comedy |
1 |
2 |
Jumanji (1995) |
Adventure|Children's|Fantasy |
2 |
3 |
Grumpier Old Men (1995) |
Comedy|Romance |
3 |
4 |
Waiting to Exhale (1995) |
Comedy|Drama |
4 |
5 |
Father of the Bride Part II (1995) |
Comedy |
df_ratings_users = pd.merge(
df_ratings, df_users, left_on="UserID", right_on="UserID", how="inner"
)
df_ratings_users.head()
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
UserID |
MovieID |
Rating |
Timestamp |
Gender |
Age |
Occupation |
Zip-code |
0 |
1 |
1193 |
5 |
978300760 |
F |
1 |
10 |
48067 |
1 |
1 |
661 |
3 |
978302109 |
F |
1 |
10 |
48067 |
2 |
1 |
914 |
3 |
978301968 |
F |
1 |
10 |
48067 |
3 |
1 |
3408 |
4 |
978300275 |
F |
1 |
10 |
48067 |
4 |
1 |
2355 |
5 |
978824291 |
F |
1 |
10 |
48067 |
df_ratings_users_movies = pd.merge(
df_ratings_users, df_movies, left_on="MovieID", right_on="MovieID", how="inner"
)
df_ratings_users_movies.head(10)
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
UserID |
MovieID |
Rating |
Timestamp |
Gender |
Age |
Occupation |
Zip-code |
Title |
Genres |
0 |
1 |
1193 |
5 |
978300760 |
F |
1 |
10 |
48067 |
One Flew Over the Cuckoo's Nest (1975) |
Drama |
1 |
2 |
1193 |
5 |
978298413 |
M |
56 |
16 |
70072 |
One Flew Over the Cuckoo's Nest (1975) |
Drama |
2 |
12 |
1193 |
4 |
978220179 |
M |
25 |
12 |
32793 |
One Flew Over the Cuckoo's Nest (1975) |
Drama |
3 |
15 |
1193 |
4 |
978199279 |
M |
25 |
7 |
22903 |
One Flew Over the Cuckoo's Nest (1975) |
Drama |
4 |
17 |
1193 |
5 |
978158471 |
M |
50 |
1 |
95350 |
One Flew Over the Cuckoo's Nest (1975) |
Drama |
5 |
18 |
1193 |
4 |
978156168 |
F |
18 |
3 |
95825 |
One Flew Over the Cuckoo's Nest (1975) |
Drama |
6 |
19 |
1193 |
5 |
982730936 |
M |
1 |
10 |
48073 |
One Flew Over the Cuckoo's Nest (1975) |
Drama |
7 |
24 |
1193 |
5 |
978136709 |
F |
25 |
7 |
10023 |
One Flew Over the Cuckoo's Nest (1975) |
Drama |
8 |
28 |
1193 |
3 |
978125194 |
F |
25 |
1 |
14607 |
One Flew Over the Cuckoo's Nest (1975) |
Drama |
9 |
33 |
1193 |
5 |
978557765 |
M |
45 |
3 |
55421 |
One Flew Over the Cuckoo's Nest (1975) |
Drama |
2、理解merge时数量的对齐关系
以下关系要正确理解:
- one-to-one:一对一关系,关联的key都是唯一的
- 比如(学号,姓名) merge (学号,年龄)
- 结果条数为:1*1
- one-to-many:一对多关系,左边唯一key,右边不唯一key
- 比如(学号,姓名) merge (学号,[语文成绩、数学成绩、英语成绩])
- 结果条数为:1*N
- many-to-many:多对多关系,左边右边都不是唯一的
- 比如(学号,[语文成绩、数学成绩、英语成绩]) merge (学号,[篮球、足球、乒乓球])
- 结果条数为:M*N
2.1 one-to-one 一对一关系的merge
left = pd.DataFrame({'sno': [11, 12, 13, 14],
'name': ['name_a', 'name_b', 'name_c', 'name_d']
})
left
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
sno |
name |
0 |
11 |
name_a |
1 |
12 |
name_b |
2 |
13 |
name_c |
3 |
14 |
name_d |
right = pd.DataFrame({'sno': [11, 12, 13, 14],
'age': ['21', '22', '23', '24']
})
right
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
sno |
age |
0 |
11 |
21 |
1 |
12 |
22 |
2 |
13 |
23 |
3 |
14 |
24 |
pd.merge(left, right, on='sno')
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
sno |
name |
age |
0 |
11 |
name_a |
21 |
1 |
12 |
name_b |
22 |
2 |
13 |
name_c |
23 |
3 |
14 |
name_d |
24 |
2.2 one-to-many 一对多关系的merge
注意:数据会被复制
left = pd.DataFrame({'sno': [11, 12, 13, 14],
'name': ['name_a', 'name_b', 'name_c', 'name_d']
})
left
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
sno |
name |
0 |
11 |
name_a |
1 |
12 |
name_b |
2 |
13 |
name_c |
3 |
14 |
name_d |
right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
})
right
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
sno |
grade |
0 |
11 |
语文88 |
1 |
11 |
数学90 |
2 |
11 |
英语75 |
3 |
12 |
语文66 |
4 |
12 |
数学55 |
5 |
13 |
英语29 |
pd.merge(left, right, on='sno')
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
sno |
name |
grade |
0 |
11 |
name_a |
语文88 |
1 |
11 |
name_a |
数学90 |
2 |
11 |
name_a |
英语75 |
3 |
12 |
name_b |
语文66 |
4 |
12 |
name_b |
数学55 |
5 |
13 |
name_c |
英语29 |
2.3 many-to-many 多对多关系的merge
注意:结果数量会出现乘法
left = pd.DataFrame({'sno': [11, 11, 12, 12,12],
'爱好': ['篮球', '羽毛球', '乒乓球', '篮球', "足球"]
})
left
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
sno |
爱好 |
0 |
11 |
篮球 |
1 |
11 |
羽毛球 |
2 |
12 |
乒乓球 |
3 |
12 |
篮球 |
4 |
12 |
足球 |
right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
})
right
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
sno |
grade |
0 |
11 |
语文88 |
1 |
11 |
数学90 |
2 |
11 |
英语75 |
3 |
12 |
语文66 |
4 |
12 |
数学55 |
5 |
13 |
英语29 |
pd.merge(left, right, on='sno')
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
sno |
爱好 |
grade |
0 |
11 |
篮球 |
语文88 |
1 |
11 |
篮球 |
数学90 |
2 |
11 |
篮球 |
英语75 |
3 |
11 |
羽毛球 |
语文88 |
4 |
11 |
羽毛球 |
数学90 |
5 |
11 |
羽毛球 |
英语75 |
6 |
12 |
乒乓球 |
语文66 |
7 |
12 |
乒乓球 |
数学55 |
8 |
12 |
篮球 |
语文66 |
9 |
12 |
篮球 |
数学55 |
10 |
12 |
足球 |
语文66 |
11 |
12 |
足球 |
数学55 |
3、理解left join、right join、inner join、outer join的区别
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
'C': ['C0', 'C1', 'C4', 'C5'],
'D': ['D0', 'D1', 'D4', 'D5']})
left
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
key |
A |
B |
0 |
K0 |
A0 |
B0 |
1 |
K1 |
A1 |
B1 |
2 |
K2 |
A2 |
B2 |
3 |
K3 |
A3 |
B3 |
right
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
key |
C |
D |
0 |
K0 |
C0 |
D0 |
1 |
K1 |
C1 |
D1 |
2 |
K4 |
C4 |
D4 |
3 |
K5 |
C5 |
D5 |
3.1 inner join,默认
左边和右边的key都有,才会出现在结果里
pd.merge(left, right, how='inner')
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
key |
A |
B |
C |
D |
0 |
K0 |
A0 |
B0 |
C0 |
D0 |
1 |
K1 |
A1 |
B1 |
C1 |
D1 |
3.2 left join
左边的都会出现在结果里,右边的如果无法匹配则为Null
pd.merge(left, right, how='left')
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
key |
A |
B |
C |
D |
0 |
K0 |
A0 |
B0 |
C0 |
D0 |
1 |
K1 |
A1 |
B1 |
C1 |
D1 |
2 |
K2 |
A2 |
B2 |
NaN |
NaN |
3 |
K3 |
A3 |
B3 |
NaN |
NaN |
3.3 right join
右边的都会出现在结果里,左边的如果无法匹配则为Null
pd.merge(left, right, how='right')
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
key |
A |
B |
C |
D |
0 |
K0 |
A0 |
B0 |
C0 |
D0 |
1 |
K1 |
A1 |
B1 |
C1 |
D1 |
2 |
K4 |
NaN |
NaN |
C4 |
D4 |
3 |
K5 |
NaN |
NaN |
C5 |
D5 |
3.4 outer join
左边、右边的都会出现在结果里,如果无法匹配则为Null
pd.merge(left, right, how='outer')
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
key |
A |
B |
C |
D |
0 |
K0 |
A0 |
B0 |
C0 |
D0 |
1 |
K1 |
A1 |
B1 |
C1 |
D1 |
2 |
K2 |
A2 |
B2 |
NaN |
NaN |
3 |
K3 |
A3 |
B3 |
NaN |
NaN |
4 |
K4 |
NaN |
NaN |
C4 |
D4 |
5 |
K5 |
NaN |
NaN |
C5 |
D5 |
4、如果出现非Key的字段重名怎么办
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
'A': ['A10', 'A11', 'A12', 'A13'],
'D': ['D0', 'D1', 'D4', 'D5']})
left
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
key |
A |
B |
0 |
K0 |
A0 |
B0 |
1 |
K1 |
A1 |
B1 |
2 |
K2 |
A2 |
B2 |
3 |
K3 |
A3 |
B3 |
right
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
key |
A |
D |
0 |
K0 |
A10 |
D0 |
1 |
K1 |
A11 |
D1 |
2 |
K4 |
A12 |
D4 |
3 |
K5 |
A13 |
D5 |
pd.merge(left, right, on='key')
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
key |
A_x |
B |
A_y |
D |
0 |
K0 |
A0 |
B0 |
A10 |
D0 |
1 |
K1 |
A1 |
B1 |
A11 |
D1 |
pd.merge(left, right, on='key', suffixes=('_left', '_right'))
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
<pre><code>.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</code></pre>
|
key |
A_left |
B |
A_right |
D |
0 |
K0 |
A0 |
B0 |
A10 |
D0 |
1 |
K1 |
A1 |
B1 |
A11 |
D1 |
本文使用 文章同步助手 同步
网友评论