美文网首页
13 Pandas怎样实现DataFrame的Merge

13 Pandas怎样实现DataFrame的Merge

作者: Viterbi | 来源:发表于2022-11-07 12:03 被阅读0次

    title: 13 Pandas怎样实现DataFrame的Merge
    tags: 数据分析,pandas,小书匠 grammar_cjkRuby: true

    [toc]

    13 Pandas怎样实现DataFrame的Merge

    Pandas的Merge,相当于Sql的Join,将不同的表按key关联到一个表

    merge的语法:

    pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

    • left,right:要merge的dataframe或者有name的Series
    • how:join类型,‘left’, ‘right’, ‘outer’, ‘inner’
    • on:join的key,left和right都需要有这个key
    • left_on:left的df或者series的key
    • right_on:right的df或者seires的key
    • left_index,right_index:使用index而不是普通的column做join
    • suffixes:两个元素的后缀,如果列有重名,自动添加后缀,默认是(’_x’, ‘_y’)

    文档地址:https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html

    本次讲解提纲:

    1. 电影数据集的join实例
    2. 理解merge时一对一、一对多、多对多的数量对齐关系
    3. 理解left join、right join、inner join、outer join的区别
    4. 如果出现非Key的字段重名怎么办

    1、电影数据集的join实例

    电影评分数据集

    是推荐系统研究的很好的数据集 位于本代码目录:./datas/movielens-1m

    包含三个文件:

    1. 用户对电影的评分数据 ratings.dat
    2. 用户本身的信息数据 users.dat
    3. 电影本身的数据 movies.dat

    可以关联三个表,得到一个完整的大表

    数据集官方地址:https://grouplens.org/datasets/movielens/

    import pandas as pd
    
    #   用户对电影的评分数据
    df_ratings = pd.read_csv(
        "./datas/movielens-1m/ratings.dat", 
        sep="::",
        engine='python', 
        names="UserID::MovieID::Rating::Timestamp".split("::")
    )
    
    df_ratings.head()
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    UserID MovieID Rating Timestamp
    0 1 1193 5 978300760
    1 1 661 3 978302109
    2 1 914 3 978301968
    3 1 3408 4 978300275
    4 1 2355 5 978824291
    
    #  用户本身的信息数据
    df_users = pd.read_csv(
        "./datas/movielens-1m/users.dat", 
        sep="::",
        engine='python', 
        names="UserID::Gender::Age::Occupation::Zip-code".split("::")
    )
    
    df_users.head()
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    UserID Gender Age Occupation Zip-code
    0 1 F 1 10 48067
    1 2 M 56 16 70072
    2 3 M 25 15 55117
    3 4 M 45 7 02460
    4 5 M 25 20 55455
    
    # 电影数据
    df_movies = pd.read_csv(
        "./datas/movielens-1m/movies.dat", 
        sep="::",
        engine='python', 
        names="MovieID::Title::Genres".split("::")
    )
    
    df_movies.head()
    
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    MovieID Title Genres
    0 1 Toy Story (1995) Animation|Children's|Comedy
    1 2 Jumanji (1995) Adventure|Children's|Fantasy
    2 3 Grumpier Old Men (1995) Comedy|Romance
    3 4 Waiting to Exhale (1995) Comedy|Drama
    4 5 Father of the Bride Part II (1995) Comedy
    # 用户评分关联用户信息
    
    df_ratings_users = pd.merge(
       df_ratings, df_users, left_on="UserID", right_on="UserID", how="inner"
    )
    
    df_ratings_users.head()
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    UserID MovieID Rating Timestamp Gender Age Occupation Zip-code
    0 1 1193 5 978300760 F 1 10 48067
    1 1 661 3 978302109 F 1 10 48067
    2 1 914 3 978301968 F 1 10 48067
    3 1 3408 4 978300275 F 1 10 48067
    4 1 2355 5 978824291 F 1 10 48067
    
    # 用户评分关联电影
    df_ratings_users_movies = pd.merge(
        df_ratings_users, df_movies, left_on="MovieID", right_on="MovieID", how="inner"
    )
    
    df_ratings_users_movies.head(10)
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    UserID MovieID Rating Timestamp Gender Age Occupation Zip-code Title Genres
    0 1 1193 5 978300760 F 1 10 48067 One Flew Over the Cuckoo's Nest (1975) Drama
    1 2 1193 5 978298413 M 56 16 70072 One Flew Over the Cuckoo's Nest (1975) Drama
    2 12 1193 4 978220179 M 25 12 32793 One Flew Over the Cuckoo's Nest (1975) Drama
    3 15 1193 4 978199279 M 25 7 22903 One Flew Over the Cuckoo's Nest (1975) Drama
    4 17 1193 5 978158471 M 50 1 95350 One Flew Over the Cuckoo's Nest (1975) Drama
    5 18 1193 4 978156168 F 18 3 95825 One Flew Over the Cuckoo's Nest (1975) Drama
    6 19 1193 5 982730936 M 1 10 48073 One Flew Over the Cuckoo's Nest (1975) Drama
    7 24 1193 5 978136709 F 25 7 10023 One Flew Over the Cuckoo's Nest (1975) Drama
    8 28 1193 3 978125194 F 25 1 14607 One Flew Over the Cuckoo's Nest (1975) Drama
    9 33 1193 5 978557765 M 45 3 55421 One Flew Over the Cuckoo's Nest (1975) Drama

    2、理解merge时数量的对齐关系

    以下关系要正确理解:

    • one-to-one:一对一关系,关联的key都是唯一的
      • 比如(学号,姓名) merge (学号,年龄)
      • 结果条数为:1*1
    • one-to-many:一对多关系,左边唯一key,右边不唯一key
      • 比如(学号,姓名) merge (学号,[语文成绩、数学成绩、英语成绩])
      • 结果条数为:1*N
    • many-to-many:多对多关系,左边右边都不是唯一的
      • 比如(学号,[语文成绩、数学成绩、英语成绩]) merge (学号,[篮球、足球、乒乓球])
      • 结果条数为:M*N

    2.1 one-to-one 一对一关系的merge

    left = pd.DataFrame({'sno': [11, 12, 13, 14],
                          'name': ['name_a', 'name_b', 'name_c', 'name_d']
                        })
    left
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    sno name
    0 11 name_a
    1 12 name_b
    2 13 name_c
    3 14 name_d
    right = pd.DataFrame({'sno': [11, 12, 13, 14],
                          'age': ['21', '22', '23', '24']
                        })
    right
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    sno age
    0 11 21
    1 12 22
    2 13 23
    3 14 24
    # 一对一关系,结果中有4条
    pd.merge(left, right, on='sno')
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    sno name age
    0 11 name_a 21
    1 12 name_b 22
    2 13 name_c 23
    3 14 name_d 24

    2.2 one-to-many 一对多关系的merge

    注意:数据会被复制

    
    left = pd.DataFrame({'sno': [11, 12, 13, 14],
                          'name': ['name_a', 'name_b', 'name_c', 'name_d']
                        })
    left
    
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    sno name
    0 11 name_a
    1 12 name_b
    2 13 name_c
    3 14 name_d
    right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
                           'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
                         })
    right
    
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    sno grade
    0 11 语文88
    1 11 数学90
    2 11 英语75
    3 12 语文66
    4 12 数学55
    5 13 英语29
    # 数目以多的一边为准
    pd.merge(left, right, on='sno')
    
    
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    sno name grade
    0 11 name_a 语文88
    1 11 name_a 数学90
    2 11 name_a 英语75
    3 12 name_b 语文66
    4 12 name_b 数学55
    5 13 name_c 英语29

    2.3 many-to-many 多对多关系的merge

    注意:结果数量会出现乘法

    left = pd.DataFrame({'sno': [11, 11, 12, 12,12],
                          '爱好': ['篮球', '羽毛球', '乒乓球', '篮球', "足球"]
                        })
    left
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    sno 爱好
    0 11 篮球
    1 11 羽毛球
    2 12 乒乓球
    3 12 篮球
    4 12 足球
    right = pd.DataFrame({'sno': [11, 11, 11, 12, 12, 13],
                           'grade': ['语文88', '数学90', '英语75','语文66', '数学55', '英语29']
                         })
    right
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    sno grade
    0 11 语文88
    1 11 数学90
    2 11 英语75
    3 12 语文66
    4 12 数学55
    5 13 英语29
    pd.merge(left, right, on='sno')
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    sno 爱好 grade
    0 11 篮球 语文88
    1 11 篮球 数学90
    2 11 篮球 英语75
    3 11 羽毛球 语文88
    4 11 羽毛球 数学90
    5 11 羽毛球 英语75
    6 12 乒乓球 语文66
    7 12 乒乓球 数学55
    8 12 篮球 语文66
    9 12 篮球 数学55
    10 12 足球 语文66
    11 12 足球 数学55

    3、理解left join、right join、inner join、outer join的区别

    left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'A': ['A0', 'A1', 'A2', 'A3'],
                          'B': ['B0', 'B1', 'B2', 'B3']})
    
    right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                          'C': ['C0', 'C1', 'C4', 'C5'],
                          'D': ['D0', 'D1', 'D4', 'D5']})
    
    left
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    key A B
    0 K0 A0 B0
    1 K1 A1 B1
    2 K2 A2 B2
    3 K3 A3 B3
    right
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    key C D
    0 K0 C0 D0
    1 K1 C1 D1
    2 K4 C4 D4
    3 K5 C5 D5

    3.1 inner join,默认

    左边和右边的key都有,才会出现在结果里

    
    pd.merge(left, right, how='inner')
    
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    key A B C D
    0 K0 A0 B0 C0 D0
    1 K1 A1 B1 C1 D1

    3.2 left join

    左边的都会出现在结果里,右边的如果无法匹配则为Null

    
    pd.merge(left, right, how='left')
    
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    key A B C D
    0 K0 A0 B0 C0 D0
    1 K1 A1 B1 C1 D1
    2 K2 A2 B2 NaN NaN
    3 K3 A3 B3 NaN NaN

    3.3 right join

    右边的都会出现在结果里,左边的如果无法匹配则为Null

    pd.merge(left, right, how='right')
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    key A B C D
    0 K0 A0 B0 C0 D0
    1 K1 A1 B1 C1 D1
    2 K4 NaN NaN C4 D4
    3 K5 NaN NaN C5 D5

    3.4 outer join

    左边、右边的都会出现在结果里,如果无法匹配则为Null

    
    pd.merge(left, right, how='outer')
    
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    key A B C D
    0 K0 A0 B0 C0 D0
    1 K1 A1 B1 C1 D1
    2 K2 A2 B2 NaN NaN
    3 K3 A3 B3 NaN NaN
    4 K4 NaN NaN C4 D4
    5 K5 NaN NaN C5 D5

    4、如果出现非Key的字段重名怎么办

    left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'A': ['A0', 'A1', 'A2', 'A3'],
                          'B': ['B0', 'B1', 'B2', 'B3']})
    
    right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],
                          'A': ['A10', 'A11', 'A12', 'A13'],
                          'D': ['D0', 'D1', 'D4', 'D5']})
    
    left
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    key A B
    0 K0 A0 B0
    1 K1 A1 B1
    2 K2 A2 B2
    3 K3 A3 B3
    right
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    key A D
    0 K0 A10 D0
    1 K1 A11 D1
    2 K4 A12 D4
    3 K5 A13 D5
    pd.merge(left, right, on='key')
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    key A_x B A_y D
    0 K0 A0 B0 A10 D0
    1 K1 A1 B1 A11 D1
    
    pd.merge(left, right, on='key', suffixes=('_left', '_right'))
    
    
    .dataframe tbody tr th:only-of-type { vertical-align: middle; } <pre><code>.dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </code></pre>
    key A_left B A_right D
    0 K0 A0 B0 A10 D0
    1 K1 A1 B1 A11 D1

    本文使用 文章同步助手 同步

    相关文章

      网友评论

          本文标题:13 Pandas怎样实现DataFrame的Merge

          本文链接:https://www.haomeiwen.com/subject/ddhxtdtx.html