美文网首页
Pandas - 4. 组合 concat 和 merge

Pandas - 4. 组合 concat 和 merge

作者: 陈天睡懒觉 | 来源:发表于2022-05-21 13:45 被阅读0次

    concat()函数

    当相同类型的表被分成好多个部分,需要将各部分按行或按列连接

    按行连接

    • concat,可以连接多个
    • append,只追加单个
    import pandas as pd
    df1 =  pd.read_csv('data/concat_1.csv')
    df2 =  pd.read_csv('data/concat_2.csv')
    df3 =  pd.read_csv('data/concat_3.csv')
    print(df1)
    print(df2)
    print(df3)
    
        A   B   C   D
    0  a0  b0  c0  d0
    1  a1  b1  c1  d1
    2  a2  b2  c2  d2
    3  a3  b3  c3  d3
        A   B   C   D
    0  a4  b4  c4  d4
    1  a5  b5  c5  d5
    2  a6  b6  c6  d6
    3  a7  b7  c7  d7
         A    B    C    D
    0   a8   b8   c8   d8
    1   a9   b9   c9   d9
    2  a10  b10  c10  d10
    3  a11  b11  c11  d11
    

    concat

    row_concat = pd.concat([df1, df2, df3])
    print(row_concat)
    
         A    B    C    D
    0   a0   b0   c0   d0
    1   a1   b1   c1   d1
    2   a2   b2   c2   d2
    3   a3   b3   c3   d3
    0   a4   b4   c4   d4
    1   a5   b5   c5   d5
    2   a6   b6   c6   d6
    3   a7   b7   c7   d7
    0   a8   b8   c8   d8
    1   a9   b9   c9   d9
    2  a10  b10  c10  d10
    3  a11  b11  c11  d11
    

    append

    # 添加df
    append_df = df1.append(df2)
    print(append_df) # 不改变原数据
    print(append_sample)
    
        A   B   C   D
    0  a0  b0  c0  d0
    1  a1  b1  c1  d1
    2  a2  b2  c2  d2
    3  a3  b3  c3  d3
    0  a4  b4  c4  d4
    1  a5  b5  c5  d5
    2  a6  b6  c6  d6
    3  a7  b7  c7  d7
        A   B   C   D
    0  a0  b0  c0  d0
    1  a1  b1  c1  d1
    2  a2  b2  c2  d2
    3  a3  b3  c3  d3
    4  n1  n2  n3  n4
    
    # 添加字典(必须加上ignore_index=True,否则报错)
    add_dict = {'A':'n1','B':'n2','C':'n3','D':'n4'}
    append_dict = df1.append(add_dict, ignore_index=True)
    print(append_dict)
    
        A   B   C   D
    0  a0  b0  c0  d0
    1  a1  b1  c1  d1
    2  a2  b2  c2  d2
    3  a3  b3  c3  d3
    4  n1  n2  n3  n4
    

    索引问题

    row_concat = pd.concat([df1, df2, df3],ignore_index=True)
    print(row_concat) # 索引不重复
    
          A    B    C    D
    0    a0   b0   c0   d0
    1    a1   b1   c1   d1
    2    a2   b2   c2   d2
    3    a3   b3   c3   d3
    4    a4   b4   c4   d4
    5    a5   b5   c5   d5
    6    a6   b6   c6   d6
    7    a7   b7   c7   d7
    8    a8   b8   c8   d8
    9    a9   b9   c9   d9
    10  a10  b10  c10  d10
    11  a11  b11  c11  d11
    

    按列连接

    concat函数的axis参数设为1,会造成列名重复

    col_concat = pd.concat([df1, df2, df3], axis=1)
    print(col_concat)
    
        A   B   C   D   A   B   C   D    A    B    C    D
    0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8
    1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9
    2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10
    3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11
    

    连接具有不同列索引或行索引的表

    连接不同列的行

    # 修改列名
    df1.columns =  ['A','B','C','D']
    df2.columns =  ['E','F','G','H']
    df3.columns =  ['A','C','F','H']
    print(df1)
    print(df2)
    print(df3)
    
        A   B   C   D
    0  a0  b0  c0  d0
    1  a1  b1  c1  d1
    2  a2  b2  c2  d2
    3  a3  b3  c3  d3
        E   F   G   H
    0  a4  b4  c4  d4
    1  a5  b5  c5  d5
    2  a6  b6  c6  d6
    3  a7  b7  c7  d7
         A    C    F    H
    0   a8   b8   c8   d8
    1   a9   b9   c9   d9
    2  a10  b10  c10  d10
    3  a11  b11  c11  d11
    

    直接用concat会产生很多NaN

    import warnings
    warnings.filterwarnings('ignore') 
    print(pd.concat([df1, df2, df3]))
    
         A    B    C    D    E    F    G    H
    0   a0   b0   c0   d0  NaN  NaN  NaN  NaN
    1   a1   b1   c1   d1  NaN  NaN  NaN  NaN
    2   a2   b2   c2   d2  NaN  NaN  NaN  NaN
    3   a3   b3   c3   d3  NaN  NaN  NaN  NaN
    0  NaN  NaN  NaN  NaN   a4   b4   c4   d4
    1  NaN  NaN  NaN  NaN   a5   b5   c5   d5
    2  NaN  NaN  NaN  NaN   a6   b6   c6   d6
    3  NaN  NaN  NaN  NaN   a7   b7   c7   d7
    0   a8  NaN   b8  NaN  NaN   c8  NaN   d8
    1   a9  NaN   b9  NaN  NaN   c9  NaN   d9
    2  a10  NaN  b10  NaN  NaN  c10  NaN  d10
    3  a11  NaN  b11  NaN  NaN  c11  NaN  d11
    

    只连接共有的列,concat函数指定参数join,join默认为outer,即保留所有列

    print(pd.concat([df1, df3], join='inner'))# 如果连接df1,ff1,df3则返回空的DataFrame,因为没有三者公共的列
    
         A    C
    0   a0   c0
    1   a1   c1
    2   a2   c2
    3   a3   c3
    0   a8   b8
    1   a9   b9
    2  a10  b10
    3  a11  b11
    

    连接不同行的列

    # 修改行索引
    df1.index =  [0, 1, 2, 3]
    df2.index =  [4, 5, 6, 7]
    df3.index =  [0, 2, 5, 7]
    print(df1)
    print(df2)
    print(df3)
    
        0   1   2   3
    0  a0  b0  c0  d0
    1  a1  b1  c1  d1
    2  a2  b2  c2  d2
    3  a3  b3  c3  d3
        4   5   6   7
    4  a4  b4  c4  d4
    5  a5  b5  c5  d5
    6  a6  b6  c6  d6
    7  a7  b7  c7  d7
         0    2    5    7
    0   a8   b8   c8   d8
    2   a9   b9   c9   d9
    5  a10  b10  c10  d10
    7  a11  b11  c11  d11
    

    将concat的axis参数改为1,结果与连接不同列的行类似,生成很多NaN,修改join参数保留公共的行

    print(pd.concat([df1, df3], axis=1, join='inner'))
    
        0   1   2   3   0   2   5   7
    0  a0  b0  c0  d0  a8  b8  c8  d8
    2  a2  b2  c2  d2  a9  b9  c9  d9
    

    merge()函数

    利用公共的数据值把两张包含不同信息的表连接

    • join 函数
    • merge 函数(推荐):第一个参数是“右DataFrame”(merge左边的用left,merge右边的用right)

    读取的四个表中包含不同信息

    person = pd.read_csv('data/survey_person.csv')
    site = pd.read_csv('data/survey_site.csv')
    survey = pd.read_csv('data/survey_survey.csv')
    visited = pd.read_csv('data/survey_visited.csv')
    
    print(person.head())
    
          ident   personal    family
    0      dyer    William      Dyer
    1        pb      Frank   Pabodie
    2      lake   Anderson      Lake
    3       roe  Valentina   Roerich
    4  danforth      Frank  Danforth
    
    print(site.head())
    
        name    lat    long
    0   DR-1 -49.85 -128.57
    1   DR-3 -47.15 -126.72
    2  MSK-4 -48.87 -123.40
    
    print(survey.head())
    
       taken person quant  reading
    0    619   dyer   rad     9.82
    1    619   dyer   sal     0.13
    2    622   dyer   rad     7.80
    3    622   dyer   sal     0.09
    4    734     pb   rad     8.41
    
    print(visited.head())
    
       ident  site       dated
    0    619  DR-1  1927-02-08
    1    622  DR-1  1927-02-10
    2    734  DR-3  1939-01-07
    3    735  DR-3  1930-01-12
    4    751  DR-3  1930-02-26
    

    合并visited子集和site的信息(一对一)

    # 由于两个表共同的数据值的列名不同,需要用left_on,right_on指定,merge的一个参数的表称为右表,被调用的表示左表
    visited_subset = visited.iloc[[0,2,4],]
    o2o_merge = site.merge(visited_subset, left_on='name', right_on='site')
    print(o2o_merge)
    
       name    lat    long  ident  site       dated
    0  DR-1 -49.85 -128.57    619  DR-1  1927-02-08
    1  DR-3 -47.15 -126.72    734  DR-3  1939-01-07
    2  DR-3 -47.15 -126.72    751  DR-3  1930-02-26
    

    合并visited和site的信息(多对一)

    # 由于两个表共同的数据值的列名不同,需要用left_on,right_on指定,merge的一个参数的表称为右表,被调用的表示左表
    o2o_merge = site.merge(visited, left_on='name', right_on='site')
    print(o2o_merge)
    
        name    lat    long  ident   site       dated
    0   DR-1 -49.85 -128.57    619   DR-1  1927-02-08
    1   DR-1 -49.85 -128.57    622   DR-1  1927-02-10
    2   DR-1 -49.85 -128.57    844   DR-1  1932-03-22
    3   DR-3 -47.15 -126.72    734   DR-3  1939-01-07
    4   DR-3 -47.15 -126.72    735   DR-3  1930-01-12
    5   DR-3 -47.15 -126.72    751   DR-3  1930-02-26
    6   DR-3 -47.15 -126.72    752   DR-3         NaN
    7  MSK-4 -48.87 -123.40    837  MSK-4  1932-01-14
    

    合并(person+survey)表和(visited+survey)表的信息(多对多),由于两个表包含的信息太多,需要多个列表组合进行匹配

    ps = person.merge(survey, left_on='ident', right_on='person')
    vs = visited.merge(survey, left_on='ident', right_on='taken')
    
    print(ps)
    
       ident   personal   family  taken person quant  reading
    0   dyer    William     Dyer    619   dyer   rad     9.82
    1   dyer    William     Dyer    619   dyer   sal     0.13
    2   dyer    William     Dyer    622   dyer   rad     7.80
    3   dyer    William     Dyer    622   dyer   sal     0.09
    4     pb      Frank  Pabodie    734     pb   rad     8.41
    5     pb      Frank  Pabodie    734     pb  temp   -21.50
    6     pb      Frank  Pabodie    735     pb   rad     7.22
    7     pb      Frank  Pabodie    751     pb   rad     4.35
    8     pb      Frank  Pabodie    751     pb  temp   -18.50
    9   lake   Anderson     Lake    734   lake   sal     0.05
    10  lake   Anderson     Lake    751   lake   sal     0.10
    11  lake   Anderson     Lake    752   lake   rad     2.19
    12  lake   Anderson     Lake    752   lake   sal     0.09
    13  lake   Anderson     Lake    752   lake  temp   -16.00
    14  lake   Anderson     Lake    837   lake   rad     1.46
    15  lake   Anderson     Lake    837   lake   sal     0.21
    16   roe  Valentina  Roerich    752    roe   sal    41.60
    17   roe  Valentina  Roerich    837    roe   sal    22.50
    18   roe  Valentina  Roerich    844    roe   rad    11.25
    
    print(vs)
    
        ident   site       dated  taken person quant  reading
    0     619   DR-1  1927-02-08    619   dyer   rad     9.82
    1     619   DR-1  1927-02-08    619   dyer   sal     0.13
    2     622   DR-1  1927-02-10    622   dyer   rad     7.80
    3     622   DR-1  1927-02-10    622   dyer   sal     0.09
    4     734   DR-3  1939-01-07    734     pb   rad     8.41
    5     734   DR-3  1939-01-07    734   lake   sal     0.05
    6     734   DR-3  1939-01-07    734     pb  temp   -21.50
    7     735   DR-3  1930-01-12    735     pb   rad     7.22
    8     735   DR-3  1930-01-12    735    NaN   sal     0.06
    9     735   DR-3  1930-01-12    735    NaN  temp   -26.00
    10    751   DR-3  1930-02-26    751     pb   rad     4.35
    11    751   DR-3  1930-02-26    751     pb  temp   -18.50
    12    751   DR-3  1930-02-26    751   lake   sal     0.10
    13    752   DR-3         NaN    752   lake   rad     2.19
    14    752   DR-3         NaN    752   lake   sal     0.09
    15    752   DR-3         NaN    752   lake  temp   -16.00
    16    752   DR-3         NaN    752    roe   sal    41.60
    17    837  MSK-4  1932-01-14    837   lake   rad     1.46
    18    837  MSK-4  1932-01-14    837   lake   sal     0.21
    19    837  MSK-4  1932-01-14    837    roe   sal    22.50
    20    844   DR-1  1932-03-22    844    roe   rad    11.25
    
    ps_vs = ps.merge(vs,
                    left_on=['ident', 'taken', 'quant', 'reading'],
                    right_on=['person','ident', 'quant', 'reading'])
    print(ps_vs.head())
    # 如果两个表的列名有冲突,会增加后缀,_x表示来自左表,_y表示来自右表
    
      ident_x personal   family  taken_x person_x quant  reading  ident_y  site  \
    0    dyer  William     Dyer      619     dyer   rad     9.82      619  DR-1   
    1    dyer  William     Dyer      619     dyer   sal     0.13      619  DR-1   
    2    dyer  William     Dyer      622     dyer   rad     7.80      622  DR-1   
    3    dyer  William     Dyer      622     dyer   sal     0.09      622  DR-1   
    4      pb    Frank  Pabodie      734       pb   rad     8.41      734  DR-3   
    
            dated  taken_y person_y  
    0  1927-02-08      619     dyer  
    1  1927-02-08      619     dyer  
    2  1927-02-10      622     dyer  
    3  1927-02-10      622     dyer  
    4  1939-01-07      734       pb  
    

    相关文章

      网友评论

          本文标题:Pandas - 4. 组合 concat 和 merge

          本文链接:https://www.haomeiwen.com/subject/rqucprtx.html