Pandas - 4. 组合 concat 和 merge

作者: 陈天睡懒觉 | 来源:发表于2022-05-21 13:45 被阅读0次

Pandas - 4. 组合 concat 和 merge
pandas 组合dataframe汇总
pandas拼接方法
RAC组合
pandas表连接
pandas join连接
Pandas | 详解数据的合并和拼接
【转载】python merge、concat合并数据集
concat，merge，join和append的区别
pandas(merge,concat,append)

concat()函数

当相同类型的表被分成好多个部分，需要将各部分按行或按列连接

按行连接

concat，可以连接多个
append，只追加单个

import pandas as pd
df1 =  pd.read_csv('data/concat_1.csv')
df2 =  pd.read_csv('data/concat_2.csv')
df3 =  pd.read_csv('data/concat_3.csv')
print(df1)
print(df2)
print(df3)

    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
    A   B   C   D
0  a4  b4  c4  d4
1  a5  b5  c5  d5
2  a6  b6  c6  d6
3  a7  b7  c7  d7
     A    B    C    D
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11

concat

row_concat = pd.concat([df1, df2, df3])
print(row_concat)

     A    B    C    D
0   a0   b0   c0   d0
1   a1   b1   c1   d1
2   a2   b2   c2   d2
3   a3   b3   c3   d3
0   a4   b4   c4   d4
1   a5   b5   c5   d5
2   a6   b6   c6   d6
3   a7   b7   c7   d7
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11

append

# 添加df
append_df = df1.append(df2)
print(append_df) # 不改变原数据
print(append_sample)

    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
0  a4  b4  c4  d4
1  a5  b5  c5  d5
2  a6  b6  c6  d6
3  a7  b7  c7  d7
    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
4  n1  n2  n3  n4

# 添加字典(必须加上ignore_index=True，否则报错)
add_dict = {'A':'n1','B':'n2','C':'n3','D':'n4'}
append_dict = df1.append(add_dict, ignore_index=True)
print(append_dict)

    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
4  n1  n2  n3  n4

索引问题

row_concat = pd.concat([df1, df2, df3],ignore_index=True)
print(row_concat) # 索引不重复

      A    B    C    D
0    a0   b0   c0   d0
1    a1   b1   c1   d1
2    a2   b2   c2   d2
3    a3   b3   c3   d3
4    a4   b4   c4   d4
5    a5   b5   c5   d5
6    a6   b6   c6   d6
7    a7   b7   c7   d7
8    a8   b8   c8   d8
9    a9   b9   c9   d9
10  a10  b10  c10  d10
11  a11  b11  c11  d11

按列连接

concat函数的axis参数设为1,会造成列名重复

col_concat = pd.concat([df1, df2, df3], axis=1)
print(col_concat)

    A   B   C   D   A   B   C   D    A    B    C    D
0  a0  b0  c0  d0  a4  b4  c4  d4   a8   b8   c8   d8
1  a1  b1  c1  d1  a5  b5  c5  d5   a9   b9   c9   d9
2  a2  b2  c2  d2  a6  b6  c6  d6  a10  b10  c10  d10
3  a3  b3  c3  d3  a7  b7  c7  d7  a11  b11  c11  d11

连接具有不同列索引或行索引的表

连接不同列的行

# 修改列名
df1.columns =  ['A','B','C','D']
df2.columns =  ['E','F','G','H']
df3.columns =  ['A','C','F','H']
print(df1)
print(df2)
print(df3)

    A   B   C   D
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
    E   F   G   H
0  a4  b4  c4  d4
1  a5  b5  c5  d5
2  a6  b6  c6  d6
3  a7  b7  c7  d7
     A    C    F    H
0   a8   b8   c8   d8
1   a9   b9   c9   d9
2  a10  b10  c10  d10
3  a11  b11  c11  d11

直接用concat会产生很多NaN

import warnings
warnings.filterwarnings('ignore') 
print(pd.concat([df1, df2, df3]))

     A    B    C    D    E    F    G    H
0   a0   b0   c0   d0  NaN  NaN  NaN  NaN
1   a1   b1   c1   d1  NaN  NaN  NaN  NaN
2   a2   b2   c2   d2  NaN  NaN  NaN  NaN
3   a3   b3   c3   d3  NaN  NaN  NaN  NaN
0  NaN  NaN  NaN  NaN   a4   b4   c4   d4
1  NaN  NaN  NaN  NaN   a5   b5   c5   d5
2  NaN  NaN  NaN  NaN   a6   b6   c6   d6
3  NaN  NaN  NaN  NaN   a7   b7   c7   d7
0   a8  NaN   b8  NaN  NaN   c8  NaN   d8
1   a9  NaN   b9  NaN  NaN   c9  NaN   d9
2  a10  NaN  b10  NaN  NaN  c10  NaN  d10
3  a11  NaN  b11  NaN  NaN  c11  NaN  d11

只连接共有的列，concat函数指定参数join,join默认为outer，即保留所有列

print(pd.concat([df1, df3], join='inner'))# 如果连接df1,ff1,df3则返回空的DataFrame，因为没有三者公共的列

     A    C
0   a0   c0
1   a1   c1
2   a2   c2
3   a3   c3
0   a8   b8
1   a9   b9
2  a10  b10
3  a11  b11

连接不同行的列

# 修改行索引
df1.index =  [0, 1, 2, 3]
df2.index =  [4, 5, 6, 7]
df3.index =  [0, 2, 5, 7]
print(df1)
print(df2)
print(df3)

    0   1   2   3
0  a0  b0  c0  d0
1  a1  b1  c1  d1
2  a2  b2  c2  d2
3  a3  b3  c3  d3
    4   5   6   7
4  a4  b4  c4  d4
5  a5  b5  c5  d5
6  a6  b6  c6  d6
7  a7  b7  c7  d7
     0    2    5    7
0   a8   b8   c8   d8
2   a9   b9   c9   d9
5  a10  b10  c10  d10
7  a11  b11  c11  d11

将concat的axis参数改为1，结果与连接不同列的行类似，生成很多NaN，修改join参数保留公共的行

print(pd.concat([df1, df3], axis=1, join='inner'))

    0   1   2   3   0   2   5   7
0  a0  b0  c0  d0  a8  b8  c8  d8
2  a2  b2  c2  d2  a9  b9  c9  d9

merge()函数

利用公共的数据值把两张包含不同信息的表连接

join 函数
merge 函数(推荐)：第一个参数是“右DataFrame”(merge左边的用left,merge右边的用right)

读取的四个表中包含不同信息

person = pd.read_csv('data/survey_person.csv')
site = pd.read_csv('data/survey_site.csv')
survey = pd.read_csv('data/survey_survey.csv')
visited = pd.read_csv('data/survey_visited.csv')

print(person.head())

      ident   personal    family
0      dyer    William      Dyer
1        pb      Frank   Pabodie
2      lake   Anderson      Lake
3       roe  Valentina   Roerich
4  danforth      Frank  Danforth

print(site.head())

    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40

print(survey.head())

   taken person quant  reading
0    619   dyer   rad     9.82
1    619   dyer   sal     0.13
2    622   dyer   rad     7.80
3    622   dyer   sal     0.09
4    734     pb   rad     8.41

print(visited.head())

   ident  site       dated
0    619  DR-1  1927-02-08
1    622  DR-1  1927-02-10
2    734  DR-3  1939-01-07
3    735  DR-3  1930-01-12
4    751  DR-3  1930-02-26

合并visited子集和site的信息(一对一)

# 由于两个表共同的数据值的列名不同，需要用left_on,right_on指定，merge的一个参数的表称为右表，被调用的表示左表
visited_subset = visited.iloc[[0,2,4],]
o2o_merge = site.merge(visited_subset, left_on='name', right_on='site')
print(o2o_merge)

   name    lat    long  ident  site       dated
0  DR-1 -49.85 -128.57    619  DR-1  1927-02-08
1  DR-3 -47.15 -126.72    734  DR-3  1939-01-07
2  DR-3 -47.15 -126.72    751  DR-3  1930-02-26

合并visited和site的信息(多对一)

# 由于两个表共同的数据值的列名不同，需要用left_on,right_on指定，merge的一个参数的表称为右表，被调用的表示左表
o2o_merge = site.merge(visited, left_on='name', right_on='site')
print(o2o_merge)

    name    lat    long  ident   site       dated
0   DR-1 -49.85 -128.57    619   DR-1  1927-02-08
1   DR-1 -49.85 -128.57    622   DR-1  1927-02-10
2   DR-1 -49.85 -128.57    844   DR-1  1932-03-22
3   DR-3 -47.15 -126.72    734   DR-3  1939-01-07
4   DR-3 -47.15 -126.72    735   DR-3  1930-01-12
5   DR-3 -47.15 -126.72    751   DR-3  1930-02-26
6   DR-3 -47.15 -126.72    752   DR-3         NaN
7  MSK-4 -48.87 -123.40    837  MSK-4  1932-01-14

合并(person+survey)表和(visited+survey)表的信息(多对多)，由于两个表包含的信息太多，需要多个列表组合进行匹配

ps = person.merge(survey, left_on='ident', right_on='person')
vs = visited.merge(survey, left_on='ident', right_on='taken')

print(ps)

   ident   personal   family  taken person quant  reading
0   dyer    William     Dyer    619   dyer   rad     9.82
1   dyer    William     Dyer    619   dyer   sal     0.13
2   dyer    William     Dyer    622   dyer   rad     7.80
3   dyer    William     Dyer    622   dyer   sal     0.09
4     pb      Frank  Pabodie    734     pb   rad     8.41
5     pb      Frank  Pabodie    734     pb  temp   -21.50
6     pb      Frank  Pabodie    735     pb   rad     7.22
7     pb      Frank  Pabodie    751     pb   rad     4.35
8     pb      Frank  Pabodie    751     pb  temp   -18.50
9   lake   Anderson     Lake    734   lake   sal     0.05
10  lake   Anderson     Lake    751   lake   sal     0.10
11  lake   Anderson     Lake    752   lake   rad     2.19
12  lake   Anderson     Lake    752   lake   sal     0.09
13  lake   Anderson     Lake    752   lake  temp   -16.00
14  lake   Anderson     Lake    837   lake   rad     1.46
15  lake   Anderson     Lake    837   lake   sal     0.21
16   roe  Valentina  Roerich    752    roe   sal    41.60
17   roe  Valentina  Roerich    837    roe   sal    22.50
18   roe  Valentina  Roerich    844    roe   rad    11.25

print(vs)

    ident   site       dated  taken person quant  reading
0     619   DR-1  1927-02-08    619   dyer   rad     9.82
1     619   DR-1  1927-02-08    619   dyer   sal     0.13
2     622   DR-1  1927-02-10    622   dyer   rad     7.80
3     622   DR-1  1927-02-10    622   dyer   sal     0.09
4     734   DR-3  1939-01-07    734     pb   rad     8.41
5     734   DR-3  1939-01-07    734   lake   sal     0.05
6     734   DR-3  1939-01-07    734     pb  temp   -21.50
7     735   DR-3  1930-01-12    735     pb   rad     7.22
8     735   DR-3  1930-01-12    735    NaN   sal     0.06
9     735   DR-3  1930-01-12    735    NaN  temp   -26.00
10    751   DR-3  1930-02-26    751     pb   rad     4.35
11    751   DR-3  1930-02-26    751     pb  temp   -18.50
12    751   DR-3  1930-02-26    751   lake   sal     0.10
13    752   DR-3         NaN    752   lake   rad     2.19
14    752   DR-3         NaN    752   lake   sal     0.09
15    752   DR-3         NaN    752   lake  temp   -16.00
16    752   DR-3         NaN    752    roe   sal    41.60
17    837  MSK-4  1932-01-14    837   lake   rad     1.46
18    837  MSK-4  1932-01-14    837   lake   sal     0.21
19    837  MSK-4  1932-01-14    837    roe   sal    22.50
20    844   DR-1  1932-03-22    844    roe   rad    11.25

ps_vs = ps.merge(vs,
                left_on=['ident', 'taken', 'quant', 'reading'],
                right_on=['person','ident', 'quant', 'reading'])
print(ps_vs.head())
# 如果两个表的列名有冲突，会增加后缀，_x表示来自左表，_y表示来自右表

  ident_x personal   family  taken_x person_x quant  reading  ident_y  site  \
0    dyer  William     Dyer      619     dyer   rad     9.82      619  DR-1   
1    dyer  William     Dyer      619     dyer   sal     0.13      619  DR-1   
2    dyer  William     Dyer      622     dyer   rad     7.80      622  DR-1   
3    dyer  William     Dyer      622     dyer   sal     0.09      622  DR-1   
4      pb    Frank  Pabodie      734       pb   rad     8.41      734  DR-3   

        dated  taken_y person_y  
0  1927-02-08      619     dyer  
1  1927-02-08      619     dyer  
2  1927-02-10      622     dyer  
3  1927-02-10      622     dyer  
4  1939-01-07      734       pb

Pandas - 4. 组合 concat 和 merge
concat()函数当相同类型的表被分成好多个部分，需要将各部分按行或按列连接按行连接 concat，可以连接...
pandas 组合dataframe汇总
pandas concat, merge, join pandas concat 参数解释pd.concat(ob...
pandas拼接方法
Pandas拼接操作（concat，merge，join和append）的区别 append, merge, co...
RAC组合
concat then merge zipWith 组合和聚合
pandas表连接
Pandas Dataframe有三种连接方法，分别是merge，join，concat。 merge merge...
pandas join连接
pandas join 连接(merge,join,concat)：https://blog.csdn.net/z...
Pandas | 详解数据的合并和拼接
本文介绍了利用pandas包的merge、join和concat方法来完成数据的合并和拼接，merge方法主要是基...
【转载】python merge、concat合并数据集
本文为转载，自python merge、concat合并数据集数据规整化：合并、清理、过滤 pandas和pyt...
concat，merge，join和append的区别
Pandas提供了concat，merge，join和append四种方法用于dataframe的拼接，其区别如下...
pandas(merge,concat,append)
Pandas操作数据集非常的方便，其中体现在就是有些在SQL语句中常用的方法，比如在合并数据集、left join...

Pandas - 4. 组合 concat 和 merge

concat()函数

当相同类型的表被分成好多个部分，需要将各部分按行或按列连接

按行连接

按列连接

连接具有不同列索引或行索引的表

连接不同列的行

连接不同行的列

merge()函数

利用公共的数据值把两张包含不同信息的表连接

相关文章

Pandas - 4. 组合 concat 和 merge

pandas 组合dataframe汇总

pandas拼接方法

RAC组合

pandas表连接

pandas join连接

Pandas | 详解数据的合并和拼接

【转载】python merge、concat合并数据集

concat，merge，join和append的区别

pandas(merge,concat,append)

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读