美文网首页
Python 数据处理(三十二)—— 合并连接之 merge

Python 数据处理(三十二)—— 合并连接之 merge

作者: 名本无名 | 来源:发表于2021-03-09 20:19 被阅读0次

    数据库风格的 DataFrame 或命名 Series 的合并

    pandas 具有功能齐全且高性能的连接操作,与 SQL 关系型数据库类似

    这些方法的性能明显优于其他开源的工具,在某些情况下可能远远超过一个数量级。如 R 语言中的 base::merge.data.frame

    其优良的性能源自精心设计的算法和 DataFrame 中数据的内部布局

    熟悉 SQL 但不熟悉 pandas 的用户可能会对其与 SQL 的比较感兴趣

    pandasmerge() 函数,提供了 DataFrame 或命名 Series 对象之间的所有标准数据库连接操作

    参数列表

    pd.merge(
        left,
        right,
        how="inner",
        on=None,
        left_on=None,
        right_on=None,
        left_index=False,
        right_index=False,
        sort=True,
        suffixes=("_x", "_y"),
        copy=True,
        indicator=False,
        validate=None,
    )
    

    注意

    0.23.0 开始,on, left_onright_on 参数支持指定索引的级别,从 0.24.0 开始支持对命名 Series 的合并

    mergepandas 的顶层方法,但是也可以作为 DataFrame 对象的实例方法,调用的 DataFrame 对象被隐式的视为连接的左侧对象

    相关的 join() 方法内部的实现是基于 merge 方法,主要用于索引和索引,列和索引的连接。

    如果只是想通过索引来连接,可以考虑使用 join() 减少输入

    1 merge 函数简要入门(关系代数)

    我们可以将 DataFrame 看作是 SQL 的表,而熟悉 SQL 关系型数据框的人应该对下面的术语很熟悉

    • one-to-one(一对一)
    • many-to-one(多对一)
    • many-to-many(多对多)

    注意:在进行列与列的合并时,用于连接的 DataFrame 对象上的索引都会被丢弃

    SQL 或标准关系代数中,如果一个连接键在两张表中出现一次以上,那么生成的表将具有相关数据的笛卡尔积。

    下面这个例子中,连接的键是唯一的

    In [39]: left = pd.DataFrame(
       ....:     {
       ....:         "key": ["K0", "K1", "K2", "K3"],
       ....:         "A": ["A0", "A1", "A2", "A3"],
       ....:         "B": ["B0", "B1", "B2", "B3"],
       ....:     }
       ....: )
       ....: 
    
    In [40]: right = pd.DataFrame(
       ....:     {
       ....:         "key": ["K0", "K1", "K2", "K3"],
       ....:         "C": ["C0", "C1", "C2", "C3"],
       ....:         "D": ["D0", "D1", "D2", "D3"],
       ....:     }
       ....: )
       ....: 
    
    In [41]: result = pd.merge(left, right, on="key")
    

    下面是一个更加复杂的例子,具有多个连接的键。默认的连接方式是 how='inner',即指定的键的数据要同时出现在 leftright 对象中

    In [42]: left = pd.DataFrame(
       ....:     {
       ....:         "key1": ["K0", "K0", "K1", "K2"],
       ....:         "key2": ["K0", "K1", "K0", "K1"],
       ....:         "A": ["A0", "A1", "A2", "A3"],
       ....:         "B": ["B0", "B1", "B2", "B3"],
       ....:     }
       ....: )
       ....: 
    
    In [43]: right = pd.DataFrame(
       ....:     {
       ....:         "key1": ["K0", "K1", "K1", "K2"],
       ....:         "key2": ["K0", "K0", "K0", "K0"],
       ....:         "C": ["C0", "C1", "C2", "C3"],
       ....:         "D": ["D0", "D1", "D2", "D3"],
       ....:     }
       ....: )
       ....: 
    
    In [44]: result = pd.merge(left, right, on=["key1", "key2"])
    

    mergehow 参数决定了什么样的键应该包含在结果中。以下是连接的方式以及对应的 SQL 操作

    In [45]: result = pd.merge(left, right, how="left", on=["key1", "key2"])
    
    In [46]: result = pd.merge(left, right, how="right", on=["key1", "key2"])
    
    In [47]: result = pd.merge(left, right, how="outer", on=["key1", "key2"])
    
    In [48]: result = pd.merge(left, right, how="inner", on=["key1", "key2"])
    

    如果 MultiIndex 的级别名称与 DataFrame 中的列名相对应,则可以合并一个 MultiIndexSeriesDataFrame

    在合并之前,可以使用 Series.reset_index()Series 转换为 DataFrame

    In [49]: df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})
    
    In [50]: df
    Out[50]: 
      Let  Num
    0   A    1
    1   B    2
    2   C    3
    
    In [51]: ser = pd.Series(
       ....:     ["a", "b", "c", "d", "e", "f"],
       ....:     index=pd.MultiIndex.from_arrays(
       ....:         [["A", "B", "C"] * 2, [1, 2, 3, 4, 5, 6]], names=["Let", "Num"]
       ....:     ),
       ....: )
       ....: 
    
    In [52]: ser
    Out[52]: 
    Let  Num
    A    1      a
    B    2      b
    C    3      c
    A    4      d
    B    5      e
    C    6      f
    dtype: object
    
    In [53]: pd.merge(df, ser.reset_index(), on=["Let", "Num"])
    Out[53]: 
      Let  Num  0
    0   A    1  a
    1   B    2  b
    2   C    3  c
    

    下面是另一个在 DataFrame 中使用重复连接键的例子

    In [54]: left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})
    
    In [55]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
    
    In [56]: result = pd.merge(left, right, on="B", how="outer")
    

    2 检查重复的键

    用户可以使用 validate 参数自动检查其合并的键中是否有意外的重复项

    在合并操作之前检查键的唯一性,可以防止内存溢出,也是确保用户数据结构符合预期的一种好方法

    在下面的示例中,rightB 列存在重复值。因此不是一对一的合并,validate 参数中所指定的方式将引发异常

    In [57]: left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
    
    In [58]: right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
    
    
    >>> result = pd.merge(left, right, on="B", how="outer", validate="one_to_one")
    ...
    MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
    

    如果用户知道 right 中存在重复值,但又想确保 left 中没有重复,可以使用 validate='one_to_many' 参数来代替,这样就不会引发异常

    In [59]: pd.merge(left, right, on="B", how="outer", validate="one_to_many")
    Out[59]: 
       A_x  B  A_y
    0    1  1  NaN
    1    2  2  4.0
    2    2  2  5.0
    3    2  2  6.0
    

    3 merge 指示器

    merge 接受一个 indicator 参数,如果为 True,则会在输出对象中添加一个名为 _merge 的分类型列

    In [60]: df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})
    
    In [61]: df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})
    
    In [62]: pd.merge(df1, df2, on="col1", how="outer", indicator=True)
    Out[62]: 
       col1 col_left  col_right      _merge
    0     0        a        NaN   left_only
    1     1        b        2.0        both
    2     2      NaN        2.0  right_only
    3     2      NaN        2.0  right_only
    

    indicator 参数也接受字符串,在这种情况下,将使用传递的字符串的值作为 indicator 列的名称

    In [63]: pd.merge(df1, df2, on="col1", how="outer", indicator="indicator_column")
    Out[63]: 
       col1 col_left  col_right indicator_column
    0     0        a        NaN        left_only
    1     1        b        2.0             both
    2     2      NaN        2.0       right_only
    3     2      NaN        2.0       right_only
    

    4 merge 数据类型

    对于下面的 DataFrame 对象

    In [64]: left = pd.DataFrame({"key": [1], "v1": [10]})
    
    In [65]: left
    Out[65]: 
       key  v1
    0    1  10
    
    In [66]: right = pd.DataFrame({"key": [1, 2], "v1": [20, 30]})
    
    In [67]: right
    Out[67]: 
       key  v1
    0    1  20
    1    2  30
    

    合并操作会保留连接的键的数据类型

    In [68]: pd.merge(left, right, how="outer")
    Out[68]: 
       key  v1
    0    1  10
    1    1  20
    2    2  30
    
    In [69]: pd.merge(left, right, how="outer").dtypes
    Out[69]: 
    key    int64
    v1     int64
    dtype: object
    

    当然,如果引入了缺失值,则数据类型会发生向上转换

    In [70]: pd.merge(left, right, how="outer", on="key")
    Out[70]: 
       key  v1_x  v1_y
    0    1  10.0    20
    1    2   NaN    30
    
    In [71]: pd.merge(left, right, how="outer", on="key").dtypes
    Out[71]: 
    key       int64
    v1_x    float64
    v1_y      int64
    dtype: object
    

    合并操作也能够保留 category 类型。例如,有如下的 left 对象

    In [72]: from pandas.api.types import CategoricalDtype
    
    In [73]: X = pd.Series(np.random.choice(["foo", "bar"], size=(10,)))
    
    In [74]: X = X.astype(CategoricalDtype(categories=["foo", "bar"]))
    
    In [75]: left = pd.DataFrame(
       ....:     {"X": X, "Y": np.random.choice(["one", "two", "three"], size=(10,))}
       ....: )
       ....: 
    
    In [76]: left
    Out[76]: 
         X      Y
    0  bar    one
    1  foo    one
    2  foo  three
    3  bar  three
    4  foo    one
    5  bar    one
    6  bar  three
    7  bar  three
    8  bar  three
    9  foo  three
    
    In [77]: left.dtypes
    Out[77]: 
    X    category
    Y      object
    dtype: object
    

    right 对象

    In [78]: right = pd.DataFrame(
       ....:     {
       ....:         "X": pd.Series(["foo", "bar"], dtype=CategoricalDtype(["foo", "bar"])),
       ....:         "Z": [1, 2],
       ....:     }
       ....: )
       ....: 
    
    In [79]: right
    Out[79]: 
         X  Z
    0  foo  1
    1  bar  2
    
    In [80]: right.dtypes
    Out[80]: 
    X    category
    Z       int64
    dtype: object
    

    合并的结果为

    In [81]: result = pd.merge(left, right, how="outer")
    
    In [82]: result
    Out[82]: 
         X      Y  Z
    0  bar    one  2
    1  bar  three  2
    2  bar    one  2
    3  bar  three  2
    4  bar  three  2
    5  bar  three  2
    6  foo    one  1
    7  foo  three  1
    8  foo    one  1
    9  foo  three  1
    
    In [83]: result.dtypes
    Out[83]: 
    X    category
    Y      object
    Z       int64
    dtype: object
    

    5 join 连接

    DataFrame.join() 是一个简便的方法,能够通过索引连接两个 DataFrame 对象

    例如

    In [84]: left = pd.DataFrame(
       ....:     {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
       ....: )
       ....: 
    
    In [85]: right = pd.DataFrame(
       ....:     {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
       ....: )
       ....: 
    
    In [86]: result = left.join(right)
    
    In [87]: result = left.join(right, how="outer")
    
    In [88]: result = left.join(right, how="inner")
    

    当然,也可以使用 merge 实现相同的功能,但是需要写更多的代码

    In [89]: result = pd.merge(left, right, left_index=True, right_index=True, how="outer")
    
    In [90]: result = pd.merge(left, right, left_index=True, right_index=True, how="inner")
    

    6 通过列和索引连接

    jion() 接受一个 on 参数,用于指定该对象中用于连接的列名或列名列表与传入的 DataFrame 的索引进行连接

    下面两个函数的功能是一样的

    left.join(right, on=key_or_keys)
    pd.merge(
        left, right, left_on=key_or_keys, right_index=True, how="left", sort=False
    )
    

    显然,你可以选择自己认为的更简便的方式,对于多对一的连接,使用 join() 可能会更加方便

    例如

    In [91]: left = pd.DataFrame(
       ....:     {
       ....:         "A": ["A0", "A1", "A2", "A3"],
       ....:         "B": ["B0", "B1", "B2", "B3"],
       ....:         "key": ["K0", "K1", "K0", "K1"],
       ....:     }
       ....: )
       ....: 
    
    In [92]: right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])
    
    In [93]: result = left.join(right, on="key")
    
    In [94]: result = pd.merge(
       ....:     left, right, left_on="key", right_index=True, how="left", sort=False
       ....: )
       ....: 
    

    如果要连接多个键,传递的 DataFrame 必须有一个 MultiIndex

    In [95]: left = pd.DataFrame(
       ....:     {
       ....:         "A": ["A0", "A1", "A2", "A3"],
       ....:         "B": ["B0", "B1", "B2", "B3"],
       ....:         "key1": ["K0", "K0", "K1", "K2"],
       ....:         "key2": ["K0", "K1", "K0", "K1"],
       ....:     }
       ....: )
       ....: 
    
    In [96]: index = pd.MultiIndex.from_tuples(
       ....:     [("K0", "K0"), ("K1", "K0"), ("K2", "K0"), ("K2", "K1")]
       ....: )
       ....: 
    
    In [97]: right = pd.DataFrame(
       ....:     {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=index
       ....: )
       ....: 
    

    现在可以通过传递两个列名来连接

    In [98]: result = left.join(right, on=["key1", "key2"])
    

    DataFrame.join 默认执行的是 left join,想要实现其他连接也很方便

    例如,inner join

    In [99]: result = left.join(right, on=["key1", "key2"], how="inner")
    

    7 单索引和层次索引的连接

    可以将单索引与层次索引的某一级别进行连接

    In [100]: left = pd.DataFrame(
       .....:     {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]},
       .....:     index=pd.Index(["K0", "K1", "K2"], name="key"),
       .....: )
       .....: 
    
    In [101]: index = pd.MultiIndex.from_tuples(
       .....:     [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")],
       .....:     names=["key", "Y"],
       .....: )
       .....: 
    
    In [102]: right = pd.DataFrame(
       .....:     {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]},
       .....:     index=index,
       .....: )
       .....: 
    
    In [103]: result = left.join(right, how="inner")
    

    下面的代码与上面是等价的,但是代码更多

    In [104]: result = pd.merge(
       .....:     left.reset_index(), right.reset_index(), on=["key"], how="inner"
       .....: ).set_index(["key","Y"])
       .....: 
    

    8 两个 MultiIndex 的连接

    使用该方式是有限制的,right 对象的索引必须是 left 对象索引的子集

    In [105]: leftindex = pd.MultiIndex.from_product(
       .....:     [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
       .....: )
       .....: 
    
    In [106]: left = pd.DataFrame({"v1": range(12)}, index=leftindex)
    
    In [107]: left
    Out[107]: 
                v1
    abc xy num    
    a   x  1     0
           2     1
        y  1     2
           2     3
    b   x  1     4
           2     5
        y  1     6
           2     7
    c   x  1     8
           2     9
        y  1    10
           2    11
    
    In [108]: rightindex = pd.MultiIndex.from_product(
       .....:     [list("abc"), list("xy")], names=["abc", "xy"]
       .....: )
       .....: 
    
    In [109]: right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)
    
    In [110]: right
    Out[110]: 
             v2
    abc xy     
    a   x   100
        y   200
    b   x   300
        y   400
    c   x   500
        y   600
    
    In [111]: left.join(right, on=["abc", "xy"], how="inner")
    Out[111]: 
                v1   v2
    abc xy num         
    a   x  1     0  100
           2     1  100
        y  1     2  200
           2     3  200
    b   x  1     4  300
           2     5  300
        y  1     6  400
           2     7  400
    c   x  1     8  500
           2     9  500
        y  1    10  600
           2    11  600
    

    如果不满足该条件,则可以使用下面的代码连接两个具有层次索引的 DataFrame

    In [112]: leftindex = pd.MultiIndex.from_tuples(
       .....:     [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"]
       .....: )
       .....: 
    
    In [113]: left = pd.DataFrame(
       .....:     {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=leftindex
       .....: )
       .....: 
    
    In [114]: rightindex = pd.MultiIndex.from_tuples(
       .....:     [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")], names=["key", "Y"]
       .....: )
       .....: 
    
    In [115]: right = pd.DataFrame(
       .....:     {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=rightindex
       .....: )
       .....: 
    
    In [116]: result = pd.merge(
       .....:     left.reset_index(), right.reset_index(), on=["key"], how="inner"
       .....: ).set_index(["key", "X", "Y"])
       .....: 
    

    9 列名与索引的组合进行连接

    onleft_onright_on 参数传递的字符串可以是列名或索引级别名称,这就可以在不重置索引的情况下,对索引级别和列名进行组合,合并两个 DataFrame 实例对象

    In [117]: left_index = pd.Index(["K0", "K0", "K1", "K2"], name="key1")
    
    In [118]: left = pd.DataFrame(
       .....:     {
       .....:         "A": ["A0", "A1", "A2", "A3"],
       .....:         "B": ["B0", "B1", "B2", "B3"],
       .....:         "key2": ["K0", "K1", "K0", "K1"],
       .....:     },
       .....:     index=left_index,
       .....: )
       .....: 
    
    In [119]: right_index = pd.Index(["K0", "K1", "K2", "K2"], name="key1")
    
    In [120]: right = pd.DataFrame(
       .....:     {
       .....:         "C": ["C0", "C1", "C2", "C3"],
       .....:         "D": ["D0", "D1", "D2", "D3"],
       .....:         "key2": ["K0", "K0", "K0", "K1"],
       .....:     },
       .....:     index=right_index,
       .....: )
       .....: 
    
    In [121]: result = left.merge(right, on=["key1", "key2"])
    

    注意:如果用于连接的字符串名称既匹配了索引级别名,又匹配了列名,将会引发一个警告,并有效使用列名

    10 交叠的列

    mergesuffix 参数接受一个字符串元组列表,作为合并结果中相同列名的后缀,以便于区别列来自哪个 DataFrame

    In [122]: left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})
    
    In [123]: right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})
    
    In [124]: result = pd.merge(left, right, on="k")
    
    In [125]: result = pd.merge(left, right, on="k", suffixes=("_l", "_r"))
    

    DataFrame.join() 中的 lsuffixrsuffix 参数也可以达到一样的效果

    In [126]: left = left.set_index("k")
    
    In [127]: right = right.set_index("k")
    
    In [128]: result = left.join(right, lsuffix="_l", rsuffix="_r")
    

    11 连接多个 DataFrame

    也可以将 DataFrame 列表或元组传递给 join(),将会根据它们的索引连接在一起

    In [129]: right2 = pd.DataFrame({"v": [7, 8, 9]}, index=["K1", "K1", "K2"])
    
    In [130]: result = left.join([right, right2])
    

    12 合并相似的 Series 或 DataFrame

    另一种常见的情况是,有两个具有类似索引的 SeriesDataFrame 对象,并希望用一个对象的值来修复另一个对象中对应位置的值

    例如

    In [131]: df1 = pd.DataFrame(
       .....:     [[np.nan, 3.0, 5.0], [-4.6, np.nan, np.nan], [np.nan, 7.0, np.nan]]
       .....: )
       .....: 
    
    In [132]: df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])
    

    使用 combine_first() 来实现修复

    In [133]: result = df1.combine_first(df2)
    

    注意:这个方法只有当 left 对象中存在缺失值,才会用 right 对象中对应位置的值来替换

    而相关的 update() 方法,会在原地修改并替换非缺失值

    In [134]: df1.update(df2)
    

    相关文章

      网友评论

          本文标题:Python 数据处理(三十二)—— 合并连接之 merge

          本文链接:https://www.haomeiwen.com/subject/wnbbqltx.html