[Py011] Handling duplicates in pandas

Author: 安哥生个信 | Published: 2018-10-23 12:36 | Views: 530

    Removing duplicates

    DataFrame.drop_duplicates removes duplicate rows. It is covered in many write-ups online,

    for example https://blog.csdn.net/u010665216/article/details/78559091/

    DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
    

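    subset : column label or sequence of labels, optional
    Columns to consider when identifying duplicates; by default all columns are used.
    keep : {‘first’, ‘last’, False}, default ‘first’
    ‘first’ drops duplicates and keeps the first occurrence, ‘last’ keeps the last occurrence, and False drops every row that has a duplicate.
    inplace : boolean, default False
    Whether to modify the original DataFrame directly or return a deduplicated copy.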
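
To make the subset and inplace parameters concrete, here is a minimal sketch. The sample data is hypothetical and not from the original post; it only illustrates that subset accepts a list of labels and that inplace=True changes the DataFrame itself and returns None.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": ["x", "x", "y", "z"],
    "C": [10, 20, 30, 40],
})

# subset accepts a list of labels: a row is a duplicate only if A and B both match.
deduped = df.drop_duplicates(subset=["A", "B"])
print(deduped.index.tolist())  # [0, 2, 3]

# By default a new DataFrame is returned and df is left unchanged.
print(len(df))  # 4

# With inplace=True, df itself is modified and the call returns None.
result = df.drop_duplicates(subset=["A", "B"], inplace=True)
print(result)   # None
print(len(df))  # 3
```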

    Example

    In[18]: df
    Out[18]: 
       A  B
    0  1  a
    1  2  b
    2  3  b
    3  1  c
    In[19]: df.drop_duplicates(subset='A')
    Out[19]: 
       A  B
    0  1  a
    1  2  b
    2  3  b
    In[20]: df.drop_duplicates(subset='A',keep="last")
    Out[20]: 
       A  B
    1  2  b
    2  3  b
    3  1  c
    In[21]: df.drop_duplicates(subset='A',keep=False)
    Out[21]: 
       A  B
    1  2  b
    2  3  b
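
The session above does not show how df was created. The following sketch assumes it was built from a plain dict (an assumption, since the original omits that step) and reproduces the three keep variants as a runnable script. The variable names are illustrative only.

```python
import pandas as pd

# Assumed construction of the df shown in the session above.
df = pd.DataFrame({"A": [1, 2, 3, 1], "B": ["a", "b", "b", "c"]})

first = df.drop_duplicates(subset="A")                # keeps rows 0, 1, 2
last = df.drop_duplicates(subset="A", keep="last")    # keeps rows 1, 2, 3
unique = df.drop_duplicates(subset="A", keep=False)   # keeps rows 1, 2

# The original index labels are preserved after dropping rows;
# reset_index(drop=True) renumbers them from 0 if that is preferred.
renumbered = last.reset_index(drop=True)
print(renumbered)
```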
    

    Extracting duplicates

    DataFrame.duplicated returns a boolean Series that marks whether each row is a duplicate

    DataFrame.duplicated(subset=None, keep='first')
    

    subset : column label or sequence of labels, optional
    Only consider certain columns for identifying duplicates, by default use all of the columns
    keep : {‘first’, ‘last’, False}, default ‘first’

    • first : Mark duplicates as True except for the first occurrence.
    • last : Mark duplicates as True except for the last occurrence.
    • False : Mark all duplicates as True.

    For example

    In[23]: df.duplicated(subset="B")
    Out[23]: 
    0    False
    1    False
    2     True
    3    False
    dtype: bool
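
The call above uses the default keep='first'. As a small sketch, again assuming df was built from a dict as below, the three keep values give these masks on column B:

```python
import pandas as pd

# Assumed construction of the df used throughout the post.
df = pd.DataFrame({"A": [1, 2, 3, 1], "B": ["a", "b", "b", "c"]})

mask_first = df.duplicated(subset="B")               # only the later "b" (row 2) is marked
mask_last = df.duplicated(subset="B", keep="last")   # only the earlier "b" (row 1) is marked
mask_all = df.duplicated(subset="B", keep=False)     # both "b" rows are marked

print(mask_first.tolist())  # [False, False, True, False]
print(mask_last.tolist())   # [False, True, False, False]
print(mask_all.tolist())    # [False, True, True, False]
```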
    

    Extracting the duplicate rows is also straightforward

    In[24]: df[df.duplicated(subset="B")]
    Out[24]: 
       A  B
    2  3  b
    

    Or, in a slightly more roundabout way, extract them by index

    In[30]: df.drop(df.drop_duplicates('B').index)
    Out[30]: 
       A  B
    2  3  b
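
The two approaches agree because drop_duplicates keeps exactly the rows that duplicated does not mark. A minimal sketch of that relationship, assuming the same df built from a dict:

```python
import pandas as pd

# Assumed construction of the df used throughout the post.
df = pd.DataFrame({"A": [1, 2, 3, 1], "B": ["a", "b", "b", "c"]})

kept = df.drop_duplicates(subset="B")         # rows that survive deduplication
via_mask = df[~df.duplicated(subset="B")]     # same rows, selected with the inverted mask
dropped = df[df.duplicated(subset="B")]       # the rows that would be removed

print(kept.equals(via_mask))    # True
print(dropped.index.tolist())   # [2]
```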
    

    Or, to extract every row that has a duplicate

    In[37]: df[df.duplicated(subset='B',keep=False)]
    Out[37]: 
       A  B
    1  2  b
    2  3  b
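
Once all duplicated rows are selected, a common follow-up is counting them. The sketch below is one possible way to do that, not something from the original post; it assumes the same df and uses groupby to count how often each value of B occurs.

```python
import pandas as pd

# Assumed construction of the df used throughout the post.
df = pd.DataFrame({"A": [1, 2, 3, 1], "B": ["a", "b", "b", "c"]})

# True for every row whose B value appears more than once.
dup_mask = df.duplicated(subset="B", keep=False)
n_dup_rows = int(dup_mask.sum())
print(n_dup_rows)  # 2

# Occurrences per value of B, filtered to the values that repeat.
counts = df.groupby("B").size()
repeated = counts[counts > 1]
print(repeated.to_dict())  # {'b': 2}
```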
    
