Python Example: Order-Insensitive Deduplication of Multi-Column Table Values

Author: HackerFen | Published: 2022-01-18 20:04

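    Problem

    • Take the ids of three nodes that form a triangle from a graph database and deduplicate them.
    • Each row holds the ids of three nodes; if two rows contain the same three ids, they are treated as the same triangle.
    • For example, the first and second rows of the data are the same triangle.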

    Raw data

    # test.csv
    0000068d96366d5e052ed65eaecb1112    74a995fb9b79e84232b7510644688cd8    dfb644784e6292b8d7f499fc53229dda
    0000068d96366d5e052ed65eaecb1112    dfb644784e6292b8d7f499fc53229dda    74a995fb9b79e84232b7510644688cd8
    0000068d96366d5e052ed65eaecb1112    2902e25dde191f5f596a3756bed1a3ce    96288b79adb133ba4b69b5f97716eee6
    000007f62fb9ee58cec1391e55b9c200    836598c70ab46fb1c621fe548dd363cc    a5545ef32843ecfa7038692a1dbbe305
    000008dc6580ce3d3313e35417c0aa65    8c658523dbcf012383fb12aec76f220f    b193cb6de781dbacc04ce2ccb96d43dc
    00001d74b7b878b076ab2d84d5de4296    d2a5854ae3efa1ee4e7a4070e841f108    4c0fd736ef3845072bdd9a318a71a96f
    000021d8344e102a857740ee319c40e9    c27c333b1586b239bb302b23c70ce274    9a29e26feb9d254a8f70c9ad94e8d1dc
    000021d8344e102a857740ee319c40e9    9a29e26feb9d254a8f70c9ad94e8d1dc    c27c333b1586b239bb302b23c70ce274
    000021d8344e102a857740ee319c40e9    2902e25dde191f5f596a3756bed1a3ce    96288b79adb133ba4b69b5f97716eee6
    0000419d8eec64dd2582336713a51ad3    714215e7fdf0f3f32073df73caecd44f    3255e40d3de83c0ad9ebd35569e601b4
    0000434e444cab7991535ab1dc7e0a43    2ab9986c1d02386ca74155e55fcdb64c    16de98e5c7f9751fc6b66526ce875736
    0000434e444cab7991535ab1dc7e0a43    16de98e5c7f9751fc6b66526ce875736    2ab9986c1d02386ca74155e55fcdb64c
    00004e274353b7bd829ad789930a799c    a76fdec4ace13644f6f50934c53fb484    18ee36edb4a3ed1c9d35cca2a01af37c
    00004e274353b7bd829ad789930a799c    18ee36edb4a3ed1c9d35cca2a01af37c    a76fdec4ace13644f6f50934c53fb484
    00009475ecac55a5816427acd81a2873    d1c17f8ff05ee3243fc084ae45042891    73b357fc3e0f5e538521c9589762e113
    00009f5e45eb2f2c37ed8e349ba63e76    c8fbf4f7114038db015eb8debf0bfba8    1f111ee13d222ab4f4d3d128af201aea
    00009f5e45eb2f2c37ed8e349ba63e76    1f111ee13d222ab4f4d3d128af201aea    c8fbf4f7114038db015eb8debf0bfba8
    00009f5e45eb2f2c37ed8e349ba63e76    2902e25dde191f5f596a3756bed1a3ce    96288b79adb133ba4b69b5f97716eee6
    0000b229f3216a8ca929c1a90292665e    5aaf57319e251c04fe8b3d4f1d687cad    2f1f7611449c9834dc8d1df8e745e7c1
    0000f2546c76a8412c130228dca0ea8d    b24401256e3bd6d30c32bbedb9a90956    24338e2d8656a53303f3dced1d6af82f
    

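    Approach: the same three values should produce the same single value, regardless of the order in which they are processed.
    For example: a+b+c = c+b+a, a^b^c = c^b^a
    Note: glad I have not handed everything I learned back to my teachers.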
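
    A minimal sketch of this idea, using small illustrative integers rather than the real ids. It checks that both operations give the same result for every ordering of the inputs. It also shows a caveat: different sets of values can produce the same combined value, so a key built this way can in principle merge triangles that are actually distinct.

```python
from itertools import permutations

a, b, c = 0x1A, 0x2B, 0x3C  # illustrative values, not taken from test.csv

# Every ordering of the three values yields the same sum and the same XOR.
sums = {x + y + z for x, y, z in permutations((a, b, c))}
xors = {x ^ y ^ z for x, y, z in permutations((a, b, c))}
print(len(sums), len(xors))  # 1 1

# Caveat: order-independent does not mean collision-free.
print(1 + 2 + 3 == 1 + 1 + 4)  # True: two different multisets, same sum
print(1 ^ 2 ^ 3 == 1 ^ 4 ^ 5)  # True: two different sets, same XOR
```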

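    Method one

    Encode the id values of the three nodes, XOR them into a single value, and deduplicate on that value.
    The XORed bytes are not convenient to print, so base64 is used to turn them into a readable string.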

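    from base64 import b64encode

    vs = set()  # keys of the triangles already written
    with open('test.csv', 'r') as fr, open('out.csv', 'w') as fw:
        for line in fr:
            # assumes the fields in test.csv are comma-separated
            v0, v1, v2 = line.strip().split(',')
            # XOR the UTF-8 bytes of the three ids position by position,
            # then base64-encode the result into a printable string
            xored = bytes(i ^ j ^ k for i, j, k in zip(v0.encode('utf-8'), v1.encode('utf-8'), v2.encode('utf-8')))
            v = b64encode(xored).decode('utf-8')

            if v not in vs:
                fw.write(f'{v0},{v1},{v2},{v}\n')
                vs.add(v)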
    

    Final result

    # out.csv
    0000068d96366d5e052ed65eaecb1112    74a995fb9b79e84232b7510644688cd8    dfb644784e6292b8d7f499fc53229dda    Y2IzPz03aT40MTI9am5jb2cwNmZoPmMwYGJnaDA2MWs=
    0000068d96366d5e052ed65eaecb1112    2902e25dde191f5f596a3756bed1a3ce    96288b79adb133ba4b69b5f97716eee6    Oz8yOm1mOjk8N2A+NDFiYjFuMj01NGZqNDc2ZTVnN2E=
    000007f62fb9ee58cec1391e55b9c200    836598c70ab46fb1c621fe548dd363cc    a5545ef32843ecfa7038692a1dbbe305    aTYzMTxqYzIwPzQ+NmAxaDdjYjhjZTYwPDVkaDAyY2Y=
    000008dc6580ce3d3313e35417c0aa65    8c658523dbcf012383fb12aec76f220f    b193cb6de781dbacc04ce2ccb96d43dc    amI/NmtvYDQ3YGNnNzZgNGgwYzIxMzcyMDljMmdgYjA=
    00001d74b7b878b076ab2d84d5de4296    d2a5854ae3efa1ee4e7a4070e841f108    4c0fd736ef3845072bdd9a318a71a96f    YGFhY21mMGNiYjRmYjw3YjExMmc/NTw1OWxnZTM6P2g=
    000021d8344e102a857740ee319c40e9    c27c333b1586b239bb302b23c70ce274    9a29e26feb9d254a8f70c9ad94e8d1dc    amM1amQwYTxnYzU3YTc1OWIxMzdlazYyaTJsODUzNm4=
    000021d8344e102a857740ee319c40e9    2902e25dde191f5f596a3756bed1a3ce    96288b79adb133ba4b69b5f97716eee6    Oz8yOm9hZmU2NWdtM2VlZjluN29lMjZqZmNsZDBmY2o=
    0000419d8eec64dd2582336713a51ad3    714215e7fdf0f3f32073df73caecd44f    3255e40d3de83c0ad9ebd35569e601b4    NDMxN2AwbDdtZWZrY2QyNmQ8amMzZjQxZGthYGVkMmE=
    0000434e444cab7991535ab1dc7e0a43    2ab9986c1d02386ca74155e55fcdb64c    16de98e5c7f9751fc6b66526ce875736    M2c2bDQzZzNmZ2JoZW8wPDswYzQ2YTUyMmBsNmdgM2Y=
    00004e274353b7bd829ad789930a799c    a76fdec4ace13644f6f50934c53fb484    18ee36edb4a3ed1c9d35cca2a01af37c    YD9jM2M2NGc3ZDExNGVnM2dgbGE3bWo/OzYyZjM+NjQ=
    00009475ecac55a5816427acd81a2873    d1c17f8ff05ee3243fc084ae45042891    73b357fc3e0f5e538521c9589762e113    YzIxMjtlaTAwNmRgZWNmMjNiZzVpOjU+aTo3Z2UxPzE=
    00009f5e45eb2f2c37ed8e349ba63e76    c8fbf4f7114038db015eb8debf0bfba8    1f111ee13d222ab4f4d3d128af201aea    Ym5nY243NmM2YGNgMz80NWUyNDI+bGVpOmJjZGRmM28=
    00009f5e45eb2f2c37ed8e349ba63e76    2902e25dde191f5f596a3756bed1a3ce    96288b79adb133ba4b69b5f97716eee6    Oz8yOmQ2NzgxNDZqMDNlZDJsZTxpZ2A7bDA0MTczMWU=
    0000b229f3216a8ca929c1a90292665e    5aaf57319e251c04fe8b3d4f1d687cad    2f1f7611449c9834dc8d1df8e745e7c1    NzdgMGAzMDlrYjlnPjo7Y2M/Mj9hMTNnZGE7P2RiNzA=
    0000f2546c76a8412c130228dca0ea8d    b24401256e3bd6d30c32bbedb9a90956    24338e2d8656a53303f3dced1d6af82f    YDY3N25mNWU4MDFiZDtjMTIzZDI2MzI4Nz42aDNgPzQ=
    
    

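    After thinking it over for a night, method one still felt too awkward: the three hexadecimal numbers should be XORed into a single hexadecimal number.
    I reread the official documentation for bytes and found the bytes.fromhex and hex methods. The official documentation really is the best guide.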

    Method two

    Read the id values of the three nodes as bytes, XOR them into a single value, and deduplicate on that value.

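    vs = set()  # keys of the triangles already written
    with open('test.csv', 'r') as fr, open('out.csv', 'w') as fw:
        for line in fr:
            # assumes the fields in test.csv are comma-separated
            v0, v1, v2 = line.strip().split(',')
            # parse each hex id into raw bytes, XOR them byte by byte,
            # and turn the result back into a hex string
            v = bytes(i ^ j ^ k for i, j, k in zip(bytes.fromhex(v0), bytes.fromhex(v1), bytes.fromhex(v2))).hex()
            # print(f'{v0},{v1},{v2},{v}')
            if v not in vs:
                fw.write(f'{v0},{v1},{v2},{v}\n')
                vs.add(v)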
    

    Final result

    # out.csv
    0000068d96366d5e052ed65eaecb1112    74a995fb9b79e84232b7510644688cd8    dfb644784e6292b8d7f499fc53229dda    ab1fd70e432d17a4e06d1ea4b9810010
    0000068d96366d5e052ed65eaecb1112    2902e25dde191f5f596a3756bed1a3ce    96288b79adb133ba4b69b5f97716eee6    bf2a6fa9e59e41bb172d54f1670c5c3a
    000007f62fb9ee58cec1391e55b9c200    836598c70ab46fb1c621fe548dd363cc    a5545ef32843ecfa7038692a1dbbe305    2631c1c20d4e6d1378d8ae60c5d142c9
    000008dc6580ce3d3313e35417c0aa65    8c658523dbcf012383fb12aec76f220f    b193cb6de781dbacc04ce2ccb96d43dc    3df6469259ce14b270a4133669c2cbb6
    00001d74b7b878b076ab2d84d5de4296    d2a5854ae3efa1ee4e7a4070e841f108    4c0fd736ef3845072bdd9a318a71a96f    9eaa4f08bb6f9c59130cf7c5b7ee1af1
    000021d8344e102a857740ee319c40e9    c27c333b1586b239bb302b23c70ce274    9a29e26feb9d254a8f70c9ad94e8d1dc    5855f08cca558759b137a26062787341
    000021d8344e102a857740ee319c40e9    2902e25dde191f5f596a3756bed1a3ce    96288b79adb133ba4b69b5f97716eee6    bf2a48fc47e63ccf9774c241f85b0dc1
    0000419d8eec64dd2582336713a51ad3    714215e7fdf0f3f32073df73caecd44f    3255e40d3de83c0ad9ebd35569e601b4    4317b0774ef4ab24dc1a3f41b0afcf28
    0000434e444cab7991535ab1dc7e0a43    2ab9986c1d02386ca74155e55fcdb64c    16de98e5c7f9751fc6b66526ce875736    3c6743c79eb7e60af0a46a724d34eb39
    00004e274353b7bd829ad789930a799c    a76fdec4ace13644f6f50934c53fb484    18ee36edb4a3ed1c9d35cca2a01af37c    bf81a60e5b116ce5e95a121ff62f3e64
    00009475ecac55a5816427acd81a2873    d1c17f8ff05ee3243fc084ae45042891    73b357fc3e0f5e538521c9589762e113    a272bc0622fde8d23b856a5a0a7ce1f1
    00009f5e45eb2f2c37ed8e349ba63e76    c8fbf4f7114038db015eb8debf0bfba8    1f111ee13d222ab4f4d3d128af201aea    d7ea754869893d43c260e7c28b8ddf34
    00009f5e45eb2f2c37ed8e349ba63e76    2902e25dde191f5f596a3756bed1a3ce    96288b79adb133ba4b69b5f97716eee6    bf2af67a364303c925ee0c9b5261735e
    0000b229f3216a8ca929c1a90292665e    5aaf57319e251c04fe8b3d4f1d687cad    2f1f7611449c9834dc8d1df8e745e7c1    75b093092998eebc8b2fe11ef8bffd32
    0000f2546c76a8412c130228dca0ea8d    b24401256e3bd6d30c32bbedb9a90956    24338e2d8656a53303f3dced1d6af82f    96777d5c841bdba123d2652878631bf4
    

    This looks much more comfortable and far more reasonable.
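
    As a quick check of method two, the XOR can be wrapped in a small helper and applied to the first two rows of the sample data, which describe the same triangle in a different order. The name `xor_key` is hypothetical and used only for this illustration.

```python
def xor_key(*hex_ids: str) -> str:
    """XOR equal-length hex strings byte by byte and return the result as hex."""
    columns = zip(*(bytes.fromhex(h) for h in hex_ids))
    out = bytearray()
    for col in columns:
        acc = 0
        for byte in col:
            acc ^= byte
        out.append(acc)
    return out.hex()


row1 = ("0000068d96366d5e052ed65eaecb1112",
        "74a995fb9b79e84232b7510644688cd8",
        "dfb644784e6292b8d7f499fc53229dda")
row2 = (row1[0], row1[2], row1[1])  # same ids, last two swapped

print(xor_key(*row1))                    # ab1fd70e432d17a4e06d1ea4b9810010
print(xor_key(*row1) == xor_key(*row2))  # True
```

    The first printed value matches the key listed for this triangle in out.csv.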

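    Summary

    1. Comparing unordered collections element by element is hard to implement, even just to think through. Instead, generate a single value that represents each unordered collection and compare those values.
    2. Read the official documentation more often.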
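
    The representative value in point 1 does not have to be an XOR. As an alternative sketch (a different technique from the two methods above, not the author's), the sorted tuple of the three ids can serve as the key: it is also independent of order, and two rows share a key only when they contain exactly the same ids. The function name `dedupe_rows` and the short sample ids are hypothetical.

```python
def dedupe_rows(rows):
    """Keep the first row for each distinct unordered combination of ids."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row))  # canonical form: order no longer matters
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    ("a1", "b2", "c3"),
    ("a1", "c3", "b2"),  # same triangle as the first row
    ("a1", "d4", "e5"),
]
print(dedupe_rows(rows))
```

    The trade-off is that the key is as long as the row itself, whereas the XOR key has the length of a single id.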

    Article link: https://www.haomeiwen.com/subject/hbflhrtx.html