Finding Duplicate Values with Python

Author: 兴富同学 | Published 2018-07-07 17:51 | 36 reads


Task overview

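In the word list, some words appear more than once, each occurrence with its own example sentences. The task is to find the indices of every duplicated word, merge the example sentences of the duplicates into one row, and finally split the whole list into a duplicated part and a non-duplicated part.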

Core code

1. Reading and writing Excel with the xlrd and xlwt modules

To read an Excel file, first get the list of all sheet names, then open one sheet by name, and finally use sheet.row_values() and sheet.col_values() to fetch the contents of a given row or column.

import xlrd

initialData = '...'  # path of the Excel file to read
workbook = xlrd.open_workbook(initialData)
sheet_names = workbook.sheet_names()
sheet = workbook.sheet_by_name(sheet_names[0])  # first sheet
data = sheet.col_values(4)  # every value in column index 4

To write an Excel file, create an in-memory workbook with xlwt.Workbook(), then add named sheets to it with .add_sheet().

book = xlwt.Workbook(encoding='utf-8', style_compression=0)
wSheet1 = book.add_sheet("noRepetition")
wSheet2 = book.add_sheet("repetition")

2. Removing all duplicates with set(data)

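Build a list of lists, allData, that stores for each distinct word the index of its first occurrence, its number of occurrences, and the word itself.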

data_unique = set(data)  # one entry per distinct word
allData = []

for item in data_unique:
    id = data.index(item)   # index of the first occurrence
    num = data.count(item)  # number of occurrences
    allData.append([id, num, data[id].strip()])
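
To make the contents of allData concrete, here is a minimal sketch that runs the same set / .index() / .count() steps on a small in-memory list standing in for the word column. The sample words and the helper name summarize are made up for illustration, and the final sort is an addition: a set has no defined order, so sorting by first index keeps the output predictable.

```python
def summarize(data):
    """Return [first_index, count, word] for every distinct word in data."""
    allData = []
    for item in set(data):
        first = data.index(item)  # index of the first occurrence
        num = data.count(item)    # total number of occurrences
        allData.append([first, num, item])
    # Set iteration order is arbitrary, so sort by first index.
    allData.sort()
    return allData


# Hypothetical sample column.
words = ["apple", "book", "apple", "cat", "book", "apple"]
print(summarize(words))
# [[0, 3, 'apple'], [1, 2, 'book'], [3, 1, 'cat']]
```

Because .index() and .count() each scan the whole list, this approach is quadratic in the list length, which is usually acceptable for a word list of a few thousand rows.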

3. Finding all example sentences

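The core idea is to use .index() to locate every occurrence of a duplicated word and, with it, every example sentence. On its own, .index() only returns the index of the first match. So, knowing how many times the word occurs, we overwrite each occurrence already found with a placeholder and search again; repeating this visits every occurrence in turn. (Adapted from https://blog.csdn.net/qq_33094993/article/details/53584379, where the trick is nicknamed "偷梁换柱", roughly "bait and switch".)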

# Fragment of the main loop: id, num, word, dlen, c2, sheet and wSheet2
# are defined in the full code.
nid = id
for n in range(num-1):
    data[nid] = 'quchu'       # overwrite the occurrence already handled
    print(id, num, data[nid])
    nid = data.index(word)    # index of the next occurrence
    nwordData = sheet.row_values(nid)
    # Append columns 6-9 of this occurrence after the first occurrence's cells.
    for k in range(4):
        wSheet2.write(c2, 1+dlen+4*n+k, nwordData[6+k])
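
The placeholder trick can be tried in isolation on a plain list. The sketch below is illustrative only: the function name all_indices and the sample words are assumptions, and it works on a copy so the original list is left untouched (the script in this article overwrites data in place). It also assumes the placeholder string never occurs as a real word.

```python
def all_indices(data, word, placeholder="quchu"):
    """Find every index of word by calling .index() repeatedly."""
    work = list(data)            # copy, so the caller's list is untouched
    found = []
    for _ in range(work.count(word)):
        nid = work.index(word)   # first occurrence still visible
        found.append(nid)
        work[nid] = placeholder  # hide it so the next search moves on
    return found


# Hypothetical sample column.
words = ["apple", "book", "apple", "cat", "book", "apple"]
print(all_indices(words, "apple"))  # [0, 2, 5]
print(all_indices(words, "cat"))    # [3]

# The same indices without overwriting anything:
print([i for i, w in enumerate(words) if w == "apple"])  # [0, 2, 5]
```

The enumerate version on the last line gives the same result in a single pass and needs no placeholder, so it may be the simpler choice when the list does not have to be modified.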

Full code

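import xlrd
import xlwt

# Input workbook. Note: xlrd 2.0 and later read only .xls files, so opening
# an .xlsx file this way requires an older xlrd (1.x).
initialData = 'book.xlsx'
workbook = xlrd.open_workbook(initialData)
sheet_names = workbook.sheet_names()

# Read the word column (column index 4) from the first sheet.
sheet = workbook.sheet_by_name(sheet_names[0])
data = sheet.col_values(4)
print(len(data))
# Strip surrounding whitespace so that 'word' and 'word ' compare equal.
# This assumes every cell in the column holds text.
for i in range(len(data)):
    data[i] = data[i].strip()

# For every distinct word, record [first index, occurrence count, word].
data_unique = set(data)
allData = []

for item in data_unique:
    id = data.index(item)
    num = data.count(item)
    allData.append([id, num, data[id].strip()])

# Output workbook: one sheet for unique words, one for duplicated words.
book = xlwt.Workbook(encoding='utf-8', style_compression=0)
wSheet1 = book.add_sheet("noRepetition")
wSheet2 = book.add_sheet("repetition")
c1 = 0  # next free row in noRepetition
c2 = 0  # next free row in repetition
for d in allData:
    id = d[0]
    num = d[1]
    word = d[2]

    wordData = sheet.row_values(int(id))
    if num > 1:
        # Column 0 holds the count, followed by the first occurrence's full row.
        wSheet2.write(c2, 0, num)
        dlen = len(wordData)
        for i in range(dlen):
            wSheet2.write(c2, i+1, wordData[i])
        nid = id
        for n in range(num-1):
            # Overwrite the occurrence just handled so .index() finds the next one.
            data[nid] = 'quchu'
            print(id, num, data[nid])
            nid = data.index(word)
            nwordData = sheet.row_values(nid)
            # Append columns 6-9 of this occurrence to the right of the row.
            for k in range(4):
                wSheet2.write(c2, 1+dlen+4*n+k, nwordData[6+k])
        c2 = c2 + 1
    else:
        # Unique word: copy the row unchanged.
        for i in range(len(wordData)):
            wSheet1.write(c1, i, wordData[i])
        c1 = c1 + 1

# xlwt writes the legacy .xls format.
savePath = 'book_分离.xls'
book.save(savePath)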
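
As an alternative to the placeholder loop, the grouping and merging can be sketched in one pass over in-memory rows with a dictionary. This is not the code used above, only a hedged sketch: split_rows, make_row, WORD_COL and SENTENCE_COLS are hypothetical names, the sample rows are invented, and the column positions (word in column 4, merged fields in columns 6-9) are taken from the script above.

```python
from collections import defaultdict

WORD_COL = 4                  # column holding the word
SENTENCE_COLS = range(6, 10)  # columns 6-9, copied for each extra occurrence


def split_rows(rows):
    """Split rows into (unique_rows, merged_duplicate_rows)."""
    groups = defaultdict(list)  # word -> indices of its rows, in order
    for idx, row in enumerate(rows):
        groups[str(row[WORD_COL]).strip()].append(idx)

    unique, repeated = [], []
    for indices in groups.values():  # dicts keep insertion order (3.7+)
        first = rows[indices[0]]
        if len(indices) == 1:
            unique.append(list(first))
            continue
        # Same layout as the "repetition" sheet: count, first row, extras.
        merged = [len(indices)] + list(first)
        for idx in indices[1:]:
            merged.extend(rows[idx][c] for c in SENTENCE_COLS)
        repeated.append(merged)
    return unique, repeated


def make_row(word, tag):
    # Hypothetical 10-column row; only columns 4 and 6-9 matter here.
    return ["", "", "", "", word, "", tag + "1", tag + "2", tag + "3", tag + "4"]


rows = [make_row("apple", "a"), make_row("book", "b"), make_row(" apple ", "c")]
unique, repeated = split_rows(rows)
print(len(unique), len(repeated))  # 1 1
print(repeated[0][0])              # 2
print(repeated[0][-4:])            # ['c1', 'c2', 'c3', 'c4']
```

Each resulting list could then be written cell by cell with sheet.write(row, col, value), as in the script above. Grouping by dictionary keeps the input order and scans the list once, at the cost of holding all rows in memory.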