Finding Duplicate Values with Python

Author: 兴富同学 | Published: 2018-07-07 17:51

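    Task overview

    Some words in the vocabulary list appear more than once, and each duplicate entry carries its own example sentences. The task is to find the indices of every repeated word, merge the example sentences of each repeated word into a single row, and finally split the whole list into a duplicated part and a non-duplicated part.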

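    Core code

    1. Reading and writing Excel with the xlrd and xlwt modules

    To read an Excel file, get the list of all sheet names, open one sheet by name, and then call sheet.row_values() or sheet.col_values() to fetch the contents of a given row or column.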

    import xlrd

    initialData = '...'  # path of the Excel file to read
    workbook = xlrd.open_workbook(initialData)
    sheet_names = workbook.sheet_names()
    sheet = workbook.sheet_by_name(sheet_names[0])
    data = sheet.col_values(4)
    

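    To write an Excel file, create an in-memory workbook with xlwt.Workbook(), then create each sheet by name with .add_sheet(). Nothing reaches the disk until book.save() is called.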

    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    wSheet1 = book.add_sheet("noRepetition")
    wSheet2 = book.add_sheet("repetition")
    

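    2. Removing all duplicates with set(data)

    Build a list of rows, allData, that stores for each distinct word its index (the first occurrence), its number of occurrences, and the word itself.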

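    data_unique = set(data)  # one entry per distinct word (unordered)
    allData = []

    for item in data_unique:
        id = data.index(item)   # index of the first occurrence
        num = data.count(item)  # total number of occurrences
        allData.append([id,num,data[id].strip()])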
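
    Calling .index() and .count() once per distinct word rescans the whole list each time, and iterating over a set yields the words in arbitrary order. As a rough alternative sketch (not the author's method), the same [index, count, word] rows can be built in a single pass with collections.Counter. The function name summarize and the sample words are made up for illustration.

```python
from collections import Counter


def summarize(words):
    """Return [first_index, count, word] per distinct word, in order of first appearance."""
    cleaned = [w.strip() for w in words]
    counts = Counter(cleaned)          # word -> number of occurrences
    first_index = {}
    for i, w in enumerate(cleaned):
        first_index.setdefault(w, i)   # keep only the first position seen
    return [[i, counts[w], w] for w, i in first_index.items()]


# Hypothetical sample data standing in for sheet.col_values(4)
sample = ["apple", "pear ", "apple", "plum", "pear", "apple"]
all_data = summarize(sample)
print(all_data)
```

    Because dictionaries keep insertion order in Python 3.7 and later, the rows come out in the same order as the words first appear in the sheet.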
    

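    3. Finding all example sentences

    The core idea is to use .index() to locate every occurrence of a repeated word. Since .index() returns only the index of the first match, each occurrence that has already been found is overwritten with a placeholder, and the search is repeated as many times as the word's repetition count requires. This reaches every occurrence and therefore every example sentence. (Adapted from https://blog.csdn.net/qq_33094993/article/details/53584379, where the technique is called "偷梁换柱", roughly "swapping the beams for the pillars", i.e. a substitution trick.)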

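    # id, num, word, c2 and dlen come from the enclosing loop in the full code below.
    nid = id  # start from the first occurrence
    for n in range(num-1):
        data[nid] = 'quchu'     # overwrite the occurrence already handled ("quchu" = "remove")
        print(id, num, data[nid])  # debug output
        nid = data.index(word)  # .index() now returns the next occurrence
        nwordData = sheet.row_values(nid)
        # Append columns 6-9 of the duplicate row (the example fields) to the merged row.
        wSheet2.write(c2, 1+dlen+4*n, nwordData[6])
        wSheet2.write(c2, 1+dlen+4*n+1, nwordData[7])
        wSheet2.write(c2, 1+dlen+4*n+2, nwordData[8])
        wSheet2.write(c2, 1+dlen+4*n+3, nwordData[9])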
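
    The substitution trick modifies data in place and would misfire if the placeholder ever appeared as a real word. A minimal sketch of a non-destructive variant, assuming plain Python lists, uses the optional start argument of list.index() to resume the search just after the previous match. The helper name all_indices and the sample list are illustrative only.

```python
def all_indices(values, target):
    """Return every position at which target occurs in values, leaving values untouched."""
    positions = []
    start = 0
    while True:
        try:
            pos = values.index(target, start)  # search from `start` onward
        except ValueError:                     # no further match
            break
        positions.append(pos)
        start = pos + 1                        # resume just after this match
    return positions


# Hypothetical sample data
words = ["apple", "pear", "apple", "plum", "apple"]
print(all_indices(words, "apple"))
```

    Each returned position can be passed to sheet.row_values() in the same way as nid above.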
    

    Full code

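    import xlrd
    import xlwt

    # Note: xlrd 2.0 and later read only .xls files; an .xlsx input needs
    # xlrd < 2.0 or a conversion to .xls first.
    initialData = 'book.xlsx'
    workbook = xlrd.open_workbook(initialData)
    sheet_names = workbook.sheet_names()

    # Column 4 (zero-based) of the first sheet holds the words.
    sheet = workbook.sheet_by_name(sheet_names[0])
    data = sheet.col_values(4)
    print(len(data))
    # Strip surrounding whitespace so that 'word ' and 'word' compare equal.
    for i in range(len(data)):
        data[i] = data[i].strip()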
        
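    data_unique = set(data)  # one entry per distinct word (unordered)
    allData = []

    # For each distinct word: [index of first occurrence, count, word]
    for item in data_unique:
        id = data.index(item)
        num = data.count(item)
        allData.append([id,num,data[id].strip()])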
    
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    wSheet1 = book.add_sheet("noRepetition")
    wSheet2 = book.add_sheet("repetition")
    c1 = 0  # next free row in noRepetition
    c2 = 0  # next free row in repetition
    for d in allData:
        id = d[0]
        num = d[1]
        word = d[2]

        wordData = sheet.row_values(int(id))
        if num > 1:
            # Repeated word: write the count, then the full first row.
            wSheet2.write(c2, 0, num)
            dlen = len(wordData)
            for i in range(dlen):
                wSheet2.write(c2, i+1, wordData[i])
            nid = id
            for n in range(num-1):
                # Overwrite the occurrence already handled so that .index()
                # returns the next one.
                data[nid] = 'quchu'
                print(id, num, data[nid])  # debug output
                nid = data.index(word)
                nwordData = sheet.row_values(nid)
                # Append columns 6-9 of each further occurrence to the same row.
                wSheet2.write(c2, 1+dlen+4*n, nwordData[6])
                wSheet2.write(c2, 1+dlen+4*n+1, nwordData[7])
                wSheet2.write(c2, 1+dlen+4*n+2, nwordData[8])
                wSheet2.write(c2, 1+dlen+4*n+3, nwordData[9])
            c2 = c2 + 1
        else:
            # Word occurs once: copy its row unchanged.
            for i in range(len(wordData)):
                wSheet1.write(c1, i, wordData[i])
            c1 = c1 + 1
    
    savePath = 'book_分离.xls'
    book.save(savePath)
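
    The split-and-merge logic can also be tried without any Excel file. The following is a minimal in-memory sketch, not the author's implementation: rows are plain lists, and the function name split_rows, its parameters key_col and example_cols, and the sample rows are all assumptions made for illustration. It mirrors the layout used above, where a merged row holds the count, then the first occurrence's full row, then the example columns of every later occurrence.

```python
from collections import defaultdict


def split_rows(rows, key_col, example_cols):
    """Split rows into (unique, merged) according to the value in key_col."""
    groups = defaultdict(list)
    for row in rows:
        # Group rows by the stripped key so 'apple ' and 'apple' match.
        groups[str(row[key_col]).strip()].append(row)

    unique, merged = [], []
    for group in groups.values():
        if len(group) == 1:
            unique.append(list(group[0]))
        else:
            first, rest = group[0], group[1:]
            out = [len(group)] + list(first)      # count, then the first row
            for row in rest:
                out.extend(row[c] for c in example_cols)  # later examples
            merged.append(out)
    return unique, merged


# Hypothetical sample rows: [word, example sentence]
rows = [
    ["apple", "An apple a day."],
    ["pear", "A ripe pear."],
    ["apple ", "She ate the apple."],
]
unique, merged = split_rows(rows, key_col=0, example_cols=[1])
print(unique)
print(merged)
```

    For the sheet used in this article, key_col would correspond to column 4 and example_cols to columns 6 through 9.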
    
