Iterating over text and removing the repeated parts of a string


Author: 弦好想断 | Published: 2020-09-24 17:50
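    import jieba
    import pandas as pd

    # df is assumed to be an existing DataFrame with a 'name' column of strings
    ls = []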
    for word in df['name'].values:
        result = jieba.tokenize(word)
        wd=[]
        start_index = []
        end_index = []
        for tk in result:
            wd.append(tk[0])
            start_index.append(tk[1])
            end_index.append(tk[2])
        data1 = {'wd':wd,
              'start_index':start_index,
              'end_index':end_index}
        data1 = pd.DataFrame(data1)
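        # Deduplicate on the 'wd' column, keeping the last occurrence.
        # keep='last' matters: it decides which copy is later cut from the string.
        keep_last_drop = data1.drop_duplicates(subset=['wd'], keep='last')
        # Tokens that occur exactly once
        keep_False_drop = data1.drop_duplicates(subset=['wd'], keep=False)
        # Rows only in keep_last_drop: the last occurrence of each repeated token
        # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
        dup_df = pd.concat([keep_last_drop, keep_False_drop]).drop_duplicates(subset=['wd'], keep=False)
        dup_df = dup_df[~dup_df['wd'].isin(['(', ')'])]  # ignore repeated parentheses
        dup_df = dup_df.reset_index(drop=True)
        # print(dup_df)  # show the positions of the repeated tokens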
        # Cut every flagged span out of the string, however many repeats there are.
        # (The hand-written cases for 0 to 3 repeats left x unset or stale for 4+.)
        # dup_df keeps token order, so the spans are ascending and do not overlap.
        pieces, prev = [], 0
        for start, end in zip(dup_df['start_index'], dup_df['end_index']):
            pieces.append(word[prev:start])  # text between the previous span and this one
            prev = end
        pieces.append(word[prev:])  # tail after the last removed span
        x = ''.join(pieces)
        ls.append(x)
    len(ls)
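
The same idea can be sketched without pandas. The following is a minimal illustration, not the author's code: the function name `drop_last_repeats` and the sample tokens are hypothetical, and the tokens are written by hand in the `(word, start, end)` shape that `jieba.tokenize` yields, so the snippet runs without jieba installed.

```python
from collections import Counter
from typing import Iterable, Tuple

Token = Tuple[str, int, int]  # (text, start, end), the shape jieba.tokenize yields


def drop_last_repeats(text: str, tokens: Iterable[Token],
                      ignore: Tuple[str, ...] = ('(', ')')) -> str:
    """Remove the last occurrence of every token that appears more than once."""
    tokens = list(tokens)
    counts = Counter(tok for tok, _, _ in tokens)

    # Later occurrences overwrite earlier ones, so this ends up holding
    # the span of the last occurrence of each token.
    last_seen = {}
    for tok, start, end in tokens:
        last_seen[tok] = (start, end)

    # Spans to cut: last occurrence of each repeated, non-ignored token.
    spans = sorted(span for tok, span in last_seen.items()
                   if counts[tok] > 1 and tok not in ignore)

    # Stitch together everything outside the spans.
    pieces, prev = [], 0
    for start, end in spans:
        pieces.append(text[prev:start])
        prev = end
    pieces.append(text[prev:])
    return ''.join(pieces)


# Hand-written tokens standing in for jieba.tokenize output (an assumption,
# not actual jieba segmentation).
sample_text = "华为华为手机"
sample_tokens = [("华为", 0, 2), ("华为", 2, 4), ("手机", 4, 6)]
print(drop_last_repeats(sample_text, sample_tokens))
```

With real data, the second argument would be `jieba.tokenize(word)` for each `word` in `df['name']`. As in the listing above, a token that occurs three or more times only loses its last copy.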
    


Article link: https://www.haomeiwen.com/subject/erfrektx.html