浅谈Single-Pass算法

作者: 致Great | 来源:发表于2020-11-30 23:26 被阅读0次

浅谈Single-Pass算法
Single-Pass
single-pass算法
浅谈算法和数据结构
(313)红黑树-java实现
1、密码处理
【重试】如何优雅地重试，指数退避
基于Single-Pass的话题发现算法（Python）
再谈AR中的图像识别算法
Android 加密算法AES/RSA

Single-Pass算法又称单通道法或单遍法,是流式数据聚类的经典方法。对于依次到达的数据流,该方法按输入顺序每次处理一个数据,依据当前数据与已有类的匹配度大小,将该数据判为已有类或者创建一个新的数据类,实现流式数据的增量和动态聚类，适合对流数据进行挖掘,而且算法的时间效率高;不足之处主要表现在该方法具有输入次序依赖特性,即对于同一聚类对象按不同的次序输入,会出现不同的聚类结果。

Single-Pass算法基本流程

假设我们有一个N篇文档的语料，single-pass聚类的基本流程如下：
（1）随机选取一篇文章，其文本向量记为d；
（2） d逐一与已有的话题中各报道进行相似度计算，并取最大者作为与该话题的相似度（single-link策略）；如果是第一篇文章，则创建一个新话题（或者说新标签）。
（3）在所有话题间选出与d相似度最大的一个，以及此时的相似度值；
（4）如果相似度大于阈值TC，d所对应的互联网文本被分配给这个话题模型文本类，跳至（6）；
（5）如果相似度值小于阈值TC，d所对应的文本不属于已有的话题，创建新话题，同时把这篇文本归属创建的新话题模型文本类；
（6）本次聚类结束，等待接收新的文本到来。
阈值在[0,1]之间，阈值设置的越高，得到的簇粒度越小（簇内文本数量少），簇的个数越多；相反，阈值设置的越低，得到的簇粒度越大（簇内文本数量多），簇的个数越少。

Single-Pass算法实例

假设我们有下面5个文档Doci以及词袋表示向量Ti，然后我们尝试使用single pass算法来聚类文档。

	Doc1	Doc2	Doc3	Doc4	Doc5
T1	1	2	0	0	1
T2	3	1	2	3	0
T3	3	0	0	0	1
T4	2	1	0	3	0
T5	2	2	1	5	1

首先我们从Doc1开始，并且假设它的聚类中心为C1，此时C1只有一篇文档Doc1，所以我们可以使用Doc1的向量表示C1

C1 = <1, 3, 3, 2, 2>

然后开始选择下一篇文档Doc2，然后与当前所有聚类中心进行计算相似度，此时只有一个聚类中心C1，简单地，我们这里使用点积作为相似度计算方法：

SIM(Doc2,C1) = 1*2 + 1*3 + 0*3 + 1*2 + 2*2 = 11

此时我们需要一个超参数threshold来控制类别的判断，假设threshold为10，如果一个文档与当前聚类中心的相似度大于1ode，我们就将该文档添加到当前聚类中，否则的话产生一个新的聚类中心。
在上面的例子中，

SIM(Doc2,C1)=11>10，

所以我们将Doc2添加到聚类中心C1中。现在我们要重新计算C1的聚类中心向量(此时聚类包含两个文档Doc1和Doc2)。我们使用平均的方法更新中心向量：

C1 = <3/2, 4/2, 3/2, 3/2, 4/2>

下一步我们开始选择Doc3，此时只有一个聚类中心C1，所以我们只需要计算Doc3与C1的相似度即可：

SIM(Doc3, C1) = 0 + 8/2 + 0 + 0 + 4/2 = 6

此时Doc3与C1的相似度小于阈值，所以我们使用Doc3产生一个新的类别：

C1 = {Doc1, Doc2}
C2 = {Doc3}

我们继续选择下一个文档Doc4，此时我们需要计算Doc4与C1和C2的相似度，并且选择一个相似度最大的类别作为Doc4的标签：

SIM(Doc4, C1) = <0, 3, 0, 3, 5> . <3/2, 4/2, 3/2, 3/2, 4/2>
            = 0 + 12/2 + 0 + 9/2 + 20/2 = 20.5

SIM(Doc4, C2) = <0, 3, 0, 3, 5> . <0, 2, 0, 0, 1>
            = 0 + 6 + 0 + 0 + 5 = 11

此时两个相似度都大于10，所以我们选择一个较大值的类别C1，将Doc4放到C1类别中，此时我们聚类中心的文档情形如下：

C1 = {Doc1, Doc2, Doc4}
C2 = {Doc3}

更新聚类中心向量：

C1 = <3/3, 7/3, 3/3, 6/3, 9/3>
C2 = <0, 2, 0, 0, 1>

现在只剩下一篇文档Doc5，然后分别于C1和C2计算相似度：

SIM(Doc5, C1) = <1, 0, 1, 0, 1> . <3/3, 7/3, 3/3, 6/3, 9/3>
            = 3/3 + 0 + 3/3 + 0 + 9/3 = 5

SIM(Doc5, C2) = <1, 0, 1, 0, 1> . <0, 2, 0, 0, 1>
            = 0 + 0 + 0 + 0 +1 = 1

这些相似性度没有一个超过阈值。因此，T5 需要放进新的聚类中C3。没有其他未分类的文档了，因此我们完成了一次Single Pass算法。最后的聚类结果是：

C1 = {T1, T2, T4}
C2 = {T3}
C3 = {T5}

显然，此方法的结果高度依赖于使用的相似性阈值以及选择文档的顺序

代码实践

import os
import re
import json
import math
import numpy as np
from gensim import corpora, models, similarities, matutils
from smart_open import  smart_open
import pandas as pd
from pyltp import SentenceSplitter
from  textrank4zh import TextRank4Keyword,TextRank4Sentence
from tkinter import _flatten
from pyltp import Segmentor,Postagger

class Single_Pass_Cluster(object):
    
    def __init__(self,
                 filename,
                 stop_words_file = '停用词汇总.txt',
                 theta = 0.5, 
                 LTP_DATA_DIR = r'D:\ltp-models\\',  # ltp模型目录的路径
                 segmentor = Segmentor(),
                 postagger = Postagger(),
                  
                ):
        
        self.filename = filename
        self.stop_words_file = stop_words_file
        self.theta = theta 
        self.LTP_DATA_DIR = LTP_DATA_DIR 
        self.cws_model_path = os.path.join(self.LTP_DATA_DIR, 'cws.model')
        self.pos_model_path = os.path.join(self.LTP_DATA_DIR, 'pos.model')  
        self.segmentor = segmentor  # 初始化实例
        self.segmentor.load_with_lexicon(self.cws_model_path, self.LTP_DATA_DIR + 'dictionary.txt')  # 加载模型
        self.postagger = postagger  # 初始化实例
        self.postagger.load(self.pos_model_path)  # 加载模型         
    
        
    def loadData(self,filename):
        Data = []
        i = 0
        with smart_open(self.filename,encoding='utf-8') as f:        
            texts = [list(SentenceSplitter.split(i.strip().strip('\ufeff'))) for i in f.readlines()]
            print('未切割前的语句总数有{}条...'.format(len(texts)))
            print ("............................................................................................")  
            texts = [i.strip() for i in list(_flatten(texts)) if len(i) > 5]
            print('切割后的语句总数有{}条...'.format(len(texts)))
            for line in texts:
                i  += 1
                Data.append(line )
        return Data
        
    def word_segment(self,sentence):
        stopwords = [line.strip() for line in open( self.stop_words_file,encoding='utf-8').readlines()]
        post_list = ['n','nh','ni','nl','ns','nz','j','ws','a','z','b']
        sentence = sentence.strip().replace('。','').replace('」','').replace('//','').replace('_','').replace('-','').replace('\r','').replace('\n','').replace('\t','').replace('@','').replace(r'\\','').replace("''",'')
        words = self.segmentor.segment(sentence.replace('\n',''))  # 分词
        postags = self.postagger.postag(words)  # 词性标注
        dict_data = dict(zip(words,postags))
        table = {k: v for k, v in dict_data.items() if v in post_list}
        words = list(table.keys())
        word_segmentation = []
        for word in words:
            if word == ' ':
                continue
            if word not in stopwords:
                word_segmentation.append(word)
        return word_segmentation
    
    def get_Tfidf_vector_representation(self,word_segmentation,pivot= 10, slope = 0.1):
        #得到文本数据的空间向量表示
        dictionary = corpora.Dictionary(word_segmentation)
        corpus = [dictionary.doc2bow(text) for text in word_segmentation]
        tfidf = models.TfidfModel(corpus,pivot=pivot, slope =slope)
        corpus_tfidf = tfidf[corpus]
        return corpus_tfidf
    
    def get_Doc2vec_vector_representation(self,word_segmentation):
        #得到文本数据的空间向量表示
        
        corpus_doc2vec = [get_avg_feature_vector(i, model, num_features = 50) for i in word_segmentation]
        return corpus_doc2vec 
    
    def getMaxSimilarity(self,dictTopic, vector):
        maxValue = 0
        maxIndex = -1
        for k,cluster in dictTopic.items():
            oneSimilarity = np.mean([matutils.cossim(vector, v) for v in cluster])
            #oneSimilarity = np.mean([cosine_similarity(vector, v) for v in cluster])
            if oneSimilarity > maxValue:
                maxValue = oneSimilarity
                maxIndex = k
        return maxIndex, maxValue
    
    def single_pass(self,corpus,texts,theta):
        dictTopic = {}
        clusterTopic = {}
        numTopic = 0 
        cnt = 0
        for vector,text in zip(corpus,texts): 
            if numTopic == 0:
                dictTopic[numTopic] = []
                dictTopic[numTopic].append(vector)
                clusterTopic[numTopic] = []
                clusterTopic[numTopic].append(text)
                numTopic += 1
            else:
                maxIndex, maxValue = self.getMaxSimilarity(dictTopic, vector)
                #将给定语句分配到现有的、最相似的主题中
                if maxValue >= theta:
                    dictTopic[maxIndex].append(vector)
                    clusterTopic[maxIndex].append(text)
                
                #或者创建一个新的主题
                else:
                    dictTopic[numTopic] = []
                    dictTopic[numTopic].append(vector)
                    clusterTopic[numTopic] = []
                    clusterTopic[numTopic].append(text)
                    numTopic += 1
            cnt += 1
            if cnt % 500 == 0:
                print ("processing {}...".format(cnt))
        return dictTopic, clusterTopic  
    
    def fit_transform(self,theta=0.5):
        datMat = self.loadData(self.filename)  
        word_segmentation = []
        for i in range(len(datMat)):
            word_segmentation.append(self.word_segment(datMat[i]))          
        print ("............................................................................................")  
        print('文本已经分词完毕 !')
        
        #得到文本数据的空间向量表示
        corpus_tfidf = self.get_Tfidf_vector_representation(word_segmentation)
        #corpus_tfidf =  self.get_Doc2vec_vector_representation(word_segmentation)
        dictTopic, clusterTopic = self.single_pass(corpus_tfidf, datMat, theta)
        print ("............................................................................................")  
        print( "得到的主题数量有: {} 个 ...".format(len(dictTopic)))
        print ("............................................................................................\n")  
        #按聚类语句数量对主题进行排序，找到重要的聚类群
        clusterTopic_list = sorted(clusterTopic.items(),key=lambda x: len(x[1]),reverse=True)
        for k in clusterTopic_list[:30]:
            cluster_title = '\n'.join(k[1]) 
            #print(''.join(cluster_title))
            #得到每个聚类中的的主题关键词
            word = TextRank4Keyword()
            word.analyze(''.join(self.word_segment(''.join(cluster_title))),window = 5,lower = True)
            w_list = word.get_keywords(num = 10,word_min_len = 2)
            sentence = TextRank4Sentence()
            sentence.analyze('\n'.join(k[1]) ,lower = True)
            s_list = sentence.get_key_sentences(num = 3,sentence_min_len = 5)[:30]
            print ("【主题索引】:{} \n【主题声量】：{} \n【主题关键词】： {} \n【主题中心句】 ：\n{}".format(k[0],len(k[1]),','.join([i.word for i in w_list]),'\n'.join([i.sentence for i in s_list])))
            print ("-------------------------------------------------------------------------")

single_pass_cluster = Single_Pass_Cluster('http://docs.bosonnlp.com/_downloads/text_comments.txt',
      stop_words_file= '/home/kesci/work/停用词汇总.txt')
single_pass_cluster.fit_transform(theta = 0.15)

参考资料

浅谈Single-Pass算法
Single-Pass算法又称单通道法或单遍法,是流式数据聚类的经典方法。对于依次到达的数据流,该方法按输入顺序每...
Single-Pass
single-pass 是一种简单的聚类算法 https://blog.csdn.net/xinxing__818...
single-pass算法
Example of Single Pass Clustering Technique single-pass-c...
浅谈算法和数据结构
注：采转归档，自己学习查询使用浅谈算法和数据结构: 一栈和队列浅谈算法和数据结构: 二基本排序算法浅谈算法和...
(313)红黑树-java实现
引言根据《算法》第4版。编写红黑树。理论参见: 浅谈算法和数据结构: 八平衡查找树之2-3树浅谈算法和数...
1、密码处理
BCryptPasswordEncoder算法介绍浅谈使用springsecurity中的BCryptPassw...
【重试】如何优雅地重试，指数退避
浅谈指数退避(backoff)算法http://www.xiaojiejing.com/2018/12/06/%e...
基于Single-Pass的话题发现算法（Python）
话题发现和跟踪是指新闻专线和广播新闻等来源的新闻数据流中自动地发现话题并把话题相关的内容组织到一起的技术。通过增量...
再谈AR中的图像识别算法
之前在《浅谈移动平台创新玩法》简单的猜测了easyar中使用的图像识别算法，基于图片指纹的哈希算法的图片检索。后...
Android 加密算法AES/RSA
Android中的MD5和RSA加密Android应用安全开发之浅谈加密算法的坑对称加密在对称加密算法中，数据...