Natural Language Processing: Building a Text Vector Space

Author: J_101 | Published 2017-12-28 09:45

    1. Background

    2. Source code

    • System environment

    python 3.6
    scikit-learn==0.19.1

    # -*- coding: utf-8 -*-
    
    import os
    import math
    import numpy as np
    
    '''
    Build a text vector space model without using the NLTK or scikit-learn packages
    
    reference:
    https://mp.weixin.qq.com/s/DisMF8frY2pkpGMfrWk4Wg
    '''
    
    
    def load_doc_list(file):
        # Each line of the file is treated as one document
        with open(file, 'r', encoding='utf-8') as f:
            return f.read().splitlines()
    
    
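    '''
    Step 1: Basic term frequencies
    
    frequencies: counts how often each word occurs in each line (document) of the text.
                 Problem: the counters differ in size, so the documents do not share one vector space.
    build_lexicon: builds the vocabulary so that every document maps to a feature vector of the same length
    tf: calls freq to count how many times a term occurs in a document
    '''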
    
    
    def frequencies(doc_list):
        from collections import Counter
        # One Counter per document, mapping each word to its number of occurrences
        return [Counter(doc.split()) for doc in doc_list]
    
    
    def build_lexicon(corpus):
        lexicon=set()
        for doc in corpus:
            lexicon.update(doc.split())
        # Sorted so that the column order of the vectors is the same on every run
        return sorted(lexicon)
    
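    def tf(term,doc):
        # Term frequency: the raw count of term in doc
        return freq(term,doc)
    
    def freq(term,doc):
        # Count whitespace-separated tokens that match term exactly
        return doc.split().count(term)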
    
    
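    '''
    Step 2: Normalizing vectors to L2 Norm = 1
    If some words occur very frequently within a single document, they distort the analysis.
    We therefore rescale each term-frequency vector so that it becomes more representative.
    In other words, we normalize the vectors so that the L2 norm of each one equals 1.
    l2_normalizer: divides every element of a vector by the vector's L2 norm
    '''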
    
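    def l2_normalizer(vec):
        # L2 norm: square root of the sum of squared elements
        norm = math.sqrt(sum(el**2 for el in vec))
        if norm == 0:
            # An all-zero vector (e.g. from an empty line) cannot be scaled
            return list(vec)
        return [(el / norm) for el in vec]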
    
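    '''
    Step 3: Inverse document frequency (IDF) weighting
    Use the inverse document frequency (IDF) to adjust the weight of each word.
    For every word in the vocabulary we have a general measure of information that reflects its relative frequency across the whole corpus.
    Recall that this measure is an "inverse": the smaller the value, the more frequently the word occurs in the corpus.
    To obtain the TF-IDF weighted vectors, one simple calculation is needed: tf * idf.
    numDocsContaining: counts the documents that contain a word
    idf: computes the inverse document frequency of a word
    build_idf_matrix: places the IDF vector on the diagonal of a square matrix
    '''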
    def numDocsContaining(word, doclist):
        doccount = 0
        for doc in doclist:
            if freq(word, doc) > 0:
                doccount +=1
        return doccount
    
    
    def idf(word, doclist):
        n_samples = len(doclist)
        df = numDocsContaining(word, doclist)
        # Parentheses are required: n_samples / 1+df would evaluate as (n_samples / 1) + df
        return math.log(n_samples / (1 + df))
    
    
    def build_idf_matrix(idf_vector):
        '''
        Turn the IDF vector into a BxB matrix whose diagonal is the IDF vector,
        so that the IDF matrix can be multiplied with each term-frequency vector
        '''
        idf_mat = np.zeros((len(idf_vector), len(idf_vector)))
        np.fill_diagonal(idf_mat, idf_vector)
        return idf_mat
    
    
    
    if __name__ == '__main__':
        # Paths are specific to the author's setup: ./python/vector.txt, one document per line
        os.chdir('./python')
        file="vector.txt"
        doc_list=load_doc_list(file)
        
        # counters=frequencies(doc_list)
        # for counter in counters:
        #     print(counter)
        
        vocabulary=build_lexicon(doc_list)
        # print(vocabulary)
        doc_term_matrix =[]
        for doc in doc_list:
            # print("doc = ",doc)
            doc_term_vector = [tf(word, doc) for word in vocabulary]
            # print("vec = ",doc_term_vector)
            doc_term_matrix.append(doc_term_vector)
        # print(doc_term_matrix)
        doc_term_matrix_l2 = []
        for vec in doc_term_matrix:
            doc_term_matrix_l2.append(l2_normalizer(vec))
        # print('A regular old document term matrix: ')
        # print(np.matrix(doc_term_matrix))
        # print('A document term matrix with row-wise L2 norms of 1:')
        # print(np.matrix(doc_term_matrix_l2))
    
        idf_vector = [idf(word, doc_list) for word in vocabulary]
        print('Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']')
        print('The inverse document frequency vector is [' + ', '.join(
            format(freq, 'f') for freq in idf_vector) + ']')
        
        idf_matrix = build_idf_matrix(idf_vector)
        # print(idf_matrix)
    
        #tf*idf
        doc_term_matrix_tfidf = []
        #performing tf-idf matrix multiplication
        for tf_vector in doc_term_matrix:
            doc_term_matrix_tfidf.append(np.dot(tf_vector, idf_matrix))
    
        #normalizing
        doc_term_matrix_tfidf_l2 = []
        for tf_vector in doc_term_matrix_tfidf:
            doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))
        print(vocabulary)
        print(np.matrix(doc_term_matrix_tfidf_l2)) # np.matrix() just to make it easier to look at
    
    
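The idf weighting in step 3 divides the corpus size by one plus the document frequency. Since `/` binds tighter than `+` in Python, the parentheses around `1 + df` decide whether the weight falls or rises as a word becomes more common. A minimal sketch, using the hypothetical helper names `idf_smoothed` and `idf_unparenthesized` (neither is part of the script above) and an assumed corpus size of 4:

```python
import math


def idf_smoothed(n_samples, df):
    # log(N / (1 + df)): decreases as the word appears in more documents
    return math.log(n_samples / (1 + df))


def idf_unparenthesized(n_samples, df):
    # n_samples / 1 + df parses as (n_samples / 1) + df,
    # so this value grows with df and is no longer an "inverse" frequency
    return math.log(n_samples / 1 + df)


if __name__ == "__main__":
    n = 4  # assumed corpus size, for illustration only
    for df in (1, 2, 4):
        print(df, round(idf_smoothed(n, df), 4), round(idf_unparenthesized(n, df), 4))
```

With this smoothing, a word that occurs in every document gets a slightly negative weight, because N / (1 + N) is less than 1.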

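The three steps can also be tried on a small in-memory corpus without a `vector.txt` file. The following is a minimal sketch under a few assumptions: the sample sentences are made up for illustration, the vocabulary is sorted so that the column order is reproducible, and the helper name `tfidf_rows` is hypothetical. It multiplies tf and idf element by element, which is mathematically the same as multiplying each term-frequency vector by the diagonal IDF matrix.

```python
import math


def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalized tf-idf vector."""
    tokenized = [doc.split() for doc in docs]
    # Sorted vocabulary so every run yields the same column order
    vocab = sorted({word for words in tokenized for word in words})
    n_docs = len(docs)
    # idf = log(N / (1 + df)), with df = number of documents containing the word
    idf = []
    for word in vocab:
        df = sum(1 for words in tokenized if word in words)
        idf.append(math.log(n_docs / (1 + df)))
    rows = []
    for words in tokenized:
        # Raw term counts times idf, element by element
        weighted = [words.count(word) * w for word, w in zip(vocab, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        # Leave all-zero vectors unscaled to avoid dividing by zero
        rows.append([x / norm for x in weighted] if norm else weighted)
    return vocab, rows


if __name__ == "__main__":
    sample_docs = [  # made-up sentences, not the contents of vector.txt
        "julie loves me more than linda loves me",
        "jane likes me more than julie loves me",
        "he likes basketball more than baseball",
    ]
    vocab, rows = tfidf_rows(sample_docs)
    print(vocab)
    for row in rows:
        print([round(x, 3) for x in row])
```

Working on plain lists keeps the sketch free of dependencies; the NumPy version in the script is equivalent as long as the vocabulary order is the same for the tf and idf vectors.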
    3. References:

    1. http://python.jobbole.com/81311/
    2. http://stanford.edu/~rjweiss/public_html/IRiSS2013/text2/notebooks/tfidf.html


    Article link: https://www.haomeiwen.com/subject/evnxgxtx.html