Reproducing the Basic Functionality of CountVectorizer

Author: 小透明苞谷 | Published: 2020-03-18 18:31

sklearn.feature_extraction.text provides four text feature extraction classes:

CountVectorizer
TfidfVectorizer
TfidfTransformer
HashingVectorizer

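CountVectorizer converts a collection of text documents into a matrix of term counts. Its fit_transform method learns the vocabulary and counts how many times each term occurs in each document.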

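Parameters

Attributes

Attribute	Description
vocabulary_	Mapping from each term to its column index; a dict
get_feature_names_out()	All terms in the vocabulary, ordered by column index (a method rather than an attribute; named get_feature_names() before scikit-learn 1.0)
stop_words_	Terms that were ignored because of max_df, min_df, or max_features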

Methods

Method	Description
fit_transform(raw_documents[, y])	Learn the vocabulary and return the document-term matrix
fit(raw_documents[, y])	Learn the vocabulary dictionary from the documents
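
The split between fit and transform can be illustrated without scikit-learn. The following is only a minimal sketch: the helper names fit_vocabulary and transform_counts are hypothetical, and tokenization is simplified to lowercasing and whitespace splitting, which is not the library's actual tokenizer. It shows that the vocabulary is fixed during fitting, so terms first seen at transform time get no column.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Learn a term -> column index mapping, with terms sorted alphabetically."""
    terms = sorted({term for doc in docs for term in doc.lower().split()})
    return {term: idx for idx, term in enumerate(terms)}

def transform_counts(docs, vocabulary):
    """Count vocabulary terms per document; terms outside the vocabulary are skipped."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        row = [0] * len(vocabulary)
        for term, count in counts.items():
            if term in vocabulary:
                row[vocabulary[term]] = count
        rows.append(row)
    return rows

train_docs = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
vocabulary = fit_vocabulary(train_docs)

# "horse" was not seen during fitting, so it contributes nothing to the row
new_rows = transform_counts(["cat cat horse"], vocabulary)
print(vocabulary)
print(new_rows)
```

Calling transform_counts(train_docs, vocabulary) on the training documents themselves corresponds to what fit_transform does in a single step.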

Introductory example

from sklearn.feature_extraction.text import CountVectorizer

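texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]  # each list element is a string representing one document
cv = CountVectorizer()  # create the bag-of-words vectorizer
cv_fit = cv.fit_transform(texts)
# The call above is equivalent to the following two lines:
# cv.fit(texts)
# cv_fit = cv.transform(texts)

print(cv.get_feature_names_out())  # ['bird' 'cat' 'dog' 'fish']  the learned vocabulary as an array (use get_feature_names() on scikit-learn older than 1.0)

print(cv.vocabulary_)              # {'dog': 2, 'cat': 1, 'fish': 3, 'bird': 0}  dict; key: term, value: index of that term (feature), which is also its column in the count matrix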
# See also: https://blog.csdn.net/weixin_38278334/article/details/82320307

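print(cv_fit)
# Sparse output, one line per non-zero entry; the ordering and any header line may vary by library version
# (0, 3)  1   document 0, term at vocabulary index 3, count 1
# (0, 1)  1
# (0, 2)  1
# (1, 1)  2
# (1, 2)  1
# (2, 0)  1
# (2, 3)  1
# (3, 0)  1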

print(cv_fit.toarray())  # .toarray() converts the sparse matrix into a dense NumPy array
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]

print(cv_fit.toarray().sum(axis=0))  # total count of each term across all documents
#[2 3 2 2]
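
To make the relationship between the sparse listing and the dense array concrete, here is a small sketch in plain Python that rebuilds the dense matrix from the (document, column, count) entries printed by print(cv_fit). The variable names are illustrative only, and SciPy's sparse matrices store the data differently internally.

```python
# (document index, vocabulary index, count) entries from the sparse output;
# the columns are bird, cat, dog, fish
triplets = [
    (0, 3, 1), (0, 1, 1), (0, 2, 1),
    (1, 1, 2), (1, 2, 1),
    (2, 0, 1), (2, 3, 1),
    (3, 0, 1),
]
n_docs, n_terms = 4, 4

# Start from all zeros and fill in only the non-zero entries
dense = [[0] * n_terms for _ in range(n_docs)]
for row, col, count in triplets:
    dense[row][col] = count

# Summing each column gives the corpus-wide count of each term
column_totals = [sum(column) for column in zip(*dense)]
print(dense)
print(column_totals)
```

Every cell that has no entry in the sparse listing is simply zero in the dense form, which is why the sparse representation is the more compact one for large vocabularies.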

Reproduction

The implementation covers:

Text preprocessing such as stop word removal
fit
transform
n-gram support
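
As a standalone illustration of the n-gram step, the sketch below slides a window of n tokens over a token list. The function name make_ngrams is hypothetical. It uses a single fixed n, as the class in this post does, whereas scikit-learn's ngram_range parameter takes a (min_n, max_n) pair.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a single space."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # When n exceeds the number of tokens the range is empty and no n-grams are produced
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["dog", "cat", "fish"]
print(make_ngrams(tokens, 1))  # ['dog', 'cat', 'fish']
print(make_ngrams(tokens, 2))  # ['dog cat', 'cat fish']
print(make_ngrams(tokens, 4))  # []
```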

import numpy as np

with open('data.txt', 'r', encoding='utf-8') as f:
    data = [i.strip() for i in f.readlines()]

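class MyCountVectorizer(object):
    def __init__(self, n=1, remove_stop_words=False):
        self.n = n  # n-gram size
        self.remove_stop_words = remove_stop_words
        # Keep state per instance; mutable class-level attributes would be shared by all instances
        self.vocabulary = {}
        self.corpus = []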
        
    def clean(self, corpus):
        stop_words = set()
        if self.remove_stop_words:
            # Load the stop word list (one word per line)
            with open('stopwords.txt', 'r', encoding='utf-8') as f:
                stop_words = {w.strip() for w in f}
        # Punctuation characters that are replaced by spaces
        punctuation = r"""!"'#$%&\()*+,-./:;<=>?@[]^_`{|}~“”‘’"""
        for text in corpus:
            # Lower case
            text = text.lower()
            for c in punctuation:
                text = text.replace(c, ' ')
            # Keep alphanumeric tokens longer than one character that are not stop words
            word_ls = [word for word in text.split(' ')
                       if word.isalnum() and len(word) > 1 and word not in stop_words]
            # Build the n-grams: every run of n consecutive tokens, joined by a space
            n_gram_word_ls = [' '.join(word_ls[idx: idx + self.n])
                              for idx in range(len(word_ls) - self.n + 1)]
            self.corpus.append(n_gram_word_ls)
    
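    def fit(self, corpus):
        # Create a dictionary of terms which map to columns of the term-frequency matrix.
        # Reset the state first so that repeated calls to fit do not accumulate documents.
        self.corpus = []
        self.vocabulary = {}
        self.clean(corpus)
        for row in self.corpus:
            for word in row:
                if word not in self.vocabulary:
                    self.vocabulary[word] = len(self.vocabulary)
        return self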
    
    def transform(self):
        # Create a term-frequency matrix of appropriate size (document size * vocabulary size)
        tf_matrix = []
        size = len(self.vocabulary)
        for doc in self.corpus:
            # Count how often the word appears in the document
            word_count = {}
            for word in doc:
                word_count[word] = word_count.get(word, 0) + 1
            # Construct the term-frequency vector of the row
            row = [0 for i in range(size)]
            for word, value in word_count.items():
                row[self.vocabulary[word]] = value
            tf_matrix.append(row)
        tf_matrix = np.array(tf_matrix)
        return tf_matrix
    
    def get_vocab(self):
        # Returns the dictionary of terms
        return self.vocabulary
    
cv = MyCountVectorizer(1, True)
cv.fit(data)
print(cv.get_vocab())
term_frequency_matrix = cv.transform()
print(term_frequency_matrix.shape)
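
One difference from the introductory example is worth noting: MyCountVectorizer assigns column indices in order of first occurrence, while the CountVectorizer output shown earlier has its columns in alphabetical order. To compare the two results directly, the columns can be reordered. The sketch below uses a hypothetical helper, sort_columns, and works on plain lists; it assumes n=1 and no stop word removal on the toy corpus.

```python
def sort_columns(vocabulary, matrix):
    """Reorder the columns so that the terms appear in alphabetical order."""
    terms = sorted(vocabulary)
    old_positions = [vocabulary[term] for term in terms]
    sorted_vocabulary = {term: idx for idx, term in enumerate(terms)}
    sorted_matrix = [[row[pos] for pos in old_positions] for row in matrix]
    return sorted_vocabulary, sorted_matrix

# First-occurrence indexing of the toy corpus from the introductory example
vocabulary = {"dog": 0, "cat": 1, "fish": 2, "bird": 3}
matrix = [
    [1, 1, 1, 0],  # "dog cat fish"
    [1, 2, 0, 0],  # "dog cat cat"
    [0, 0, 1, 1],  # "fish bird"
    [0, 0, 0, 1],  # "bird"
]

sorted_vocabulary, sorted_matrix = sort_columns(vocabulary, matrix)
print(sorted_vocabulary)
print(sorted_matrix)
```

After reordering, the rows match the dense array from the introductory example. For a NumPy array, indexing with tf_matrix[:, old_positions] would perform the same column reordering.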

Reference:
sklearn: CountVectorizer详解 (CountVectorizer explained in detail)