Word Vectorization

Author: 西域记 | Published 2018-11-12 15:34

1. Converting text to vectors with CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
dialog = ['I have addicted into cyber security for years']
vect.fit(dialog)
print(vect.vocabulary_)

The output is a dictionary:

{'have': 3, 'cyber': 1, 'security': 5, 'for': 2, 'addicted': 0, 'years': 6, 'into': 4}

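In essence, every distinct word in the sentence is assigned an integer index. The indices follow the alphabetical order of the words, and the single-character token 'I' is missing because the default token_pattern only matches tokens of two or more characters.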
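To make the mapping concrete, here is a minimal pure-Python sketch of how such a vocabulary can be built. It is an illustration only, not scikit-learn's actual implementation; the function name build_vocabulary is hypothetical, and the only thing borrowed from CountVectorizer is its default token_pattern.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def build_vocabulary(documents):
    """Map each distinct lowercased token to its rank in sorted order."""
    tokens = set()
    for doc in documents:
        tokens.update(TOKEN_PATTERN.findall(doc.lower()))
    return {token: index for index, token in enumerate(sorted(tokens))}

dialog = ['I have addicted into cyber security for years']
vocabulary = build_vocabulary(dialog)
print(vocabulary)
```

The sketch yields the same seven entries as the dictionary above, although a Python dict may print them in a different order.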

2. From vectors to bag of words

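Vectorization on its own only maps strings to numbers. The bag-of-words model additionally counts how many times each word occurs, which is useful for text analysis.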

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
dialog = ['I have addicted into cyber security for years, after years of practice, '
          'I still can not say I am good at it']
vect.fit(dialog)
print(len(vect.vocabulary_))
print(vect.vocabulary_)
bag_of_words = vect.transform(dialog)
print(repr(bag_of_words))
print(bag_of_words.toarray())

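The result is printed in dense form (via toarray()). Each column index matches the index of a word in the vocabulary, and the number at that position is how many times the word occurs.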

18
{'it': 10, 'practice': 13, 'addicted': 0, 'after': 1, 'have': 8, 'years': 17, 'say': 14, 'good': 7, 'am': 2, 'of': 12, 'cyber': 5, 'still': 16, 'security': 15, 'for': 6, 'not': 11, 'into': 9, 'can': 4, 'at': 3}
<1x18 sparse matrix of type '<class 'numpy.int64'>'
    with 18 stored elements in Compressed Sparse Row format>
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2]]
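The counting step can be sketched in pure Python as follows. This is a simplified illustration under the same assumptions as CountVectorizer's defaults (lowercasing and the default token_pattern), not the library's own code; bag_of_words_row is a hypothetical helper name.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words_row(document, vocabulary):
    """Return a dense count vector: position i holds the count of the word with index i."""
    counts = Counter(TOKEN_PATTERN.findall(document.lower()))
    row = [0] * len(vocabulary)
    for token, index in vocabulary.items():
        row[index] = counts[token]
    return row

text = ('I have addicted into cyber security for years, after years of practice, '
        'I still can not say I am good at it')

# Build the vocabulary from the sorted distinct tokens, then count.
tokens = TOKEN_PATTERN.findall(text.lower())
vocabulary = {token: index for index, token in enumerate(sorted(set(tokens)))}
row = bag_of_words_row(text, vocabulary)

print(len(vocabulary))
print(row)
```

Only the last position holds a 2, because 'years' (index 17) is the only word that appears twice.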

3. From bag of words to n-grams

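A bag of words captures the word statistics of a whole article or passage, but not its meaning. Meaning usually depends on context, so the vectorization should preserve as much of the relationship between neighbouring words as possible if a machine is to interpret a sentence better. An n-gram treats n consecutive words as one unit. To use n-grams, change the settings of CountVectorizer():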

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Only ngram_range differs from the defaults; (2, 2) extracts bigrams only.
vect = CountVectorizer(input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None,
                 lowercase=True, preprocessor=None, tokenizer=None,
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(2,2), analyzer='word',
                 max_df=1.0, min_df=1, max_features=None,
                 vocabulary=None, binary=False, dtype=np.int64)

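Apart from ngram_range, all arguments are the defaults and can be omitted. (2, 2) means that both the lower and the upper bound on n are 2, so only bigrams are extracted. Running this on the same sentence gives: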

18
{'cyber security': 5, 'good at': 7, 'security for': 14, 'for years': 6, 'at it': 3, 'years after': 16, 'years of': 17, 'still can': 15, 'not say': 10, 'say am': 13, 'have addicted': 8, 'of practice': 11, 'after years': 1, 'can not': 4, 'addicted into': 0, 'practice still': 12, 'am good': 2, 'into cyber': 9}
<1x18 sparse matrix of type '<class 'numpy.int64'>'
    with 18 stored elements in Compressed Sparse Row format>
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
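The bigram features above can be reproduced with a short pure-Python sketch. It is a simplified illustration rather than scikit-learn's implementation, and word_ngrams is a hypothetical helper. Note that n-grams are built from the tokens that survive tokenization, which is why 'say am' appears even though the original text reads "say I am".

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, min_n, max_n):
    """Return all word n-grams with min_n <= n <= max_n, joined by single spaces."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

text = ('I have addicted into cyber security for years, after years of practice, '
        'I still can not say I am good at it')

# Bigrams only, as with ngram_range=(2, 2).
bigrams = word_ngrams(text, 2, 2)
bigram_vocabulary = {gram: index for index, gram in enumerate(sorted(set(bigrams)))}

# Unigrams and bigrams together, as with ngram_range=(1, 2).
mixed = set(word_ngrams(text, 1, 2))

print(len(bigram_vocabulary))
print(len(mixed))
```

With a range of (1, 2) the feature set is the union of the 18 single words and the 18 bigrams, which shows how quickly the vocabulary grows as the range widens.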

It is worth recording the parameters of CountVectorizer() in detail here.

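input='content',                    // string, one of {'filename', 'file', 'content'}. If 'filename', the sequence passed to fit is
                                    expected to be a list of file names whose contents are read to obtain the raw text to analyze.
encoding='utf-8',                   // string, default 'utf-8'. If bytes or files are given to analyze, this encoding is used to decode them.
decode_error='strict',              // one of {'strict', 'ignore', 'replace'}. What to do if the byte sequence to analyze contains
                                    characters that are not valid in the given encoding. The default, 'strict', raises a
                                    UnicodeDecodeError. The other values are 'ignore' and 'replace'.
strip_accents=None,                 // one of {'ascii', 'unicode', None}. Remove accents during preprocessing. 'ascii' is a fast method
                                    that only works on characters with a direct ASCII mapping. 'unicode' is slightly slower and
                                    works on any character. None (default) does nothing.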
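lowercase=True,                     // boolean. Convert all characters to lowercase before tokenizing.
preprocessor=None,                  // callable or None. Override the preprocessing (string transformation) stage while keeping the
                                    tokenizing and n-gram generation steps.
tokenizer=None,                     // callable or None. Override the string tokenization step while keeping the preprocessing and
                                    n-gram generation steps. Only applies if analyzer == 'word'.
stop_words=None,                    // string {'english'}, list, or None (default). If 'english', a built-in English stop word list
                                    is used. If a list, it is assumed to contain stop words, all of which are removed from the
                                    resulting tokens. Only applies if analyzer == 'word'. If None, no stop words are used.
token_pattern=r"(?u)\b\w\w+\b",     // string. Regular expression that defines what counts as a token. The default selects tokens of
                                    two or more alphanumeric characters; punctuation is completely ignored and always treated as
                                    a token separator.
ngram_range=(2,2),                  // tuple (min_n, max_n). Lower and upper bound on the n values of the n-grams to extract. All
                                    values of n with min_n <= n <= max_n are used.
analyzer='word',                    // string, one of {'word', 'char', 'char_wb'}, or callable.
                                    'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the
                                    edges of words are padded with spaces.
                                    If a callable is passed, it is used to extract the sequence of features from the raw, unprocessed input.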
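max_df=1.0,                         // float in range [0.0, 1.0] or int, default=1.0. When building the vocabulary, ignore terms whose
                                    document frequency is strictly higher than the given threshold (corpus-specific stop words).
                                    A float is a proportion of documents, an integer is an absolute count.
                                    This parameter is ignored if vocabulary is not None.
min_df=1,                           // float in range [0.0, 1.0] or int, default=1. When building the vocabulary, ignore terms whose
                                    document frequency is strictly lower than the given threshold. This value is also called the
                                    cut-off in the literature. A float is a proportion of documents, an integer is an absolute count.
                                    This parameter is ignored if vocabulary is not None.
max_features=None,                  // int or None, default=None. If not None, build a vocabulary that only keeps the top max_features
                                    terms ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.
vocabulary=None,                    // Mapping or iterable, optional. Either a mapping (e.g. a dict) whose keys are terms and whose
                                    values are indices in the feature matrix, or an iterable of terms. If not given, the vocabulary
                                    is determined from the input documents. Indices in a mapping must not be repeated and must not
                                    leave any gap between 0 and the largest index.
binary=False,                       // boolean. If True, all non-zero counts are set to 1. This is useful for discrete probabilistic
                                    models that model binary events rather than integer counts.
dtype=np.int64                      // type of the matrix returned by fit_transform() or transform().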
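The min_df and max_df thresholds are easier to follow with a small worked example. The sketch below is a pure-Python approximation of the filtering rule described above, not scikit-learn's code; filter_by_df and the three-document corpus are hypothetical and exist only for illustration.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def filter_by_df(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    An int threshold is an absolute document count; a float is a
    proportion of the number of documents.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document.
        doc_freq.update(set(TOKEN_PATTERN.findall(doc.lower())))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, freq in doc_freq.items() if low <= freq <= high)

# Hypothetical corpus: 'is' occurs in all 3 documents, 'cyber' and 'security' in 2.
docs = ["cyber security is hard", "cyber security is fun", "practice is key"]
kept = filter_by_df(docs, min_df=2, max_df=0.9)
print(kept)
```

Here min_df=2 drops the words that occur in a single document, while max_df=0.9 (that is, 2.7 documents) drops 'is', which occurs in all three.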
