The Bag of Words Model and Its Python Implementation

Author: 不可能打工 | Published: 2020-08-20 15:00

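The bag-of-words model is a way of representing text as features.

Concretely, each word in the vocabulary is compared against the text to be represented: if the word does not appear, its entry is 0; if it does, its entry is the number of times it appears.

For example:
Sentence 1: 我/爱/知乎，知乎/真好。 (I / love / Zhihu, Zhihu / is great.)
Sentence 2: 我/爱/微博，微博/真好。 (I / love / Weibo, Weibo / is great.)
This gives the vocabulary = ['我', '爱', '知乎', '真好', '微博']

Since len(vocabulary) = 5, sentence 1 and sentence 2 are each represented as a 5-dimensional vector.

Sentence 1 is represented as [1,1,2,1,0]  # sentence 1 does not contain '微博'

Sentence 2 is represented as [1,1,0,1,2]  # sentence 2 does not contain '知乎'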
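The worked example above can be reproduced in a few lines. This is only a minimal sketch: it assumes the sentences have already been segmented into words, and the token lists and the helper name `to_bow` are illustrative rather than part of the original article. The vocabulary is kept in order of first appearance so that it matches the order used above.

```python
from collections import Counter

# Pre-segmented sentences from the example (word segmentation is assumed done).
sentence_1 = ["我", "爱", "知乎", "知乎", "真好"]
sentence_2 = ["我", "爱", "微博", "微博", "真好"]

# Build the vocabulary in order of first appearance across both sentences.
vocab = list(dict.fromkeys(sentence_1 + sentence_2))

def to_bow(tokens, vocab):
    """Return the count of each vocabulary word in tokens (0 if absent)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

print(vocab)
print(to_bow(sentence_1, vocab))  # [1, 1, 2, 1, 0]
print(to_bow(sentence_2, vocab))  # [1, 1, 0, 1, 2]
```

`Counter` returns 0 for a missing key, which is what gives absent words their 0 entry without any special handling.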

Python implementation

import numpy as np
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

# Step 1: Tokenize a sentence
def word_extraction(sentence):
    # Lowercase first so that capitalized stop words such as "The" and "I" are removed too
    words = sentence.lower().split()
    stop_words = set(stopwords.words('english'))
    cleaned_text = [w for w in words if w not in stop_words]
    return cleaned_text

# Step 2: Apply tokenization to all sentences
def tokenize(sentences):
    # Run step 1 on every sentence and collect the words into a sorted vocabulary
    words = []
    for sentence in sentences:
        w = word_extraction(sentence)
        words.extend(w)
    words = sorted(set(words))
    return words

# Step 3: Build vocabulary and generate vectors
def generate_bow(allsentences):
    vocab = tokenize(allsentences)
    print("Word List for Document \n{0} \n".format(vocab))
    for sentence in allsentences:
        words = word_extraction(sentence)
        bag_vector = np.zeros(len(vocab))
        # Count how often each vocabulary word occurs in this sentence
        for w in words:
            for i, word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1

        print("{0}\n{1}\n".format(sentence, bag_vector))

allsentences = ["Joe waited for the train", "The train was late", "Mary and Samantha took the bus",
"I looked for Mary and Samantha at the bus station",
"Mary and Samantha arrived at the bus station early but waited until noon for the bus"]

# generate_bow prints its results and returns nothing, so call it directly
generate_bow(allsentences)

Output:
Word List for Document 
['arrived', 'bus', 'early', 'joe', 'late', 'looked', 'mary', 'noon', 'samantha', 'station', 'took', 'train', 'waited'] 

Joe waited for the train
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1.]

The train was late
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0.]

Mary and Samantha took the bus
[0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0.]

I looked for Mary and Samantha at the bus station
[0. 1. 0. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0.]

Mary and Samantha arrived at the bus station early but waited until noon for the bus
[1. 2. 1. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1.]
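Scanning the whole vocabulary for every word gets slow as the vocabulary grows. One possible variant, sketched below, looks each word up in a word-to-index dictionary and returns the vectors instead of printing them. The function name `bow_vectors` and the tiny `STOP_WORDS` set are assumptions made for illustration so the snippet runs without NLTK; they are not part of the code above.

```python
# Hypothetical, deliberately small stop-word set; a real list (e.g. NLTK's) is much longer.
STOP_WORDS = {"the", "for", "was", "and", "i", "at", "but", "until"}

def tokenize(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

def bow_vectors(sentences):
    """Return (vocab, vectors) with one count vector per sentence."""
    tokenized = [tokenize(s) for s in sentences]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    index = {word: i for i, word in enumerate(vocab)}  # word -> column position
    vectors = []
    for tokens in tokenized:
        vec = [0] * len(vocab)
        for w in tokens:
            vec[index[w]] += 1  # one dictionary lookup per word
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = bow_vectors(["Joe waited for the train", "The train was late"])
print(vocab)    # ['joe', 'late', 'train', 'waited']
print(vectors)  # [[1, 0, 1, 1], [0, 1, 1, 0]]
```

Returning the vectors makes them usable as input to a later step such as a classifier, which printing alone does not allow.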


Article link: https://www.haomeiwen.com/subject/nnnsjktx.html
