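The bag-of-words model is a way of representing text as features.
Concretely, each word in the vocabulary is checked against the text to be represented: if the word does not occur, write 0; if it does, write the number of times it occurs.
For example: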
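Sentence 1: 我/爱/知乎,知乎/真好。 (I / love / Zhihu, Zhihu / is great.)
Sentence 2: 我/爱/微博,微博/真好。 (I / love / Weibo, Weibo / is great.)
This gives vocabulary = ['我', '爱', '知乎', '真好', '微博']
and len(vocabulary) = 5, so each sentence is represented by a 5-dimensional vector:
Sentence 1 is represented as [1,1,2,1,0]  # sentence 1 does not contain '微博'
Sentence 2 is represented as [1,1,0,1,2]  # sentence 2 does not contain '知乎'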
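The two vectors above can be checked with a few lines of Python. This is only a minimal sketch: it assumes the sentences are already segmented into words (the slashes in the example), and the variable names are chosen here for illustration.

```python
# Pre-segmented sentences from the example (segmentation is given, not computed).
sentences = [
    ["我", "爱", "知乎", "知乎", "真好"],
    ["我", "爱", "微博", "微博", "真好"],
]

# Build the vocabulary in order of first appearance.
vocab = []
for tokens in sentences:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)

# For each sentence, count how often each vocabulary word occurs.
vectors = [[tokens.count(word) for word in vocab] for tokens in sentences]

print(vocab)
for vec in vectors:
    print(vec)
```

Each vector has one entry per vocabulary word, so its length always equals the vocabulary size, regardless of how long the sentence is.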
Python implementation
import numpy as np
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

# Step 1: Tokenize a sentence
def word_extraction(sentence):
    # Extract the words from a sentence
    words = sentence.split()
    stop_words = set(stopwords.words('english'))
    # Note: the stop-word check runs before lowercasing, so capitalized
    # stop words such as "The" and "I" are not filtered out
    cleaned_text = [w.lower() for w in words if w not in stop_words]
    return cleaned_text

# Step 2: Apply tokenization to all sentences
def tokenize(sentences):
    # Run step 1 on every sentence and build the vocabulary
    words = []
    for sentence in sentences:
        w = word_extraction(sentence)
        words.extend(w)
    words = sorted(list(set(words)))
    return words

# Step 3: Build vocabulary and generate vectors
def generate_bow(allsentences):
    vocab = tokenize(allsentences)
    print("Word List for Document \n{0} \n".format(vocab))
    for sentence in allsentences:
        words = word_extraction(sentence)
        bag_vector = np.zeros(len(vocab))
        for w in words:
            # Find the position of w in the vocabulary and increment its count
            for i, word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1
        print("{0}\n{1}\n".format(sentence, np.array(bag_vector)))

allsentences = ["Joe waited for the train", "The train was late", "Mary and Samantha took the bus",
                "I looked for Mary and Samantha at the bus station",
                "Mary and Samantha arrived at the bus station early but waited until noon for the bus"]
# generate_bow returns nothing, so the outer print adds a final "None" line
print(generate_bow(allsentences))
Output:
Word List for Document
['arrived', 'bus', 'early', 'i', 'joe', 'late', 'looked', 'mary', 'noon', 'samantha', 'station', 'the', 'took', 'train', 'waited']
Joe waited for the train
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
The train was late
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0.]
Mary and Samantha took the bus
[0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0.]
I looked for Mary and Samantha at the bus station
[0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0.]
Mary and Samantha arrived at the bus station early but waited until noon for the bus
[1. 2. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1.]
None
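In the output above, 'the' and 'i' remain in the word list because the stop-word check runs before lowercasing, and the trailing None comes from printing a function that returns nothing. A possible variant, sketched below, lowercases first, looks up positions through a dictionary instead of scanning the vocabulary, and returns the result. The names tokenize_lower and bow_matrix and the small STOP_WORDS set are assumptions made for this illustration; they are not part of the original code or of NLTK.

```python
from collections import Counter

# Hypothetical stop-word set for illustration only; NLTK's English list is much longer.
STOP_WORDS = {"for", "the", "was", "and", "i", "at", "but", "until"}


def tokenize_lower(sentence):
    """Lowercase first, then drop stop words, so 'The' and 'I' are filtered too."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]


def bow_matrix(sentences):
    """Return (vocab, matrix), where matrix[r][c] counts vocab[c] in sentences[r]."""
    tokenized = [tokenize_lower(s) for s in sentences]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    index = {word: i for i, word in enumerate(vocab)}  # word -> column position
    matrix = [[0] * len(vocab) for _ in sentences]
    for row, tokens in enumerate(tokenized):
        for word, count in Counter(tokens).items():
            matrix[row][index[word]] = count
    return vocab, matrix


vocab, matrix = bow_matrix(["Joe waited for the train", "The train was late"])
print(vocab)
for row in matrix:
    print(row)
```

Returning the vocabulary and the count matrix, rather than printing them, makes the result usable as input to a later step such as a classifier.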