chapter1 Introduction


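Generally, sentiment analysis is based on natural language processing, text analysis, and computational linguistics. Although the data can come from different sources, in this chapter we analyze sentiment in text data using two specific examples: one from movie critics, where the text is highly structured and carries semantic information; and another from social networks (tweets from Twitter, in this case), where the text is unstructured and users may use (and even abuse!) textual abbreviations.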



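Sentiment analysis is not a simple problem, and several challenges remain open:

  1. Identification of sarcasm: sometimes, without knowing the personality of the person, you cannot tell whether "bad" means bad or good.

  2. Lack of text structure: a tweet, for example, may contain abbreviations, and it may lack capitals or contain spelling, punctuation, and grammar errors, all of which make the text difficult to analyze.

  3. Many possible sentiment categories and degrees: positive versus negative is a simple analysis; what we would like to determine is how much hate there is in an opinion, how much happiness, how much sadness, and so on.

  4. Identification of the object of analysis: many concepts can appear in a text, and detecting which object an opinion is positive for and which it is negative for is an open issue. For example, if you say "She won him!", this means a positive sentiment for her and, at the same time, a negative sentiment for him.

  5. Subjective text: another open challenge is how to analyze very subjective sentences or paragraphs. Sometimes, even for humans, it is hard to agree on the sentiment of these highly subjective texts.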

chapter2 Data Cleaning


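Given the input text data in cell [1], the main task of data cleaning is to remove those characters that are considered noise in the data mining process, for instance comma or colon characters. Of course, in each particular data mining problem, different characters can be considered noise, depending on the final objective of the analysis. In our case, we consider that all punctuation characters should be removed, including other non-conventional symbols. To perform the data cleaning process and the text representation and analysis that follow, the examples in this chapter use the Natural Language Toolkit (NLTK) library.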

raw_docs = ["Here are some very simple basic sentences.", "They won't be very interesting, I'm afraid.", "The point of these examples is to _learn how basic text cleaning works_ on *very simple* data."]


from nltk.tokenize import word_tokenize
tokenized_docs = [word_tokenize(doc) for doc in raw_docs]
print(tokenized_docs)

[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences', '.'],
 ['They', 'wo', "n't", 'be', 'very', 'interesting', ',', 'I', "'m", 'afraid', '.'],
 ['The', 'point', 'of', 'these', 'examples', 'is', 'to', '_learn', 'how', 'basic', 'text', 'cleaning', 'works_', 'on', '*very', 'simple*', 'data', '.']]

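Thus, for each line of text in raw_docs, the word_tokenize function builds a list of tokens. Now we can, for instance, search the list for punctuation symbols and remove them. There are many ways to perform this step. Let us look at one possible solution using the string library.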

import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


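We can see that string.punctuation contains a set of common punctuation symbols. This list can be modified according to the symbols you want to remove. The next example uses the Regular Expressions (RE) package to show how punctuation symbols can be removed. Note that there are many other possible ways to remove the symbols, such as directly implementing a loop that compares the text position by position.

In the input cell [4], re.compile receives an "expression" built from the list of symbols contained in string.punctuation. Here, we do not intend to go deeply into the details of RE.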


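import re
import string

# Character class that matches any single punctuation symbol
regex = re.compile('[%s]' % re.escape(string.punctuation))

tokenized_docs_no_punctuation = []
for review in tokenized_docs:
    new_review = []
    for token in review:
        # Strip punctuation from the token; keep it only if something remains
        new_token = regex.sub('', token)
        if new_token != '':
            new_review.append(new_token)
    tokenized_docs_no_punctuation.append(new_review)
print(tokenized_docs_no_punctuation)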

[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences'],
 ['They', 'wo', 'nt', 'be', 'very', 'interesting', 'I', 'm', 'afraid'],
 ['The', 'point', 'of', 'these', 'examples', 'is', 'to', 'learn', 'how', 'basic', 'text', 'cleaning', 'works', 'on', 'very', 'simple', 'data']]
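The text notes that regular expressions are only one of several ways to remove punctuation. As a minimal sketch of an alternative that uses only the standard library, the following loop deletes punctuation with str.translate. The names PUNCT_TABLE and strip_punctuation are illustrative and do not come from the original code.

```python
import string

# Translation table that deletes every character listed in string.punctuation
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(tokens):
    """Remove punctuation characters from each token and drop tokens left empty."""
    cleaned = []
    for token in tokens:
        new_token = token.translate(PUNCT_TABLE)
        if new_token:
            cleaned.append(new_token)
    return cleaned

sample_tokens = ['They', 'wo', "n't", 'be', 'very', 'interesting',
                 ',', 'I', "'m", 'afraid', '.']
print(strip_punctuation(sample_tokens))
```

Both approaches treat the same set of symbols as noise, so the choice between them is mostly a matter of taste; the translation table avoids having to escape the symbols for the regular expression engine.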


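In many data mining systems for text analysis, another important step is stemming and lemmatizing. Morphology holds that words have a root form. If you want to get to the basic meaning of a word, you can try applying a stemmer or a lemmatizer. This step helps to reduce the dictionary size and, with it, the high dimensionality and sparsity of the resulting feature space. NLTK provides different ways of performing this step. When the porter.stem(word) method is used, the output is as follows: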

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
preprocessed_docs = []
for doc in tokenized_docs_no_punctuation:
    # Replace every token by its Porter stem
    final_doc = [porter.stem(word) for word in doc]
    preprocessed_docs.append(final_doc)
print(tokenized_docs_no_punctuation)
print(preprocessed_docs)

[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences'],
 ['They', 'wo', 'nt', 'be', 'very', 'interesting', 'I', 'm', 'afraid'],
 ['The', 'point', 'of', 'these', 'examples', 'is', 'to', 'learn', 'how', 'basic', 'text', 'cleaning', 'works', 'on', 'very', 'simple', 'data']]
[['Here', 'are', 'some', 'veri', 'simpl', 'basic', 'sentenc'],
 ['They', 'wo', 'nt', 'be', 'veri', 'interest', 'I', 'm', 'afraid'],
 ['The', 'point', 'of', 'these', 'exampl', 'is', 'to', 'learn', 'how', 'basic', 'text', 'clean', 'work', 'on', 'veri', 'simpl', 'data']]
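To make the effect on the dictionary size concrete, here is a deliberately crude sketch of suffix stripping. It is not Porter's algorithm, only a toy rule set (the function naive_stem and its suffix list are assumptions made for illustration), but it shows how several surface forms collapse into one entry.

```python
def naive_stem(word, suffixes=('ing', 's')):
    """Strip the first matching suffix; a toy rule set, far cruder than Porter's."""
    for suffix in suffixes:
        # Keep at least three characters so that short words such as 'is' stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

tokens = ['example', 'examples', 'work', 'works', 'working', 'clean', 'cleaning']
stems = [naive_stem(t) for t in tokens]
print(stems)
print(len(set(tokens)), 'distinct tokens ->', len(set(stems)), 'distinct stems')
```

A real stemmer applies many more rules and exceptions, but the consequence for the feature space is the same: fewer distinct terms, hence fewer and less sparse dimensions.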


import nltk

test_string = ("<p>While many of the stories tugged at the heartstrings, "
               "I never felt manipulated by the authors. (Note: Part of the "
               "reason why I don't like the 'Chicken Soup for the Soul' series "
               "is that I feel that the authors are just dying to make the "
               "reader clutch for the box of tissues.)</a>")
print('Original text:')
print(test_string)
print('Cleaned text:')
# clean_html is available in NLTK 2.x only; in NLTK 3.x the call raises
# NotImplementedError and recommends BeautifulSoup's get_text() instead
nltk.clean_html(test_string)

Original text:
<p>While many of the stories tugged at the heartstrings, I
never felt manipulated by the authors. (Note: Part of the
reason why I don't like the 'Chicken Soup for the Soul' series
is that I feel that the authors are just dying to make the
reader clutch for the box of tissues.)</a>
Cleaned text:
"While many of the stories tugged at the heartstrings, I never
felt manipulated by the authors. (Note: Part of the reason why
I don't like the 'Chicken Soup for the Soul' series is that I
feel that the authors are just dying to make the reader clutch
for the box of tissues.)"
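Where an HTML-cleaning helper is not available in the installed NLTK version, the same kind of tag removal can be sketched with Python's standard html.parser module. This is an illustrative substitute, not the NLTK implementation; the class TagStripper and the function strip_tags are names chosen for this sketch.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML fragment, discarding the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text between tags
        self.chunks.append(data)

def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)

sample = ("<p>While many of the stories tugged at the heartstrings, "
          "I never felt manipulated by the authors.</a>")
print(strip_tags(sample))
```

The parser tolerates the unbalanced tags of the sample, which matters for scraped text where markup is often malformed.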


chapter3 Text Representation

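In the previous chapter we analyzed different techniques for data cleaning, stemming, and lemmatizing, and we also filtered the text to remove tags that are unnecessary for the subsequent analysis. In order to analyze sentiment from text, the next step is to obtain a representation of the cleaned text. Although different text representations exist, the most common ones are variants of the Bag of Words (BoW) model. The basic idea is to think about word frequencies. If we can define a dictionary of the possible different words, the number of distinct words defines the length of the feature space used to represent each text.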



Next, we look at a special BoW, the vector space model of text: TF-IDF (term frequency-inverse document frequency). First, we need to count the terms of each document, that is, build its term frequency vector. See the code example below:

from collections import Counter

mydoclist = ['Mireia loves me more than Hector loves me',
             'Sergio likes me more than Mireia loves me',
             'He likes basketball more than football']

for doc in mydoclist:
    # Count how many times each word appears in the document
    tf = Counter()
    for word in doc.split():
        tf[word] += 1
    print(list(tf.items()))

[('Mireia', 1), ('loves', 2), ('me', 2), ('more', 1), ('than', 1), ('Hector', 1)]
[('Sergio', 1), ('likes', 1), ('me', 2), ('more', 1), ('than', 1), ('Mireia', 1), ('loves', 1)]
[('He', 1), ('likes', 1), ('basketball', 1), ('more', 1), ('than', 1), ('football', 1)]

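Here we have introduced a Python object called Counter. Counter is only available in Python 2.7 and higher. It is useful because it allows you to perform exactly this kind of task: counting in a loop. A Counter is a dictionary subclass for counting hashable objects. It is a collection in which elements are stored as dictionary keys and their counts are stored as dictionary values. Counts can be any integer value, including zero or negative numbers.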

c = Counter()            # a new, empty counter
c = Counter('gallahad')  # a new counter from an iterable

Counter objects have a dictionary interface, except that they return a zero count for missing items instead of raising a KeyError exception.

c = Counter(['eggs', 'ham'])
c['bacon']  # evaluates to 0: a missing key gives a zero count, not a KeyError
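The properties described above can be checked directly. The following short sketch uses only documented Counter methods; the variable name counts is arbitrary.

```python
from collections import Counter

counts = Counter('gallahad')      # count the characters of an iterable
print(counts['a'])                # 'a' occurs three times
print(counts['z'])                # a missing key reports 0 instead of raising KeyError
print('z' in counts)              # looking up a missing key does not insert it
print(counts.most_common(2))      # the two most frequent elements

counts.subtract({'g': 3})         # counts are allowed to drop below zero
print(counts['g'])
```

The zero default is what makes the `tf[word] += 1` idiom work without first checking whether the word has been seen.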


def build_lexicon(corpus):
    # define a set with all possible words included in
    # all the sentences or "corpus"
    lexicon = set()
    for doc in corpus:
        lexicon.update([word for word in doc.split()])
    return lexicon

def tf(term, document):
    return freq(term, document)

def freq(term, document):
    # number of times the term occurs in the document
    return document.split().count(term)

vocabulary = build_lexicon(mydoclist)
doc_term_matrix = []
print('Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']')
for doc in mydoclist:
    print('The doc is "' + doc + '"')
    tf_vector = [tf(word, doc) for word in vocabulary]
    tf_vector_string = ', '.join(format(count, 'd') for count in tf_vector)
    print('The tf vector for Document %d is [%s]'
          % ((mydoclist.index(doc) + 1), tf_vector_string))
    doc_term_matrix.append(tf_vector)
print('All combined, here is our master document term matrix: ')
print(doc_term_matrix)

Our vocabulary vector is [me, basketball, Mireia, football, likes, loves, Sergio, Hector, He, than, more]
The doc is "Mireia loves me more than Hector loves me"
The tf vector for Document 1 is [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]
The doc is "Sergio likes me more than Mireia loves me"
The tf vector for Document 2 is [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]
The doc is "He likes basketball more than football"
The tf vector for Document 3 is [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
All combined, here is our master document term matrix:
[[2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1], [2, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]]
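Because a Python set has no guaranteed iteration order, the column order of a term matrix built from a set-based vocabulary can change from one run to the next. A small variation, sketched here under the assumption that a stable column order is wanted, sorts the vocabulary first. The helper names build_sorted_vocabulary and tf_vector are illustrative only.

```python
mydoclist = ['Mireia loves me more than Hector loves me',
             'Sergio likes me more than Mireia loves me',
             'He likes basketball more than football']

def build_sorted_vocabulary(corpus):
    """Collect the distinct whitespace-separated words and sort them for a stable order."""
    return sorted({word for doc in corpus for word in doc.split()})

def tf_vector(document, vocabulary):
    """Raw term counts of the document, one entry per vocabulary word."""
    words = document.split()
    return [words.count(term) for term in vocabulary]

vocabulary = build_sorted_vocabulary(mydoclist)
doc_term_matrix = [tf_vector(doc, vocabulary) for doc in mydoclist]
print(vocabulary)
for row in doc_term_matrix:
    print(row)
```

The counts are the same as before; only the position of each word in the vector changes, which is harmless as long as every document uses the same vocabulary order.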

import math
import numpy as np

def l2_normalizer(vec):
    # divide each component by the Euclidean (L2) length of the vector
    denom = np.sum([el**2 for el in vec])
    return [(el / math.sqrt(denom)) for el in vec]

doc_term_matrix_l2 = []
for vec in doc_term_matrix:
    doc_term_matrix_l2.append(l2_normalizer(vec))
print('A regular old document term matrix: ')
print(np.matrix(doc_term_matrix))
print('\nA document term matrix with row-wise L2 norm:')
print(np.matrix(doc_term_matrix_l2))

A regular old document term matrix:
[[2 0 1 0 0 2 0 1 0 1 1]
 [2 0 1 0 1 1 1 0 0 1 1]
 [0 1 0 1 1 0 0 0 1 1 1]]

A document term matrix with row-wise L2 norm:
[[ 0.57735027 0. 0.28867513 0. 0. 0.57735027 0. 0.28867513 0. 0.28867513 0.28867513]
 [ 0.63245553 0. 0.31622777 0. 0.31622777 0.31622777 0.31622777 0. 0. 0.31622777 0.31622777]
 [ 0. 0.40824829 0. 0.40824829 0.40824829 0. 0. 0. 0.40824829 0.40824829 0.40824829]]
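A quick way to convince yourself that the normalization does what it claims is to check that a normalized row has length 1. The sketch below repeats the computation with the standard library only; l2_normalize is a stand-in name for this illustration, and the function assumes a vector that is not all zeros.

```python
import math

def l2_normalize(vec):
    """Divide every component by the Euclidean length of the vector (non-zero vectors only)."""
    norm = math.sqrt(sum(el ** 2 for el in vec))
    return [el / norm for el in vec]

row = [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]      # tf vector of the first document
normalized = l2_normalize(row)
length = math.sqrt(sum(el ** 2 for el in normalized))

print([round(el, 8) for el in normalized])
print(length)                                 # length of the normalized vector
```

Normalizing by length removes the effect of document size, so a long text and a short text with the same word proportions end up with the same vector.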


def numDocsContaining(word, doclist):
    # number of documents in which the word appears at least once
    doccount = 0
    for doc in doclist:
        if freq(word, doc) > 0:
            doccount += 1
    return doccount

def idf(word, doclist):
    n_samples = len(doclist)
    df = numDocsContaining(word, doclist)
    return np.log(n_samples / float(df))

my_idf_vector = [idf(word, mydoclist) for word in vocabulary]
print('Our vocabulary vector is [' + ', '.join(list(vocabulary)) + ']')
print('The inverse document frequency vector is ['
      + ', '.join(format(value, 'f') for value in my_idf_vector) + ']')

Our vocabulary vector is [me, basketball, Mireia, football,
likes, loves, Sergio, Hector, He, than, more]
The inverse document frequency vector is [0.405465, 1.098612,
0.405465, 1.098612, 0.405465, 0.405465, 1.098612, 1.098612,
1.098612, 0.000000, 0.000000]
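The values in this vector can be reproduced by hand from the definition implemented in the idf function. Writing N for the number of documents and df(t) for the number of documents that contain the term t (notation introduced here for convenience), the natural logarithm gives:

```latex
\begin{aligned}
\mathrm{idf}(t) &= \ln\frac{N}{\mathrm{df}(t)}, \qquad N = 3,\\
\mathrm{idf}(\text{me}) &= \ln\frac{3}{2} \approx 0.405465,\\
\mathrm{idf}(\text{basketball}) &= \ln\frac{3}{1} \approx 1.098612,\\
\mathrm{idf}(\text{than}) &= \ln\frac{3}{3} = 0.
\end{aligned}
```

A word that appears in every document, such as "than" or "more", therefore receives a weight of zero, while a word found in a single document receives the largest weight.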


def build_idf_matrix(idf_vector):
    # square matrix with the IDF weights on the diagonal and zeros elsewhere
    idf_mat = np.zeros((len(idf_vector), len(idf_vector)))
    np.fill_diagonal(idf_mat, idf_vector)
    return idf_mat

my_idf_matrix = build_idf_matrix(my_idf_vector)
print(my_idf_matrix)

[[ 0.40546511 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 1.09861229 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0.40546511 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 1.09861229 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0.40546511 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0.40546511 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 1.09861229 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 1.09861229 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 1.09861229 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]]

doc_term_matrix_tfidf = []
# performing tf-idf matrix multiplication
for tf_vector in doc_term_matrix:
    doc_term_matrix_tfidf.append(np.dot(tf_vector, my_idf_matrix))

# normalizing each tf-idf row to unit L2 length
doc_term_matrix_tfidf_l2 = []
for tf_vector in doc_term_matrix_tfidf:
    doc_term_matrix_tfidf_l2.append(l2_normalizer(tf_vector))
print(vocabulary)
# np.matrix() just to make it easier to look at
print(np.matrix(doc_term_matrix_tfidf_l2))

{'me', 'basketball', 'Mireia', 'football', 'likes', 'loves', 'Sergio', 'Hector', 'He', 'than', 'more'}
[[ 0.49474872 0. 0.24737436 0. 0. 0.49474872 0. 0.67026363 0. 0. 0. ]
 [ 0.52812101 0. 0.2640605 0. 0.2640605 0.2640605 0.71547492 0. 0. 0. 0. ]
 [ 0. 0.56467328 0. 0.56467328 0.20840411 0. 0. 0. 0.56467328 0. 0. ]]
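Multiplying a tf vector by a diagonal IDF matrix is the same as multiplying the two vectors element by element, which is easier to follow by hand. The sketch below recomputes the first row that way with the standard library only. The document frequencies are read off the three example sentences, assuming the vocabulary order me, basketball, Mireia, football, likes, loves, Sergio, Hector, He, than, more.

```python
import math

tf_row = [2, 0, 1, 0, 0, 2, 0, 1, 0, 1, 1]      # tf vector of Document 1
doc_freq = [2, 1, 2, 1, 2, 2, 1, 1, 1, 3, 3]    # documents containing each vocabulary word
n_docs = 3

idf = [math.log(n_docs / df) for df in doc_freq]
# Element-wise product: equivalent to multiplying by the diagonal IDF matrix
tfidf = [tf * weight for tf, weight in zip(tf_row, idf)]
# Row-wise L2 normalization
norm = math.sqrt(sum(value ** 2 for value in tfidf))
tfidf_l2 = [value / norm for value in tfidf]
print([round(value, 8) for value in tfidf_l2])
```

Note how "than" and "more", which occur in every document, drop to zero, while "Hector", which is specific to this document, becomes its largest component.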

10.3.1 Bi-Grams and n-Grams

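When using BoW, it is sometimes useful to introduce significant bi-grams into the model. Note that this example can be extended to n-grams. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words, and so on. The n-grams are typically collected from a text or speech corpus.

An n-gram of size 1 is referred to as a "uni-gram"; size 2 is a "bi-gram" (or, less commonly, a "digram"); size 3 is a "tri-gram". Larger sizes are sometimes referred to by the value of n, such as "four-gram" or "five-gram". These n-grams can be introduced into the BoW model simply by treating each different n-gram as a new position in the feature vector representation.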
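As a minimal sketch of how such n-grams can be extracted from a tokenized sentence, the following function slides a window of n items over the token list. The function name ngrams is chosen for this illustration.

```python
def ngrams(tokens, n):
    """Return the contiguous n-item sequences of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'He likes basketball more than football'.split()
print(ngrams(tokens, 1))   # uni-grams
print(ngrams(tokens, 2))   # bi-grams
print(ngrams(tokens, 3))   # tri-grams
```

Each distinct tuple can then be treated as one more entry of the dictionary, so that a bi-gram such as ('more', 'than') gets its own position in the feature vector alongside the single words.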

chapter4 Practical Cases



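The code reuses part of the previous data cleaning examples and reads the training and test data from the folders provided by the authors of the dataset. Then TF-IDF is computed, which performs all the steps mentioned previously for computing the feature space, the normalization, and the feature weights. Note that at the end of the script we train and test with two different state-of-the-art machine learning approaches: naive Bayes and support vector machines (SVM). The details of the methods and their parameters are beyond the scope of this chapter. The important point here is that the documents are represented in a feature space that can be used by different data mining tools.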

import re
import string
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.classify import NaiveBayesClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from unidecode import unidecode

def BoW(text):
    # Tokenizing text
    text_tokenized = [word_tokenize(doc) for doc in text]
    # Removing punctuation
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    tokenized_docs_no_punctuation = []
    for review in text_tokenized:
        new_review = []
        for token in review:
            new_token = regex.sub('', token)
            if new_token != '':
                new_review.append(new_token)
        tokenized_docs_no_punctuation.append(new_review)
    # Stemming and Lemmatizing
    porter = PorterStemmer()
    preprocessed_docs = []
    for doc in tokenized_docs_no_punctuation:
        final_doc = ''
        for word in doc:
            final_doc = final_doc + ' ' + porter.stem(word)
        preprocessed_docs.append(final_doc)
    return preprocessed_docs

# read your train text data here
# (targetTrain and targetTest, the class labels, are assumed to be loaded
# together with the texts)
textTrain = ReadTrainDataText()
preprocessed_docs = BoW(textTrain)  # for train data
# Computing TF-IDF word space
tfidf_vectorizer = TfidfVectorizer(min_df=1)
trainData = tfidf_vectorizer.fit_transform(preprocessed_docs)

textTest = ReadTestDataText()  # read your test text data here
prepro_docs_test = BoW(textTest)  # for test data
testData = tfidf_vectorizer.transform(prepro_docs_test)

# GaussianNB needs dense input; toarray() returns a plain ndarray, which
# recent scikit-learn versions accept (they reject the np.matrix from todense())
print('Training and testing on training Naive Bayes')
gnb = GaussianNB()
y_pred = gnb.fit(trainData.toarray(), targetTrain).predict(trainData.toarray())
print("Number of mislabeled training points out of a total %d points : %d"
      % (trainData.shape[0], (targetTrain != y_pred).sum()))
y_pred = gnb.fit(trainData.toarray(), targetTrain).predict(testData.toarray())
print("Number of mislabeled test points out of a total %d points : %d"
      % (testData.shape[0], (targetTest != y_pred).sum()))

print('Training and testing on train with SVM')
clf = svm.SVC()
clf.fit(trainData.toarray(), targetTrain)
y_pred = clf.predict(trainData.toarray())
print("Number of mislabeled test points out of a total %d points : %d"
      % (trainData.shape[0], (targetTrain != y_pred).sum()))

print('Testing on test with already trained SVM')
y_pred = clf.predict(testData.toarray())
print("Number of mislabeled test points out of a total %d points : %d"
      % (testData.shape[0], (targetTest != y_pred).sum()))

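Besides the machine learning tools provided by the scikit-learn module used in this example, NLTK also provides useful tools for learning from text, including a naive Bayes classifier. Another related package with similar functionality is TextBlob. The results of running the script are shown next: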
Training and testing on training Naive Bayes
Number of mislabeled training points out of a total 4313 points : 129
Number of mislabeled test points out of a total 6292 points : 2087
Training and testing on train with SVM
Number of mislabeled test points out of a total 4313 points : 1288
Testing on test with already trained SVM
Number of mislabeled test points out of a total 6292 points : 1680
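We can see that the training error of naive Bayes on the selected data is 129/4313, while in testing it is 2087/6292. Interestingly, the training error using SVM is higher (1288/4313), but it generalizes better on the test set than naive Bayes (1680/6292). Thus, it seems that naive Bayes overfits the data more (it selects particular features to better learn the training data, but this modifies the feature space so strongly that the damage cannot be recovered at test time, which reduces the generalization capability of the technique). Note, however, that this is a simple execution of standard methods on a subset of the provided dataset. More data, as well as many other aspects, will influence the performance. For instance, we could enrich the dictionary by introducing lists of positive and negative words that have already been studied (such as those provided at http://www.cs.uic.edu/-liub/BS/sentiment-analysis.html). For more details on the analysis of this dataset, see the references.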
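The comparison is easier to read as error rates. The following sketch simply converts the counts quoted above into percentages and reports the gap between test and training error for each classifier; the dictionary layout and the helper error_rate are choices made for this illustration.

```python
# (errors, total) pairs as reported in the text
results = {
    'Naive Bayes': {'train': (129, 4313), 'test': (2087, 6292)},
    'SVM': {'train': (1288, 4313), 'test': (1680, 6292)},
}

def error_rate(errors, total):
    """Fraction of mislabeled points."""
    return errors / total

for name, split in results.items():
    train_err = error_rate(*split['train'])
    test_err = error_rate(*split['test'])
    gap = test_err - train_err
    print('%-12s train %.1f%%  test %.1f%%  gap %+.1f points'
          % (name, 100 * train_err, 100 * test_err, 100 * gap))
```

A large positive gap between test and training error is the usual symptom of overfitting, which is what the naive Bayes figures show here, whereas the SVM errors on the two sets are of similar size.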


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn import svm

textTrain = ['I love this sandwich.', 'This is an amazing place!',
             'I feel very good about these beers.', 'This is my best work.',
             'What an awesome view', 'I do not like this restaurant',
             'I am tired of this stuff.', 'I can not deal with this',
             'He is my sworn enemy!',
             'My boss is']  # the last sentence is truncated in the source
targetTrain = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
tfidf_vectorizer = TfidfVectorizer(min_df=1)
# The argument is truncated in the source; the raw sentences are assumed here
# (BoW(textTrain) would add the cleaning and stemming steps)
trainData = tfidf_vectorizer.fit_transform(textTrain)

textTest = ['The beer was good.', 'I do not enjoy my job',
            'I aint feeling dandy today', 'I feel amazing!',
            'Gary is a friend of mine.', 'I can not believe I am doing this.']
targetTest = [0, 1, 1, 0, 0, 1]
testData = tfidf_vectorizer.transform(textTest)

# toarray() gives the dense ndarray that GaussianNB requires
print('Training and testing on test Naive Bayes')
gnb = GaussianNB()
y_pred = gnb.fit(trainData.toarray(), targetTrain).predict(trainData.toarray())
print("Number of mislabeled training points out of a total %d points : %d"
      % (trainData.shape[0], (targetTrain != y_pred).sum()))
y_pred = gnb.fit(trainData.toarray(), targetTrain).predict(testData.toarray())
print("Number of mislabeled test points out of a total %d points : %d"
      % (testData.shape[0], (targetTest != y_pred).sum()))

print('Training and testing on train with SVM')
clf = svm.SVC()
clf.fit(trainData.toarray(), targetTrain)
y_pred = clf.predict(trainData.toarray())
print("Number of mislabeled test points out of a total %d points : %d"
      % (trainData.shape[0], (targetTrain != y_pred).sum()))

print('Testing on test with already trained SVM')
y_pred = clf.predict(testData.toarray())
print("Number of mislabeled test points out of a total %d points : %d"
      % (testData.shape[0], (targetTest != y_pred).sum()))

Training and testing on test Naive Bayes
Number of mislabeled training points out of a total 10 points : 0
Number of mislabeled test points out of a total 6 points : 2
Training and testing on train with SVM
Number of mislabeled test points out of a total 10 points : 0
Testing on test with already trained SVM
Number of mislabeled test points out of a total 6 points : 2

chapter5 Conclusions



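The tools described in this chapter can provide a basis for dealing with more challenging problems. A recent example of current state-of-the-art research is the work of reference [3], where deep learning architectures are used for sentiment analysis. Deep learning is currently a powerful tool in fields such as pattern recognition, machine learning, and computer vision; the main deep learning strategies are based on neural network architectures. In reference [3], a deep learning model builds up a representation of whole sentences based on the sentence structure, and it computes the sentiment based on how words compose the meaning of longer phrases. In the methods explained in this chapter, n-grams are the only features that capture those semantics. For further discussion in this field, see references [4, 5].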


