2020 Machine Learning: Sentiment Analysis

Author: zidea | Published 2020-02-15 20:46

Sentiment Analysis

%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
# from scipy.spatial.distance import cdist
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, GRU, Embedding
from tensorflow.python.keras.optimizers import Adam
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
tf.keras.__version__
'2.2.4-tf'
import imdb

Note that imdb here is not a pip package; it appears to be the helper module from the Hvass Labs TensorFlow tutorials, which provides maybe_download_and_extract() and load_data().

Sentiment Analysis of Movie Reviews

The IMDB dataset consists of movie reviews, each labeled as either positive or negative. We will design a recurrent neural network and train it to read a review and infer whether its sentiment is positive or negative. The corpus is in English, but that does not affect the model design: once you understand the model, swapping in another language works the same way. Since free Chinese corpora of this kind are scarce, we use the IMDB corpus that is common in NLP.

imdb.maybe_download_and_extract()
- Download progress: 100.0%
Download finished. Extracting files.
Done.

Splitting the Dataset

Split the dataset into a training set and a test set:

x_train_text, y_train = imdb.load_data(train=True)
x_test_text, y_test = imdb.load_data(train=False)
print("Train-set size: ", len(x_train_text))
print("Test-set size:  ", len(x_test_text))
Train-set size:  25000
Test-set size:   25000
data_text = x_train_text + x_test_text
x_train_text[1]
'Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV\'s "Flamingo Road") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the Gateway to Hell! The scenes with Raines modeling are very well captured, the mood music is perfect, Deborah Raffin is charming as Cristina\'s pal, but when Raines moves into a creepy Brooklyn Heights brownstone (inhabited by a blind priest on the top floor), things really start cooking. The neighbors, including a fantastically wicked Burgess Meredith and kinky couple Sylvia Miles & Beverly D\'Angelo, are a diabolical lot, and Eli Wallach is great fun as a wily police detective. The movie is nearly a cross-pollination of "Rosemary\'s Baby" and "The Exorcist"--but what a combination! Based on the best-seller by Jeffrey Konvitz, "The Sentinel" is entertainingly spooky, full of shocks brought off well by director Michael Winner, who mounts a thoughtfully downbeat ending with skill. ***1/2 from ****'

Each x is a movie review; a label of 1 indicates a positive review and 0 a negative one.

y_train[1]
1.0

Tokenization

tensorflow.python.keras.preprocessing.text

Here we use the tokenizer provided by Keras. In English, a tokenizer is something that turns text into tokens, and the word token is worth understanding well: I first encountered it when working with ASTs, where a parser breaks a program into tokens, each a unit carrying some meaning, and assigns each an identifier. Here we keep only the 10,000 most frequent tokens, since our laptop's performance is limited.
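As a toy illustration of what the tokenizer does (a sketch; the two-sentence corpus here is made up), it assigns each word an integer ID ordered by frequency:

toy = Tokenizer(num_words=100)
toy.fit_on_texts(["the cat sat", "the dog sat down"])
print(toy.word_index)                           # e.g. {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4, 'down': 5}
print(toy.texts_to_sequences(["the dog sat"]))  # e.g. [[1, 4, 2]]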

num_words = 10000
tokenizer = Tokenizer(num_words=num_words)

With the tokenizer initialized, feed the corpus into it to build the vocabulary:

%%time
tokenizer.fit_on_texts(data_text)
CPU times: user 11.5 s, sys: 131 ms, total: 11.6 s
Wall time: 11.9 s
if num_words is None:
    num_words = len(tokenizer.word_index)  # fall back to the full vocabulary size

print(num_words)
10000

Now we can look at the vocabulary the tokenizer has collected. It is ordered by how often each word appears in the corpus, with the most frequent word getting index 1.

print(type(tokenizer.word_index))
<class 'dict'>

We can test this: in word_index the keys are words and the values are their unique integer IDs.

print(tokenizer.word_index['the'])
1
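Since lower indices mean more frequent words, we can sort word_index by its values to see the most common words in the corpus (a quick sketch):

top_words = sorted(tokenizer.word_index.items(), key=lambda kv: kv[1])[:10]
print(top_words)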
texts_to_sequences

This method converts each word in a text into its corresponding integer token:

x_train_tokens = tokenizer.texts_to_sequences(x_train_text)
x_train_text[1]
'Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV\'s "Flamingo Road") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the Gateway to Hell! The scenes with Raines modeling are very well captured, the mood music is perfect, Deborah Raffin is charming as Cristina\'s pal, but when Raines moves into a creepy Brooklyn Heights brownstone (inhabited by a blind priest on the top floor), things really start cooking. The neighbors, including a fantastically wicked Burgess Meredith and kinky couple Sylvia Miles & Beverly D\'Angelo, are a diabolical lot, and Eli Wallach is great fun as a wily police detective. The movie is nearly a cross-pollination of "Rosemary\'s Baby" and "The Exorcist"--but what a combination! Based on the best-seller by Jeffrey Konvitz, "The Sentinel" is entertainingly spooky, full of shocks brought off well by director Michael Winner, who mounts a thoughtfully downbeat ending with skill. ***1/2 from ****'
np.array(x_train_tokens[1])
array([1153,  182,   17, 1066,   16,  815, 1457,   18, 2602,   31, 7951,
        305,    4, 7888, 1210,   14,    3,  180,   18,  672, 8306, 2196,
         16,    3, 1861,   35,    6,    5,  969,   15,   40, 3138,   31,
          1,    5,  603,    1,  134,   16, 7951,   23,   52,   69, 1819,
          1, 1245,  207,    6,  399, 8161,    6, 1313,   14, 4989,   18,
         50, 7951, 1121,   82,    3,  977, 5103, 5663, 8770,   31,    3,
       1998, 1982,   20,    1,  342, 1862,  177,   62,  375, 6187,    1,
       5030,  585,    3, 8697, 3542, 8230,    2, 8458,  374, 7952, 2079,
       5422,   23,    3, 9719,  169,    2, 9935,    6,   78,  245,   14,
          3,  571, 1361,    1,   17,    6,  800,    3, 1631,    4, 8771,
        978,    2,    1, 5031,   18,   48,    3, 2174,  441,   20,    1,
        116,   31, 4423,    1, 8847,    6, 3746,  363,    4, 7017,  831,
        122,   69,   31,  164,  498, 2297,   35,    3, 9311,  272,   16,
       2788,  307,  230,   36])
print(tokenizer.word_index['movie'])
print(tokenizer.word_index['bizarre'])  # the tokenizer lowercases words
17
1153
x_test_tokens = tokenizer.texts_to_sequences(x_test_text)

A recurrent neural network can take sequences of arbitrary length as input, but to process data in whole batches the sequences must have the same length. There are two ways to achieve this:

  1. make all sequences in the dataset the same length, or
  2. write a custom data generator that ensures the sequences within each batch have the same length (see the sketch below).

How long the sequences should be depends on the task; a common starting point is the mean length.
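A minimal sketch of option 2 (hypothetical; this article uses option 1 instead): a generator that pads each batch only up to the longest sequence in that batch:

def batch_generator(token_lists, labels, batch_size=64):
    n = len(token_lists)
    while True:
        # Shuffle once per pass over the data.
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch_idx = idx[start:start + batch_size]
            batch = [token_lists[i] for i in batch_idx]
            # Pad only up to the longest sequence in this batch.
            batch_len = max(len(t) for t in batch)
            x = pad_sequences(batch, maxlen=batch_len, padding='pre')
            y = np.array([labels[i] for i in batch_idx])
            yield x, y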
num_tokens = [len(tokens) for tokens in x_train_tokens + x_test_tokens]
num_tokens = np.array(num_tokens)

Compute the mean length over all sequences:

np.mean(num_tokens)
221.27716

The maximum sequence length is then set to max_tokens = mean + 2 \times std:

max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens = int(max_tokens)
max_tokens
544
np.sum(num_tokens < max_tokens) / len(num_tokens)
0.94532
# Pad (and truncate) at the front of each sequence, so the real words
# sit at the end, closest to where the RNN produces its final output.
pad = 'pre'
x_train_pad = pad_sequences(x_train_tokens, maxlen=max_tokens,
                            padding=pad, truncating=pad)
x_test_pad = pad_sequences(x_test_tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)
x_train_pad.shape
(25000, 544)
x_test_pad.shape
(25000, 544)
np.array(x_train_tokens[1])
array([1153,  182,   17, 1066,   16,  815, 1457,   18, 2602,   31, 7951,
        305,    4, 7888, 1210,   14,    3,  180,   18,  672, 8306, 2196,
         16,    3, 1861,   35,    6,    5,  969,   15,   40, 3138,   31,
          1,    5,  603,    1,  134,   16, 7951,   23,   52,   69, 1819,
          1, 1245,  207,    6,  399, 8161,    6, 1313,   14, 4989,   18,
         50, 7951, 1121,   82,    3,  977, 5103, 5663, 8770,   31,    3,
       1998, 1982,   20,    1,  342, 1862,  177,   62,  375, 6187,    1,
       5030,  585,    3, 8697, 3542, 8230,    2, 8458,  374, 7952, 2079,
       5422,   23,    3, 9719,  169,    2, 9935,    6,   78,  245,   14,
          3,  571, 1361,    1,   17,    6,  800,    3, 1631,    4, 8771,
        978,    2,    1, 5031,   18,   48,    3, 2174,  441,   20,    1,
        116,   31, 4423,    1, 8847,    6, 3746,  363,    4, 7017,  831,
        122,   69,   31,  164,  498, 2297,   35,    3, 9311,  272,   16,
       2788,  307,  230,   36])
x_train_pad[1]
array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0, 1153,  182,   17, 1066,   16,  815, 1457,   18, 2602,   31,
       7951,  305,    4, 7888, 1210,   14,    3,  180,   18,  672, 8306,
       2196,   16,    3, 1861,   35,    6,    5,  969,   15,   40, 3138,
         31,    1,    5,  603,    1,  134,   16, 7951,   23,   52,   69,
       1819,    1, 1245,  207,    6,  399, 8161,    6, 1313,   14, 4989,
         18,   50, 7951, 1121,   82,    3,  977, 5103, 5663, 8770,   31,
          3, 1998, 1982,   20,    1,  342, 1862,  177,   62,  375, 6187,
          1, 5030,  585,    3, 8697, 3542, 8230,    2, 8458,  374, 7952,
       2079, 5422,   23,    3, 9719,  169,    2, 9935,    6,   78,  245,
         14,    3,  571, 1361,    1,   17,    6,  800,    3, 1631,    4,
       8771,  978,    2,    1, 5031,   18,   48,    3, 2174,  441,   20,
          1,  116,   31, 4423,    1, 8847,    6, 3746,  363,    4, 7017,
        831,  122,   69,   31,  164,  498, 2297,   35,    3, 9311,  272,
         16, 2788,  307,  230,   36], dtype=int32)

We also need a dict mapping the other way, with integer tokens as keys and words as values:

idx = tokenizer.word_index
inverse_map = dict(zip(idx.values(), idx.keys()))

The tokens_to_string method takes a sequence of integer tokens and returns the sentence made of the corresponding words. We can verify it on the sample below; note that the reconstruction will not exactly match the original text, because words outside the 10,000-word vocabulary were dropped by texts_to_sequences.

def tokens_to_string(tokens):
    # Map from tokens back to words.
    words = [inverse_map[token] for token in tokens if token != 0]
    
    # Concatenate all words.
    text = " ".join(words)

    return text
x_train_text[1]
'Bizarre horror movie filled with famous faces but stolen by Cristina Raines (later of TV\'s "Flamingo Road") as a pretty but somewhat unstable model with a gummy smile who is slated to pay for her attempted suicides by guarding the Gateway to Hell! The scenes with Raines modeling are very well captured, the mood music is perfect, Deborah Raffin is charming as Cristina\'s pal, but when Raines moves into a creepy Brooklyn Heights brownstone (inhabited by a blind priest on the top floor), things really start cooking. The neighbors, including a fantastically wicked Burgess Meredith and kinky couple Sylvia Miles & Beverly D\'Angelo, are a diabolical lot, and Eli Wallach is great fun as a wily police detective. The movie is nearly a cross-pollination of "Rosemary\'s Baby" and "The Exorcist"--but what a combination! Based on the best-seller by Jeffrey Konvitz, "The Sentinel" is entertainingly spooky, full of shocks brought off well by director Michael Winner, who mounts a thoughtfully downbeat ending with skill. ***1/2 from ****'
tokens_to_string(x_train_tokens[1])
"bizarre horror movie filled with famous faces but stolen by raines later of tv's road as a pretty but somewhat unstable model with a smile who is to pay for her attempted by the to hell the scenes with raines are very well captured the mood music is perfect deborah is charming as pal but when raines moves into a creepy brooklyn heights inhabited by a blind priest on the top floor things really start cooking the neighbors including a fantastically wicked meredith and kinky couple sylvia miles beverly are a diabolical lot and eli is great fun as a police detective the movie is nearly a cross of rosemary's baby and the exorcist but what a combination based on the best by jeffrey the sentinel is spooky full of shocks brought off well by director michael winner who a downbeat ending with skill 1 2 from"

Building the Recurrent Neural Network

model = Sequential()

Word Embeddings

Now we start building the RNN. The first layer is an Embedding layer. So far we have mapped each word to a unique integer token, so a text is represented as a sequence of integers; the Embedding layer converts each integer token into a vector of values. This is necessary because with a vocabulary of 10,000 words the integer tokens range from 0 to 10,000, and an RNN cannot work well on values spanning such a wide range. The embedding layer is trained as part of the RNN and learns to map words with similar semantic meanings to similar embedding vectors, as we will see further below.

First we define the size of the embedding vector for each integer token. Here we set it to 8, so each token is converted to a vector of length 8. The values of the embedding vector generally fall roughly between -1.0 and 1.0, though they may exceed that range somewhat.

In practice the embedding size is typically chosen between 100 and 300, but small values seem to work reasonably well for sentiment analysis.

embedding_size = 8

Add the Embedding layer and give it the name layer_embedding, so that we can retrieve its parameters by name later. input_length is the maximum input length max_tokens (544), and output_dim is embedding_size: after the embedding matrix is applied, each word is represented by an 8-dimensional vector instead of a single integer.

model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name='layer_embedding'))
WARNING:tensorflow:From /anaconda3/envs/tf_py36/lib/python3.6/site-packages/tensorflow/python/keras/initializers.py:119: calling RandomUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor

I have explained how the GRU works, in comparison with the LSTM, in an earlier article. Setting return_sequences=True makes the GRU layer output its state at every time step, so the whole output sequence can be fed into the next layer.

model.add(GRU(units=16, return_sequences=True))
WARNING:tensorflow:From /anaconda3/envs/tf_py36/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
model.add(GRU(units=8, return_sequences=True))
model.add(GRU(units=4))
model.add(Dense(1, activation='sigmoid'))
optimizer = Adam(lr=1e-3)

Since this is a binary classification problem, we use the binary_crossentropy loss.
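For a predicted probability \hat{y} and a true label y (0 or 1), the loss on one sample is loss = -(y \log \hat{y} + (1 - y) \log(1 - \hat{y})), so confident wrong predictions are penalized heavily.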

model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])
WARNING:tensorflow:From /anaconda3/envs/tf_py36/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
layer_embedding (Embedding)  (None, 544, 8)            80000     
_________________________________________________________________
gru (GRU)                    (None, 544, 16)           1200      
_________________________________________________________________
gru_1 (GRU)                  (None, 544, 8)            600       
_________________________________________________________________
gru_2 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense (Dense)                (None, 1)                 5         
=================================================================
Total params: 81,961
Trainable params: 81,961
Non-trainable params: 0
_________________________________________________________________
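As a quick sanity check on these numbers (a Keras GRU of this vintage, with reset_after=False, has 3 × units × (input_dim + units + 1) parameters):

layer_embedding: 10000 words × 8 dimensions = 80,000
gru:    3 × 16 × (8 + 16 + 1) = 1,200
gru_1:  3 × 8 × (16 + 8 + 1)  = 600
gru_2:  3 × 4 × (8 + 4 + 1)   = 156
dense:  4 weights + 1 bias    = 5

These add up to the 81,961 trainable parameters reported above.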

Now we can train the model, using 5% of the training data as a validation set. This gives a rough picture of how training is progressing and lets us watch whether the model is overfitting the training set.

%%time
model.fit(x_train_pad, y_train,
          validation_split=0.05, epochs=3, batch_size=64)
Train on 23750 samples, validate on 1250 samples
Epoch 1/3
23750/23750 [==============================] - 505s 21ms/sample - loss: 0.4807 - acc: 0.7503 - val_loss: 0.2968 - val_acc: 0.8904
Epoch 2/3
23750/23750 [==============================] - 761s 32ms/sample - loss: 0.2680 - acc: 0.8982 - val_loss: 0.1680 - val_acc: 0.9488
Epoch 3/3
23750/23750 [==============================] - 4845s 204ms/sample - loss: 0.2140 - acc: 0.9229 - val_loss: 0.3405 - val_acc: 0.8680
CPU times: user 40min 58s, sys: 7min 39s, total: 48min 37s
Wall time: 1h 41min 53s
<tensorflow.python.keras.callbacks.History at 0x138401fd0>
%%time
result = model.evaluate(x_test_pad, y_test)
25000/25000 [==============================] - 118s 5ms/sample - loss: 0.3330 - acc: 0.8696
CPU times: user 2min 3s, sys: 4.22 s, total: 2min 7s
Wall time: 1min 58s
print("Accuracy: {0:.2%}".format(result[1]))
Accuracy: 86.96%
%%time
y_pred = model.predict(x=x_test_pad[0:1000])
y_pred = y_pred.T[0]
CPU times: user 10.3 s, sys: 1.48 s, total: 11.8 s
Wall time: 7.23 s
cls_pred = np.array([1.0 if p>0.5 else 0.0 for p in y_pred])
cls_true = np.array(y_test[0:1000])
incorrect = np.where(cls_pred != cls_true)
incorrect = incorrect[0]
len(incorrect)
108
idx = incorrect[0]
idx
3
text = x_test_text[idx]
text
'This is the best 3-D experience Disney has at their themeparks. This is certainly better than their original 1960\'s acid-trip film that was in it\'s place, is leagues better than "Honey I Shrunk The Audience" (and far more fun), barely squeaks by the MuppetVision 3-D movie at Disney-MGM and can even beat the original 3-D "Movie Experience" Captain EO. This film relives some of Disney\'s greatest musical hits from Aladdin, The Little Mermaid, and others, and brought a smile to my face throughout the entire show. This is a totally kid-friendly movie too, unlike "Honey..." and has more effects than the spectacular "MuppetVision"'
y_pred[idx]
0.09833968
cls_true[idx]
1.0
Above, the model gave a probability of only 0.098 to a review whose true label is positive, one of the 108 misclassified examples. Now let's try the model on some new, hand-written reviews:

text1 = "This movie is fantastic! I really like it because it is so good!"
text2 = "Good movie!"
text3 = "Maybe I like this movie."
text4 = "Meh ..."
text5 = "If I were a drunk teenager then this movie might be good."
text6 = "Bad movie!"
text7 = "Not a good movie!"
text8 = "This movie really sucks! Can I get my money back please?"
texts = [text1, text2, text3, text4, text5, text6, text7, text8]
tokens = tokenizer.texts_to_sequences(texts)
tokens_pad = pad_sequences(tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)
tokens_pad.shape
(8, 544)
model.predict(tokens_pad)
array([[0.90905607],
       [0.81788987],
       [0.46912616],
       [0.7713511 ],
       [0.5090726 ],
       [0.27790606],
       [0.691438  ],
       [0.13297775]], dtype=float32)
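To read these probabilities as class labels we can threshold them at 0.5 (a small sketch; it simply re-runs the prediction):

for text, p in zip(texts, model.predict(tokens_pad).T[0]):
    print("{:.2f}  {}  {}".format(p, "pos" if p > 0.5 else "neg", text))

The clearly positive texts score high and the clearly negative ones low, while "Maybe I like this movie." lands near 0.5. Note that "Not a good movie!" still scores 0.69: the model apparently latches onto the word "good" and struggles with negation.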
# Retrieve the trained embedding layer by the name we gave it earlier.
layer_embedding = model.get_layer('layer_embedding')
weights_embedding = layer_embedding.get_weights()[0]

weights_embedding.shape
(10000, 8)
token_good = tokenizer.word_index['good']
token_good
49
token_great = tokenizer.word_index['great']
token_great
78
weights_embedding[token_good]
array([-0.05211709, -0.00953285, -0.03444127,  0.039264  ,  0.09406167,
       -0.01757899, -0.067539  ,  0.02221631], dtype=float32)
weights_embedding[token_great]
array([-0.18220197, -0.1531388 , -0.1142754 ,  0.0924244 ,  0.1220694 ,
       -0.12130003, -0.10198361,  0.1011835 ], dtype=float32)
token_bad = tokenizer.word_index['bad']
token_horrible = tokenizer.word_index['horrible']
weights_embedding[token_bad]
array([ 0.07193956,  0.13492817,  0.08987457, -0.07411115, -0.10967217,
        0.13347408,  0.11364663, -0.13038015], dtype=float32)
weights_embedding[token_horrible]
array([ 0.2255772 ,  0.13835084,  0.20364708, -0.15602347, -0.20101002,
        0.18828708,  0.14935762, -0.14687213], dtype=float32)
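Comparing the vectors, 'good' and 'great' point in roughly the same direction, while 'bad' and 'horrible' point the opposite way. The scipy import commented out at the top hints at one way to quantify this; here is a small sketch using cosine distance (assuming scipy is installed):

from scipy.spatial.distance import cdist

tokens = [token_good, token_great, token_bad, token_horrible]
vectors = weights_embedding[tokens]
# Pairwise cosine distances: values near 0 mean the embeddings point
# the same way, values near 2 mean they point in opposite directions.
print(cdist(vectors, vectors, metric='cosine').round(2))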

Finally, I hope you will follow our WeChat official account.
