NLP in N Days - Day 0902: Neural Sequence Models RNN and Its Variants LSTM and GRU

Author: 我的昵称违规了 | Published 2019-02-18 09:28

Note: this post follows《中文自然语言处理入门实战》. Many reposted copies of the course float around online; I bought mine on GitChat.

This lesson starts the deep-learning part of the course: RNNs and their variants LSTM and GRU. I have touched on these in other tutorials before, but only implemented a single LSTM and used an RNN to build a word-vector model for embedding prediction.

Lesson 9: Neural Sequence Models RNN and Its Variants LSTM and GRU
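Before the code, a quick refresher (this standard formulation is my own addition, not part of the course text): an LSTM cell keeps a cell state c_t alongside the hidden state h_t and updates both through three gates,

\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}

A GRU merges the forget and input gates into a single update gate and drops the separate cell state, which is why swapping the LSTM layer for a GRU later in this post is a one-line change.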

Text classification with LSTM

The tutorial again uses the same emergency-call record dataset; for now I simulate it with a spreadsheet.

Data preparation is the usual routine: read the data, segment the text, attach labels (this is a classification task), and shuffle the samples.

import random
import jieba
import pandas as pd

stopwords = pd.read_csv(r'C://Users//01//Desktop//stopwords.txt', index_col=False, quoting=3, sep="\t",
                        names=['stopword'], encoding='utf-8')
stopwords = stopwords['stopword'].values
print(stopwords)

erzi_data = pd.read_excel(r'C://Users//01//Desktop//randomdata.xlsx', sheet_name=0)
linju_data = pd.read_excel(r'C://Users//01//Desktop//randomdata.xlsx', sheet_name=1)
laogong_data = pd.read_excel(r'C://Users//01//Desktop//randomdata.xlsx', sheet_name=2)
laopo_data = pd.read_excel(r'C://Users//01//Desktop//randomdata.xlsx', sheet_name=3)
# print(laopo_data)

erzi = erzi_data.values.tolist()
nver = linju_data.values.tolist()  # note: sheet 1 is reused here as the "nver" class
laogong = laogong_data.values.tolist()
laopo = laopo_data.values.tolist()


# preprocess_text: segment each line with jieba and attach a class label
# content_lines: one of the lists converted above
# sentences: an (initially empty) list that collects the labelled samples
# category: the integer class label

def preprocess_text(content_lines, sentences, category):
    for line in content_lines:
        line = "".join(line)
        try:
            segs = jieba.lcut(line)
            segs = [v for v in segs if not str(v).isdigit()]  # drop digits
            segs = list(filter(lambda x: x.strip(), segs))  # drop empty / whitespace-only tokens
            segs = list(filter(lambda x: len(x) > 1, segs))  # drop single-character tokens
            segs = list(filter(lambda x: x not in stopwords, segs))  # remove stopwords
            sentences.append((" ".join(segs), category))  # attach the label
        except Exception:
            print(line)
            continue

# Call the function for each class to build the training data
sentences = []
preprocess_text(laogong, sentences, 0)
preprocess_text(laopo, sentences, 1)
preprocess_text(erzi, sentences, 2)
preprocess_text(nver, sentences, 3)

# Shuffle so the later train/validation/test split is more reliable
random.shuffle(sentences)

# All texts and their corresponding labels
all_texts = [sentence[0] for sentence in sentences]
all_labels = [sentence[1] for sentence in sentences]
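Since the simulated spreadsheet is tiny and possibly unbalanced, a quick sanity check on the preprocessing output is worthwhile. A minimal sketch I added (not part of the course), reusing the variables above:

from collections import Counter

# How many samples ended up in each class (0 = laogong, 1 = laopo, 2 = erzi, 3 = nver)
print(Counter(all_labels))

# Peek at one preprocessed sample: a space-joined token string plus its label
print(all_texts[0], all_labels[0])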

Building and training the LSTM

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, LSTM, GRU, Dense, Dropout

MAX_SEQUENCE_LENGTH = 100  # maximum sequence length
EMBEDDING_DIM = 200  # embedding dimension
VALIDATION_SPLIT = 0.16  # fraction of data used for validation
TEST_SPLIT = 0.2  # fraction of data used for testing

tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_texts)  # fit_on_texts (not fit_on_sequences) builds the vocabulary from raw text
sequences = tokenizer.texts_to_sequences(all_texts)
word_index = tokenizer.word_index

print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(all_labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# Split into train / validation / test sets
p1 = int(len(data) * (1 - VALIDATION_SPLIT - TEST_SPLIT))
p2 = int(len(data) * (1 - TEST_SPLIT))
x_train = data[:p1]
y_train = labels[:p1]
x_val = data[p1:p2]
y_val = labels[p1:p2]
x_test = data[p2:]
y_test = labels[p2:]

# LSTM model
model = Sequential()
model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(200, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dense(labels.shape[1], activation='softmax'))
model.summary()
# Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
print(model.metrics_names)
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, batch_size=128)
model.save('lstm.h5')

Model evaluation

print(model.evaluate(x_test, y_test))
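model.evaluate only returns the overall loss and accuracy. For per-class precision and recall, the test predictions can be run through scikit-learn; this is an extra step I added, not part of the course:

import numpy as np
from sklearn.metrics import classification_report

# Predicted class = argmax over the softmax outputs; true class = argmax of the one-hot labels
y_pred = np.argmax(model.predict(x_test), axis=1)
y_true = np.argmax(y_test, axis=1)
print(classification_report(y_true, y_pred))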

GRU model

The only change is to replace the LSTM layer with a GRU in the model definition; Keras makes the swap trivial.

# GRU model
model = Sequential()
model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(GRU(200, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dense(labels.shape[1], activation='softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
print(model.metrics_names)
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, batch_size=128)
model.save('gru.h5')  # use a different filename so the LSTM model saved above is not overwritten

print(model.evaluate(x_test, y_test))
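The console output below is from one run of the LSTM model (the summary names the layer lstm_1). Because the original code fitted the tokenizer with fit_on_sequences, the vocabulary was effectively empty, which is why the embedding layer shows only 200 parameters and accuracy stays stuck around the majority-class level of roughly 0.41; with the fit_on_texts fix above the numbers will differ.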
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 100, 200)          200       
_________________________________________________________________
lstm_1 (LSTM)                (None, 200)               320800    
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                12864     
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 260       
=================================================================
Total params: 334,124
Trainable params: 334,124
Non-trainable params: 0
_________________________________________________________________
['loss', 'acc']
Train on 93 samples, validate on 23 samples
Epoch 1/10
2019-02-18 09:16:44.521694: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2

93/93 [==============================] - 2s 17ms/step - loss: 1.3836 - acc: 0.2151 - val_loss: 1.3627 - val_acc: 0.3913
Epoch 2/10

93/93 [==============================] - 0s 5ms/step - loss: 1.3507 - acc: 0.4086 - val_loss: 1.3492 - val_acc: 0.3913
Epoch 3/10

93/93 [==============================] - 0s 5ms/step - loss: 1.3391 - acc: 0.4086 - val_loss: 1.3504 - val_acc: 0.3913
Epoch 4/10

93/93 [==============================] - 0s 5ms/step - loss: 1.3392 - acc: 0.4086 - val_loss: 1.3403 - val_acc: 0.3913
Epoch 5/10

93/93 [==============================] - 0s 5ms/step - loss: 1.3174 - acc: 0.4086 - val_loss: 1.3378 - val_acc: 0.3913
Epoch 6/10

93/93 [==============================] - 0s 4ms/step - loss: 1.3232 - acc: 0.4086 - val_loss: 1.3369 - val_acc: 0.3913
Epoch 7/10

93/93 [==============================] - 0s 4ms/step - loss: 1.3306 - acc: 0.4086 - val_loss: 1.3350 - val_acc: 0.3913
Epoch 8/10

93/93 [==============================] - 0s 4ms/step - loss: 1.3193 - acc: 0.4086 - val_loss: 1.3421 - val_acc: 0.3913
Epoch 9/10

93/93 [==============================] - 1s 6ms/step - loss: 1.3269 - acc: 0.4086 - val_loss: 1.3352 - val_acc: 0.3913
Epoch 10/10

93/93 [==============================] - 0s 5ms/step - loss: 1.3332 - acc: 0.4086 - val_loss: 1.3364 - val_acc: 0.3913

30/30 [==============================] - 0s 997us/step
[1.3347249031066895, 0.4000000059604645]
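To apply the saved model to a new record, the raw text has to go through the same jieba segmentation and the already-fitted tokenizer. A rough sketch (the sample sentence is invented, and the tokenizer/stopwords are assumed to still be in memory; in practice they would be saved, e.g. pickled, alongside the model):

import jieba
import numpy as np
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

model = load_model('lstm.h5')

text = "我老公喝了酒在家里打人"  # invented example of a new call record
segs = [w for w in jieba.lcut(text)
        if not w.isdigit() and len(w) > 1 and w not in stopwords]
seq = tokenizer.texts_to_sequences([" ".join(segs)])
x = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)

pred = int(np.argmax(model.predict(x), axis=1)[0])
print(pred)  # 0 = laogong, 1 = laopo, 2 = erzi, 3 = nver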
