Note: this article follows the course 《中文自然语言处理入门实战》 (A Hands-On Introduction to Chinese NLP). There are quite a few reposted copies of the course floating around online; I bought mine on GitChat.
This lesson opens the deep learning part with RNNs (the LSTM and GRU variants). I had touched on these in earlier tutorials, but only got as far as implementing a single LSTM and then using an RNN to build a word-vector model for word-embedding prediction.
Lesson 9: The Neural Sequence Model RNN and Its Variants LSTM and GRU
Text Classification with an LSTM
The tutorial again uses the police call-record data here; for now I'll just simulate it.
Data preparation is the usual routine: read the data, segment it with jieba, attach a label to each sample (this is a classification task), and shuffle everything randomly.
import random
import jieba
import pandas as pd
# Load the stop-word list (one word per line)
stopwords = pd.read_csv(r'C://Users//01//Desktop//stopwords.txt', index_col=False, quoting=3, sep="\t",
                        names=['stopword'], encoding='utf-8')
stopwords = stopwords['stopword'].values
print(stopwords)

# Each sheet of the workbook holds the records for one category
erzi_data = pd.read_excel(r'C://Users//01//Desktop//randomdata.xlsx', sheet_name=0)
nver_data = pd.read_excel(r'C://Users//01//Desktop//randomdata.xlsx', sheet_name=1)
laogong_data = pd.read_excel(r'C://Users//01//Desktop//randomdata.xlsx', sheet_name=2)
laopo_data = pd.read_excel(r'C://Users//01//Desktop//randomdata.xlsx', sheet_name=3)
# print(laopo_data)
erzi = erzi_data.values.tolist()
nver = nver_data.values.tolist()
laogong = laogong_data.values.tolist()
laopo = laopo_data.values.tolist()

# preprocess_text segments one category's records and attaches a label:
#   content_lines is one of the lists converted above,
#   sentences is an (initially empty) list collecting the labelled samples,
#   category is the class label.
def preprocess_text(content_lines, sentences, category):
    for line in content_lines:
        line = "".join(line)
        try:
            segs = jieba.lcut(line)
            segs = [v for v in segs if not str(v).isdigit()]         # drop pure digits
            segs = list(filter(lambda x: x.strip(), segs))           # drop whitespace-only tokens
            segs = list(filter(lambda x: len(x) > 1, segs))          # drop single-character tokens
            segs = list(filter(lambda x: x not in stopwords, segs))  # drop stop words
            sentences.append((" ".join(segs), category))             # attach the label
        except Exception:
            print(line)
            continue

# Call the function once per category to build the training data
sentences = []
preprocess_text(laogong, sentences, 0)
preprocess_text(laopo, sentences, 1)
preprocess_text(erzi, sentences, 2)
preprocess_text(nver, sentences, 3)

# Shuffle so the later split isn't ordered by class
random.shuffle(sentences)

# Separate the features from their labels
all_texts = [sentence[0] for sentence in sentences]
all_labels = [sentence[1] for sentence in sentences]
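Before moving on, it's worth printing a couple of samples to confirm the format; each entry should be a tuple of space-joined tokens and an integer label (a minimal check, output omitted):

# Peek at a few labelled samples: (space-joined tokens, integer label)
for text, label in sentences[:3]:
    print(label, text)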
Building and Training the LSTM
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, LSTM, GRU, Dropout, Dense

MAX_SEQUENCE_LENGTH = 100  # maximum sequence length
EMBEDDING_DIM = 200        # embedding dimension
VALIDATION_SPLIT = 0.16    # fraction used for validation
TEST_SPLIT = 0.2           # fraction used for testing

# Map each word to an integer index and pad all sequences to a fixed length
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_texts)
sequences = tokenizer.texts_to_sequences(all_texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(all_labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# Split into training / validation / test sets
p1 = int(len(data) * (1 - VALIDATION_SPLIT - TEST_SPLIT))
p2 = int(len(data) * (1 - TEST_SPLIT))
x_train = data[:p1]
y_train = labels[:p1]
x_val = data[p1:p2]
y_val = labels[p1:p2]
x_test = data[p2:]
y_test = labels[p2:]

# LSTM model
model = Sequential()
model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(200, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dense(labels.shape[1], activation='softmax'))
model.summary()

# Compile and train
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
print(model.metrics_names)
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, batch_size=128)
model.save('lstm.h5')
Model Evaluation
print(model.evaluate(x_test, y_test))
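With the model saved, a minimal inference sketch could look like the following; the sample sentence is made up, and it assumes the tokenizer and MAX_SEQUENCE_LENGTH defined above are still in scope:

import numpy as np
import jieba
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

model = load_model('lstm.h5')
# Hypothetical new record; for fidelity, apply the same digit/stop-word
# filtering as preprocess_text before predicting
new_text = " ".join(jieba.lcut("报警人称老公在家里又打又闹"))
seq = pad_sequences(tokenizer.texts_to_sequences([new_text]), maxlen=MAX_SEQUENCE_LENGTH)
probs = model.predict(seq)[0]
print('predicted class:', np.argmax(probs))  # 0=laogong, 1=laopo, 2=erzi, 3=nver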
GRU Model
All it takes is swapping LSTM for GRU in the model definition; Keras makes this very convenient.
# GRU model: identical to the LSTM version except for the recurrent layer
model = Sequential()
model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(GRU(200, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dense(labels.shape[1], activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
print(model.metrics_names)
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10, batch_size=128)
model.save('gru.h5')  # save under a different name so the LSTM model isn't overwritten
print(model.evaluate(x_test, y_test))
Console output from the LSTM run:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 100, 200)          200
_________________________________________________________________
lstm_1 (LSTM)                (None, 200)               320800
_________________________________________________________________
dropout_1 (Dropout)          (None, 200)               0
_________________________________________________________________
dense_1 (Dense)              (None, 64)                12864
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 260
=================================================================
Total params: 334,124
Trainable params: 334,124
Non-trainable params: 0
_________________________________________________________________
['loss', 'acc']
Train on 93 samples, validate on 23 samples
Epoch 1/10
2019-02-18 09:16:44.521694: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
93/93 [==============================] - 2s 17ms/step - loss: 1.3836 - acc: 0.2151 - val_loss: 1.3627 - val_acc: 0.3913
Epoch 2/10
93/93 [==============================] - 0s 5ms/step - loss: 1.3507 - acc: 0.4086 - val_loss: 1.3492 - val_acc: 0.3913
Epoch 3/10
93/93 [==============================] - 0s 5ms/step - loss: 1.3391 - acc: 0.4086 - val_loss: 1.3504 - val_acc: 0.3913
Epoch 4/10
93/93 [==============================] - 0s 5ms/step - loss: 1.3392 - acc: 0.4086 - val_loss: 1.3403 - val_acc: 0.3913
Epoch 5/10
93/93 [==============================] - 0s 5ms/step - loss: 1.3174 - acc: 0.4086 - val_loss: 1.3378 - val_acc: 0.3913
Epoch 6/10
93/93 [==============================] - 0s 4ms/step - loss: 1.3232 - acc: 0.4086 - val_loss: 1.3369 - val_acc: 0.3913
Epoch 7/10
93/93 [==============================] - 0s 4ms/step - loss: 1.3306 - acc: 0.4086 - val_loss: 1.3350 - val_acc: 0.3913
Epoch 8/10
93/93 [==============================] - 0s 4ms/step - loss: 1.3193 - acc: 0.4086 - val_loss: 1.3421 - val_acc: 0.3913
Epoch 9/10
93/93 [==============================] - 1s 6ms/step - loss: 1.3269 - acc: 0.4086 - val_loss: 1.3352 - val_acc: 0.3913
Epoch 10/10
93/93 [==============================] - 0s 5ms/step - loss: 1.3332 - acc: 0.4086 - val_loss: 1.3364 - val_acc: 0.3913
30/30 [==============================] - 0s 997us/step
[1.3347249031066895, 0.4000000059604645]