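This week's task: split the data using the standard ratio of training set : test set = 2 : 1, then evaluate the model.
First, split the dataset roughly along that ratio. Source data: 71,532 records; training set: 50,000 records; test set: 20,000 records (leaving 1,532 records unused).
The split is not applied to the raw dataset. Instead, it happens when batches are fed to the model. Because the LSTM has to loop over the training data many times, the batch index is computed with a modulo operation over the source data: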
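def next_batch(batch_size, step):
    # Cycle through the training batches (the first 50000 records).
    return vec_lists[step % (50000 // batch_size)], tag_lists[step % (50000 // batch_size)]

def test_Data(step):
    # Test batches start right after the training batches; batch_size is a module-level global.
    return vec_lists[50000 // batch_size + step % (20000 // batch_size)], tag_lists[50000 // batch_size + step % (20000 // batch_size)]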
Then evaluate batch by batch on the test set and compute the accuracy:
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

# x, y, weights, biases, RNN, lr, batch_size, n_steps, n_inputs and
# training_iters are defined elsewhere in the script.

if __name__ == "__main__":
    tag_list = init_Tag()
    vec_list = init_Vec()
    init_Data()
    pred = RNN(x, weights, biases)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=pred))
    train_op = tf.train.AdamOptimizer(lr).minimize(cost)
    correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
    # init = tf.initialize_all_variables()  # deprecated in TensorFlow
    # Use the following instead:
    init = tf.global_variables_initializer()
    acc_list = []  # per-batch test accuracy
    with tf.Session() as sess:
        sess.run(init)
        step = 0
        test_step = 0
        while step * batch_size <= training_iters:
            batch_xs, batch_ys = next_batch(batch_size, step)
            batch_xs = np.array(batch_xs)
            batch_xs = batch_xs.reshape([batch_size, n_steps, n_inputs])
            sess.run([train_op], feed_dict={
                x: batch_xs,
                y: batch_ys,
            })
            # On the last training step, run through the whole test set once.
            if step == training_iters // batch_size:
                while test_step * batch_size < 20000:
                    test_batch_xs, test_batch_ys = test_Data(test_step)
                    # Convert the test batch to the LSTM input shape.
                    test_batch_xs = np.array(test_batch_xs)
                    test_batch_xs = test_batch_xs.reshape([batch_size, n_steps, n_inputs])
                    a = sess.run(accuracy, feed_dict={
                        x: test_batch_xs,
                        y: test_batch_ys,
                    })
                    acc_list.append(a)
                    print(a)
                    test_step += 1
            step += 1
    print(" average num = ")
    print(sum(acc_list) / len(acc_list))
    print(" max num = ")
    print(np.max(acc_list))
Final result:
Average accuracy: 95.3%
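To make explicit how an average like this is formed, here is a minimal pure-Python sketch of the same computation: per-batch accuracy from argmax comparison, then the mean and maximum over batches. The function names and the toy scores are hypothetical and are not taken from the original script.

```python
def argmax(values):
    """Position of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def batch_accuracy(scores, one_hot_labels):
    """Fraction of rows whose predicted class matches the one-hot label."""
    correct = sum(
        1 for row, label in zip(scores, one_hot_labels)
        if argmax(row) == argmax(label)
    )
    return correct / len(scores)


def summarize(batch_accuracies):
    """Mean and maximum of the per-batch accuracies."""
    mean = sum(batch_accuracies) / len(batch_accuracies)
    return mean, max(batch_accuracies)


# Toy data: 4 samples, 2 classes (values invented for illustration).
scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
labels = [[0, 1], [1, 0], [1, 0], [1, 0]]
print(batch_accuracy(scores, labels))  # 3 of 4 correct
print(summarize([0.5, 1.0]))
```

The mean of per-batch accuracies equals the accuracy over the whole test set only when every batch has the same size, which holds here because each batch contains `batch_size` samples.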
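We were still not satisfied with this result. After a group discussion, we found that the problem was in our word vectors: we had been using vectors trained directly with off-the-shelf word2vec code, which does not vectorize Chinese text very well. We therefore replaced the word vector library with the Sogou word vector library and retrained the model on the regenerated word vectors.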
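As a rough sketch of what switching to pretrained vectors can look like, the example below parses vectors in the common word2vec text format (a header line with vocabulary size and dimension, then one word per line followed by its values). Whether the Sogou library is distributed in this format is an assumption, and the sample words, values, and function names are invented for illustration.

```python
import io


def load_text_vectors(handle):
    """Parse word2vec-style text vectors into a dict of word -> list of floats."""
    header = handle.readline().split()
    dim = int(header[1])  # header is assumed to be "<vocab_size> <dim>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors, dim


def lookup(vectors, dim, word):
    """Return the vector for a word, or a zero vector for unknown words."""
    return vectors.get(word, [0.0] * dim)


# Hypothetical sample data standing in for a real vector file.
SAMPLE = "2 3\n中国 0.1 0.2 0.3\n北京 0.4 0.5 0.6\n"
vectors, dim = load_text_vectors(io.StringIO(SAMPLE))
print(lookup(vectors, dim, "北京"))
print(lookup(vectors, dim, "未知"))
```

Mapping unknown words to a zero vector is one simple choice; the original post does not say how out-of-vocabulary words were handled.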