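step1 Prepare the corpus
- Read the JSON dataset, extract each question and its evidence passages, segment them into words, and save them to a file one sentence per line, ready for training word vectors.
- Remove stop words. Because the context has to be preserved, the stop-word list here consists mostly of punctuation marks and other special characters.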
# -*- coding:utf-8 -*-
import json
import os
import shutil
import jieba
import sys
data_set = "./dataset/me_train.json"
target = "./dataset/train_questions_with_evidence.txt"
stopwords_dict = "./dataset/stop_words_ch.txt"
def rm_stopwords(file_path, word_dict):
    """
    Remove stop words from {file_path}; the stop words are stored in the {word_dict} file.
    file_path: path of the file generated by the function splitwords.
        Each line of the file is formatted as <file_unique_id> <file_words>.
    word_dict: file containing the stop words, one stop word per line.
    output: file_path with the stop words removed, overwriting the original file.
    """
    # read the stop-word file into stop_dict
    stop_dict = {}
    with open(word_dict) as d:
        for word in d:
            stop_dict[word.strip("\n")] = 1
    # remove the tmp file if it exists
    if os.path.exists(file_path + ".tmp"):
        os.remove(file_path + ".tmp")
    print "now remove stop words in %s." % file_path
    # read the source file and remove the stop words from each line
    with open(file_path) as f1, open(file_path + ".tmp", "w") as f2:
        for line in f1:
            tmp_list = []  # words that are not in the stop dict
            words = line.split()
            for word in words:
                if word not in stop_dict:
                    tmp_list.append(word)
            words_without_stop = " ".join(tmp_list)
            to_write = words_without_stop + "\n"
            f2.write(to_write)
    # overwrite the original file with the filtered file
    shutil.move(file_path + ".tmp", file_path)
    print "stop words in %s have been removed." % file_path
with open(data_set, "r") as f, open(target, "w") as f2:
    data = json.load(f)
    count = 0  # number of lines written (questions plus evidences)
    for key, value in data.iteritems():
        # segment the question and write it as one line
        question = value["question"]
        words = jieba.cut(question, cut_all=False)
        # jieba yields unicode, so encode explicitly before writing
        f2.write((" ".join(words) + "\n").encode("utf-8"))
        # segment every evidence passage attached to the question
        for k, v in value["evidences"].iteritems():
            words2 = jieba.cut(v["evidence"], cut_all=False)
            f2.write((" ".join(words2) + "\n").encode("utf-8"))
            count += 1
        count += 1
print "all question num is %s" % count
rm_stopwords(target, stopwords_dict)
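The script above relies on a particular nesting in the JSON file: each top-level key maps to a record with a "question" string and an "evidences" dictionary whose values each hold an "evidence" string. The following is a minimal sketch of that layout and of the lines it produces. The record ids and the English text are invented for illustration, `corpus_lines` is a hypothetical helper, and `str.split` stands in for jieba so the snippet runs without the real dataset (for Chinese text a jieba tokenizer such as `jieba.lcut` would be passed instead). It is written for Python 3.

```python
import json

# Hypothetical record in the layout the script reads:
# {question_id: {"question": str, "evidences": {evidence_id: {"evidence": str}}}}
sample = json.loads("""
{
  "Q1": {
    "question": "who wrote Shiji",
    "evidences": {
      "Q1#0": {"evidence": "Shiji was written by Sima Qian"},
      "Q1#1": {"evidence": "Sima Qian was a Han dynasty historian"}
    }
  }
}
""")


def corpus_lines(data, tokenize=str.split):
    """Yield one space-separated line per question and per evidence."""
    for record in data.values():
        yield " ".join(tokenize(record["question"]))
        for evidence in record["evidences"].values():
            yield " ".join(tokenize(evidence["evidence"]))


lines = list(corpus_lines(sample))
for line in lines:
    print(line)
```

Each question is followed directly by its evidence passages, so one record with two evidences contributes three lines to the training file.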
step2 Train the word vectors
# -*- coding:utf-8 -*-
import logging
from gensim.models.word2vec import LineSentence, Word2Vec
import sys
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
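# stream the segmented corpus, one sentence per line
sentences = LineSentence("./dataset/train_questions_with_evidence.txt")
model = Word2Vec(sentences, min_count=1, iter=1000)
# save the model so that the similarity script can load it
model.save("./model/w2v.mod")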
- Word-vector dimensionality: size=100
- Context window: window=5
- Number of negative samples: negative=5
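To make the `window` and `negative` settings more concrete, here is a simplified sketch of what they control: which neighbours count as context for a word, and how many noise words are drawn for each true context word. This is an illustration written for Python 3, not gensim's implementation. `context_pairs` and `draw_negatives` are hypothetical helpers, and the 0.75 exponent on word frequency follows the original word2vec formulation.

```python
import random
from collections import Counter


def context_pairs(tokens, window=5):
    """Pair each word with the words at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def draw_negatives(counts, positive, negative=5, rng=random):
    """Draw `negative` noise words, never the true context word.

    Words are weighted by frequency ** 0.75 (assumption: the usual
    word2vec smoothing of the unigram distribution).
    """
    words = [w for w in counts if w != positive]
    weights = [counts[w] ** 0.75 for w in words]
    return rng.choices(words, weights=weights, k=negative)


tokens = "a b c d e f g h".split()
pairs = context_pairs(tokens, window=2)
counts = Counter(tokens)
noise = draw_negatives(counts, positive="b", negative=5, rng=random.Random(0))
print(pairs[:4])
print(noise)
```

A larger window gives every word more context pairs, and a larger `negative` means more noise words per pair, so both settings increase training cost.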
# -*- coding:utf-8 -*-
from gensim.models import Word2Vec
import numpy as np
import sys
import jieba
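source = "./dataset/train_questions.txt"
target = "./dataset/target.txt"
model = "./model/w2v.mod"
# pick 500 distinct line indices out of the 36190 questions
rand_i = set(np.random.choice(36190, size=500, replace=False).tolist())
# copy the sampled questions to the candidate file
with open(source) as f, open(target, "w") as f2:
    for count, line in enumerate(f):
        if count in rand_i:
            f2.write(line)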
class ResultInfo(object):
    def __init__(self, index, score, text):
        self.id = index
        self.score = score
        self.text = text
model_loaded = Word2Vec.load(model)
# load the candidate questions, one list of words per line
candidates = []
with open(target) as f:
    for line in f:
        candidates.append(line.decode("utf-8").strip().split())
while True:
    text = raw_input("input sentence: ").decode("utf-8")
    words = list(jieba.cut(text.strip(), cut_all=False))
    print len(words)
    # score every candidate against the input sentence
    res = []
    index = 0
    for candidate in candidates:
        # print candidate
        score = model_loaded.n_similarity(words, candidate)
        res.append(ResultInfo(index, score, " ".join(candidate)))
        index += 1
    res.sort(key=lambda x: x.score, reverse=True)
    # print the ten most similar candidates
    k = 0
    for i in res:
        k += 1
        print "text %s: %s, score : %s" % (i.id, i.text, i.score)
        if k > 9:
            break
Using TensorFlow backend.
input sentence: 编写史记的人受到了什么处罚
text 414: 司马迁 收到 了 什么 刑罚, score : 0.835533210003
text 437: 孔子 认为 可以 使人 温柔敦厚 的 儒家 经书 是 哪一部, score : 0.685654355168
text 36: 复活 是 谁 的 作品, score : 0.668847026927
text 158: 植物 人 的 神经系统 可能 没有 受到 损伤 的 部位 是, score : 0.666936575118
text 314: 中庸 是 谁 的 著作, score : 0.651872698188
text 487: 毛主席 的 战士 最 听 党 的话 这 首歌 反映 了 什么 地方 边防战士 的 生活, score : 0.643818242781
text 198: 少年 韩寒 中学 肄业 却 出 了 一本 叫做 三重门 的 书 这 本书 的 体裁 是, score : 0.640927475925
text 175: 孔子 创立 了 什么 学派, score : 0.639292126835
text 366: 锯子 是 谁 发明 的, score : 0.630026412448
text 363: 古人 对 幼年 的 儿童 的 代称 是, score : 0.627604867741
input sentence: 谁是百度的总裁
text 409: 百度 的 董事长 是 谁, score : 0.950410820636
text 337: 阿里巴巴 的 总裁 是 谁, score : 0.902061296606
text 330: 中国移动 老总 是 谁 啊, score : 0.817496001508
text 323: 火影忍者 谁 是 名人 的 爸爸, score : 0.760750518611
text 386: china 老大 是 谁, score : 0.758674075597
text 234: 不能 说 的 秘密 导演 是 谁, score : 0.757126307252
text 318: 李连杰 的 老婆 是 谁 呀, score : 0.743579774903
text 317: 姐 的 儿子 是 我 什么 呢, score : 0.726849299939
text 325: 中国 最后 一个 皇帝 是 谁 拜托 了 各位 谢谢, score : 0.726334050126
text 477: 刘备 的 爸爸 是 谁, score : 0.724550888419
input sentence:编写史记的人是谁
Building prefix dict from the default dictionary...
Loading model from cache C:\Users\AMINI~1\AppData\Local\Temp\jieba.cache
Loading model cost 1.038 seconds.
Prefix dict has been built succesfully.
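The scores in the output come from `n_similarity`, which compares two word lists by taking the cosine of their mean word vectors. The sketch below reproduces that idea in plain Python 3 with toy two-dimensional vectors invented for illustration. `mean_vector`, `cosine` and `rank` are hypothetical helpers, not gensim functions. One deliberate difference: words without a vector are skipped here, whereas depending on the gensim version `n_similarity` may raise a `KeyError` for a word that is not in the vocabulary.

```python
import math


def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that have one; None if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, candidates, vectors, top_k=10):
    """Return (score, index, text) for the top_k candidates, best first."""
    q = mean_vector(query, vectors)
    scored = []
    for index, candidate in enumerate(candidates):
        c = mean_vector(candidate, vectors)
        if q is not None and c is not None:
            scored.append((cosine(q, c), index, " ".join(candidate)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Toy vectors, invented for illustration only
vectors = {
    "baidu": [1.0, 0.0],
    "alibaba": [0.9, 0.1],
    "ceo": [0.0, 1.0],
    "poem": [-1.0, 0.2],
}
candidates = [["alibaba", "ceo"], ["poem"], ["baidu", "ceo"]]
results = rank(["baidu", "ceo", "unknown"], candidates, vectors, top_k=2)
for score, index, text in results:
    print("text %s: %s, score : %.4f" % (index, text, score))
```

Because the sentence vector is a plain average, word order is ignored and frequent function words pull unrelated sentences together, which is consistent with the many "... 是 谁" questions scoring highly in the second query above.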