Dataset: https://nlp.stanford.edu/sentiment/code.html
datasetSentences.txt format: sentence index | sentence text
datasetSplit.txt format: sentence index | split the sentence belongs to (1 = train, 2 = test, 3 = dev)
The train set has 8,544 sentences, dev has 1,101, and test has 2,210.
dictionary.txt format: sentence (or phrase) | phrase index
sentiment_labels.txt format: phrase index | sentiment score
There are 239,232 sentences and phrases in total.
Sentiment score to class mapping: [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0] correspond to the five sentiment classes.
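The bucketing above can be sketched as a small helper (the function name `score_to_class` is my own, not part of the dataset's tooling):

```python
def score_to_class(score):
    """Map a continuous sentiment score in [0, 1] to one of the five
    classes: [0, 0.2] -> 0, (0.2, 0.4] -> 1, ..., (0.8, 1.0] -> 4.
    Interval boundaries belong to the lower bucket."""
    for cls, upper in enumerate([0.2, 0.4, 0.6, 0.8, 1.0]):
        if score <= upper:
            return cls
    raise ValueError("score must lie in [0, 1]: %r" % score)

print(score_to_class(0.2))   # 0 (the boundary belongs to the lower bucket)
print(score_to_class(0.75))  # 3
```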
The goal is to turn this into one sentence paired with one score, split into a training set, a validation set, and a test set. The result differs slightly from the official splits: each split ends up a few sentences short, because some sentences in datasetSentences.txt contain special characters (for example in person names) that fail to match their entries in dictionary.txt. You can also patch those entries by hand if you want the full counts.
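One common cause of this mismatch is mojibake: some names in datasetSentences.txt are UTF-8 text that was decoded as Latin-1 (e.g. "CafÃ©" where dictionary.txt has "Café"). A sketch of a repair, assuming the mismatch is indeed this kind of double encoding (the helper name `fix_mojibake` is my own):

```python
def fix_mojibake(text):
    """Try to undo UTF-8 bytes that were mistakenly decoded as Latin-1
    (e.g. 'CafÃ©' -> 'Café'); return the input unchanged if the
    round trip fails, i.e. the text does not look like mojibake."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(fix_mojibake(u'CafÃ©'))  # Café
print(fix_mojibake(u'plain'))  # plain
```

Applying this to each sentence before the dictionary lookup should recover some of the dropped lines.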
The Python code is as follows:
# Copyright 2018 lww. All Rights Reserved.
# coding: utf-8
from __future__ import print_function
from __future__ import division
from __future__ import absolute_import


def delblankline(infile1, infile2, trainfile, validfile, testfile):
    """Split datasetSentences.txt into train/valid/test files
    according to the split labels in datasetSplit.txt."""
    with open(infile1, 'r') as info1, open(infile2, 'r') as info2, \
            open(trainfile, 'w') as train, open(validfile, 'w') as valid, \
            open(testfile, 'w') as test:
        lines1 = info1.readlines()
        lines2 = info2.readlines()
        # Start at 1 to skip the header line in both files.
        for i in range(1, len(lines1)):
            # Restore the bracket tokens used by the treebank tokenizer.
            t2 = lines1[i].replace("-LRB-", "(").replace("-RRB-", ")")
            k = lines2[i].strip().split(",")  # sentence_index,splitset_label
            t = t2.strip().split('\t')        # sentence_index<TAB>sentence
            if k[1] == '1':
                train.write(t[1] + "\n")
            elif k[1] == '2':
                test.write(t[1] + "\n")
            elif k[1] == '3':
                valid.write(t[1] + "\n")
    print("end")
def tag_sentiment(infile, infile0, infile1, infile2):
    """Attach the sentiment score from sentiment_labels.txt (infile) to every
    sentence in infile1, looking the phrase id up in dictionary.txt (infile0),
    e.g. tag_sentiment("sentiment_labels.txt", "dictionary.txt",
                       "train.txt", "train_final.txt")."""
    with open(infile, 'r') as info, open(infile0, 'r') as info0, \
            open(infile1, 'r') as info1, open(infile2, 'w') as info2:
        lines = info.readlines()
        lines0 = info0.readlines()
        lines1 = info1.readlines()
        # dictionary.txt: phrase|phrase_id (no header line)
        text2id = {}
        for line in lines0:
            s = line.strip().split("|")
            text2id[s[0]] = s[1]
        # sentiment_labels.txt: phrase_id|sentiment_score (skip the header)
        id2sentiment = {}
        for line in lines[1:]:
            s = line.strip().split("|")
            id2sentiment[s[0]] = s[1]
        for line in lines1:
            sentence = line.strip()
            if sentence not in text2id:
                # Caused by special characters that do not match dictionary.txt.
                print(sentence)
                continue
            sentiment_score = id2sentiment[text2id[sentence]]
            # One sentence line followed by one score line.
            info2.write(sentence + "\n" + sentiment_score + "\n")
    print("end")
delblankline("datasetSentences.txt", "datasetSplit.txt", "train.txt", "valid.txt", "test.txt")
# Build the raw train/test/valid splits:
# train has 8544 sentences, dev 1101, test 2210.
tag_sentiment("sentiment_labels.txt", "dictionary.txt", "train.txt", "train_final.txt")
tag_sentiment("sentiment_labels.txt", "dictionary.txt", "test.txt", "test_final.txt")
tag_sentiment("sentiment_labels.txt", "dictionary.txt", "valid.txt", "valid_final.txt")
# Attach the sentiment score to each sentence of each split.
# Because of the special-character mismatches, each split ends up
# a few sentences smaller than in the previous step.
After processing, the resulting files are train_final.txt, test_final.txt, and valid_final.txt.
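These files alternate sentence lines and score lines, so they can be read back in pairs. A minimal sketch (the `read_pairs` helper is my own, and an in-memory string stands in for an actual file):

```python
import io

def read_pairs(f):
    """Read a file where each sentence line is followed by its sentiment
    score line; return a list of (sentence, score) tuples."""
    lines = [line.strip() for line in f]
    return [(lines[i], float(lines[i + 1])) for i in range(0, len(lines), 2)]

sample = io.StringIO(u"a gripping drama\n0.875\na dull mess\n0.125\n")
print(read_pairs(sample))
# [('a gripping drama', 0.875), ('a dull mess', 0.125)]
```

Combined with the bucketing described earlier, each score can then be turned into one of the five class labels.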