
Sentiment Analysis with Deep Learning

Author: Cracks_Yi | Published 2018-12-28 18:59

    The basic approach to sentiment analysis with word vectors plus deep learning is:
    1. Train word vectors.
    2. Preprocess and segment each sentence into a sequence of words; fix a maximum sequence length (truncate longer sentences, pad shorter ones); assign each word an index and map it to its word vector.
    3. Define the network structure, e.g. one LSTM layer plus a fully connected layer, with dropout for better generalization, then start training.
    4. Tune the parameters while watching training and validation loss and accuracy. When validation accuracy starts falling rather than rising (not just as noise, and usually together with a rising validation loss), the model is starting to overfit: stop training, then retrain on all the data with that epoch/iteration count and those parameters.

    1. Word vectors
    The corpus and steps for training word vectors were covered in earlier articles; the sentiment analysis corpus can simply be added in and trained together, so the method and code are skipped here. Two points are worth noting: the word vector corpus should ideally come from the same domain as the sentiment being analyzed, and the more of it the better; and segmentation quality strongly affects the accuracy of the word vectors, so extra preprocessing such as removing stopwords and adding industry terms as a custom dictionary pays off. A training sketch is given below.
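    A minimal gensim training sketch might look like the following (the corpus path, the custom dictionary user_dict.txt and the stopword list stopwords.txt are placeholder names, not files from this article; size=256 and min_count=10 match the parameters assumed by the code later in this post):

    # Hypothetical sketch: train word2vec on a domain corpus segmented with jieba.
    # File names are placeholders; the gensim 3.x API (size=...) is assumed.
    import jieba
    from gensim.models import Word2Vec

    jieba.load_userdict("user_dict.txt")   # add domain terms to improve segmentation

    with open("stopwords.txt") as f:
        stopwords = set(line.strip() for line in f)

    sentences = []
    with open("corpus.txt") as f:
        for line in f:
            words = [w for w in jieba.lcut(line.strip()) if w and w not in stopwords]
            if words:
                sentences.append(words)

    # 256-dimensional vectors, keep only words appearing at least 10 times
    model = Word2Vec(sentences, size=256, window=5, min_count=10, workers=4)
    model.save("../models/word2vec.model")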

    2. From text to numbers
    The figure below illustrates the conversion from text to numbers. The word vectors give us a vocabulary in which every word has an index (for example, its position in the vocabulary plus 1), and that index points to the word's vector, forming an embedding matrix like the one in the figure. One special index, such as 0, must be reserved for out-of-vocabulary words. Segmenting a sentence yields a sequence of words; for example, "I thought the movie was incredible and inspiring" becomes "I", "thought", "the", "movie", "was", "incredible", "and", "inspiring", and mapping each word to its index gives the vector [41 804 201534 1005 15 7446 5 13767]. The input has to be converted to a fixed length (max_len), say 10; this sentence has only 8 words, so the remaining 2 positions are padded with 0. Looking up [41 804 201534 1005 15 7446 5 13767 0 0] in the embedding matrix then yields a tensor of shape [batch_size = 1, max_len = 10, word2vec_dimension = 50], which is the network input.


    Figure: from text to numeric input
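    A minimal sketch of this mapping (the vocabulary and indices here are a toy example, not the real embedding matrix):

    # Toy sketch of text -> index sequence -> padded input; indices are illustrative only.
    from keras.preprocessing import sequence

    w2indx = {"I": 41, "thought": 804, "the": 201534, "movie": 1005, "was": 15,
              "incredible": 7446, "and": 5, "inspiring": 13767}   # toy vocabulary
    max_len = 10

    words = "I thought the movie was incredible and inspiring".split()
    ids = [w2indx.get(w, 0) for w in words]   # out-of-vocabulary words map to the reserved index 0
    padded = sequence.pad_sequences([ids], maxlen=max_len, padding='post')
    print(padded)   # [[41 804 201534 1005 15 7446 5 13767 0 0]]
    # Looking each index up in the embedding matrix yields a tensor of shape
    # [batch_size=1, max_len=10, word2vec_dimension], which is the network input.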

    3. Network structure
    Step 2 defines the input; the output is a one-hot vector. For 3-class classification (positive, negative, neutral) the targets are [1 0 0], [0 1 0] and [0 0 1], and the softmax outputs can be read as per-class probabilities. For binary classification, 0/1 labels also work: a sigmoid maps the output into the 0-1 range, which can likewise be read as a probability. With input and output defined, the remaining step is to define the model/network structure and let it learn its parameters. The corpus here is small, so the model should be kept as simple as possible: one CNN layer (including pooling) or one RNN layer (LSTM, GRU, bidirectional LSTM), followed by a fully connected layer. In my experiments the bidirectional LSTM worked best, reaching over 95% accuracy on the test set. This matches the usual intuition: a CNN only extracts local windows of words without using the wider context, and a unidirectional LSTM processes the sentence left to right and cannot use information to its right, so a bi-LSTM adds a reverse pass over the sentence. A CNN baseline is sketched below for comparison.
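    For comparison, a minimal sketch of the 1D-CNN alternative mentioned above (Keras 2-style layer names; the filter count and kernel size are illustrative, and n_symbols / vocab_dim / embedding_weights / input_length are the same variables used in the training code of section 5):

    # Sketch of a simple text-CNN baseline; hyperparameters are illustrative only.
    from keras.models import Sequential
    from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dropout, Dense

    cnn = Sequential()
    cnn.add(Embedding(input_dim=n_symbols, output_dim=vocab_dim,
                      weights=[embedding_weights], input_length=input_length))
    cnn.add(Conv1D(filters=64, kernel_size=3, activation='relu'))   # local n-gram-like features
    cnn.add(GlobalMaxPooling1D())                                   # pool over the whole sentence
    cnn.add(Dropout(0.4))
    cnn.add(Dense(1, activation='sigmoid'))
    cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])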

    4. Training
    Split the data into a training and a validation set (a 0.2 ratio), train on the training set, and also compute loss and accuracy on the validation set. Normally the training loss keeps falling and accuracy keeps rising until convergence; the validation set behaves the same at first, but at some point its loss starts rising and its accuracy starts falling, which signals overfitting, so apply early stopping at that moment. Then retrain on the whole dataset with the same parameters to get the final model; a callback-based sketch follows.
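    In Keras this can be automated with an EarlyStopping callback; a minimal sketch (the monitored metric and the patience value are a judgment call):

    # Sketch: stop training when validation loss stops improving.
    from keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor='val_loss', patience=2, verbose=1)
    model.fit(x_train, y_train,
              batch_size=batch_size, nb_epoch=n_epoch,
              validation_split=0.2,        # hold out 20% of the training data for validation
              callbacks=[early_stop])
    # Note the epoch at which training stopped, then retrain on the full dataset for that many epochs.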

    5. Python code
    Keras training

    # -*- coding: utf-8 -*-
    import time
    import yaml
    import sys
    from sklearn.model_selection import train_test_split
    import multiprocessing
    import numpy as np
    from gensim.models import Word2Vec
    from gensim.corpora.dictionary import Dictionary
    
    from keras.preprocessing import sequence
    from keras.models import Sequential
    from keras.layers.embeddings import Embedding
    from keras.layers import Bidirectional
    from keras.layers.recurrent import LSTM
    from keras.layers.core import Dense, Dropout,Activation
    from keras.models import model_from_yaml
    np.random.seed(35)  # For Reproducibility
    import jieba
    import pandas as pd
    import sys
    sys.setrecursionlimit(1000000)
    # set parameters:
    vocab_dim = 256
    maxlen = 150
    batch_size = 32
    n_epoch = 5
    input_length = 150
    validation_rate = 0.2  # fraction of the data held out as the validation set (see section 4)
    cpu_count = multiprocessing.cpu_count()
    
    def read_txt(filename):
      f = open(filename)
      res = []
      for i in f:
        res.append(i.replace("\n",""))
      f.close()
      del(res[0])  # drop the first line (presumably a header)
      return res
    
    
    # load the training files
    def loadfile():
        neg = read_txt("./bida_neg.txt")
        pos = read_txt('./bida_pos.txt')
        combined=np.concatenate((pos, neg))
        y = np.concatenate((np.ones(len(pos),dtype=int), np.zeros(len(neg),dtype=int)))
        return combined,y
    
    # segment each sentence and strip newline characters
    def tokenizer(text):
        ''' Simple Parser converting each document to lower-case, then
            removing the breaks for new lines and finally splitting on the
            whitespace
        '''
        text = [jieba.lcut(document.replace('\n', '')) for document in text]
        return text
    
    def create_dictionaries(model=None,
                            combined=None):
        ''' Function does are number of Jobs:
            1- Creates a word to index mapping
            2- Creates a word to vector mapping
            3- Transforms the Training and Testing Dictionaries
        '''
        if (combined is not None) and (model is not None):
            gensim_dict = Dictionary()
            gensim_dict.doc2bow(model.wv.vocab.keys(),
                                allow_update=True)
            w2indx = {v: k+1 for k, v in gensim_dict.items()}  # index of every word with frequency above 10
            w2vec = {word: model[word] for word in w2indx.keys()}  # word vector of every word with frequency above 10
    
            def parse_dataset(combined):
                ''' Words become integers
                '''
                data=[]
                for sentence in combined:
                    new_txt = []
                    for word in sentence:
                        try:
                            new_txt.append(w2indx[word])
                        except:
                            new_txt.append(0)
                    data.append(new_txt)
                return data
            combined=parse_dataset(combined)
            combined = sequence.pad_sequences(combined, maxlen=maxlen)  # pad each sentence's index sequence to maxlen
            return w2indx, w2vec,combined
        else:
            print('No data provided...')
    
    
    
    def get_data(index_dict,word_vectors,combined,y):
    
        n_symbols = len(index_dict) + 1  # total number of indices; words with frequency below 10 get index 0, hence +1
        embedding_weights = np.zeros((n_symbols, vocab_dim))  # index 0 keeps an all-zero word vector
        for word, index in index_dict.items():  # starting from index 1, assign each word its word vector
            embedding_weights[index, :] = word_vectors[word]
        x_train, x_test, y_train, y_test = train_test_split(combined, y, test_size=validation_rate)
    
        return n_symbols,embedding_weights,x_train,y_train,x_test,y_test
    
    
    def word2vec_train(model, combined):
    
        index_dict, word_vectors,combined = create_dictionaries(model=model,combined=combined)
        return   index_dict, word_vectors, combined
    
    ## define the network structure
    def train_lstm(n_symbols,embedding_weights,x_train,y_train,x_test,y_test):
    
    
        model = Sequential()
        model.add(Embedding(output_dim=vocab_dim,
                            input_dim=n_symbols,
                            mask_zero=True,
                            weights=[embedding_weights],
                            input_length=input_length))  # Adding Input Length
     
        model.add(Bidirectional(LSTM(32, activation='sigmoid',inner_activation='sigmoid')))
    
        model.add(Dropout(0.4))
        model.add(Dense(1))
        model.add(Activation('sigmoid'))
    
        print('Compiling the Model...')
        model.compile(loss='binary_crossentropy',
                      optimizer='adam', metrics=['accuracy'])
    
        print("Train...")
        model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=n_epoch,verbose=1, validation_data=(x_test, y_test))
    
        print("Evaluate...")
        score = model.evaluate(x_test, y_test,
                                    batch_size=batch_size)
    
        yaml_string = model.to_yaml()
        with open('lstm_data/lstm.yml', 'w') as outfile:
            outfile.write( yaml.dump(yaml_string, default_flow_style=True) )
        model.save_weights('lstm_data/lstm.h5')
        print('Test score:', score)
    
    # train the model and save it
    def train():
        combined,y=loadfile()
        combined = tokenizer(combined)
        model = Word2Vec.load("../models/word2vec.model")
        index_dict, word_vectors,combined=create_dictionaries(model, combined)
        n_symbols,embedding_weights,x_train,y_train,x_test,y_test=get_data(index_dict, word_vectors,combined,y)
        train_lstm(n_symbols,embedding_weights,x_train,y_train,x_test,y_test)
    
    
    if __name__=='__main__':
        train()
    
    
    

    The code above is for binary classification, with the output mapped into the 0-1 range. For multi-class classification, replace the sigmoid activation with softmax, change loss='binary_crossentropy' to loss='categorical_crossentropy', and convert the labels with y = to_categorical(y, num_classes=classes), as sketched below.
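    A minimal sketch of those changes for three classes (positive / negative / neutral), replacing the final Dense(1) + sigmoid layers; the rest of the pipeline stays the same:

    # Sketch of the multi-class variant: one-hot labels, softmax output, categorical cross-entropy.
    from keras.utils import to_categorical

    num_classes = 3
    y = to_categorical(y, num_classes=num_classes)   # e.g. label 2 -> [0, 0, 1]

    model.add(Dense(num_classes))
    model.add(Activation('softmax'))                 # per-class probabilities
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])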

    Prediction

    # -*- coding: utf-8 -*-
    import time
    import yaml
    import sys
    from sklearn.model_selection import train_test_split
    import multiprocessing
    import numpy as np
    from gensim.models import Word2Vec
    from gensim.corpora.dictionary import Dictionary
    
    from keras.preprocessing import sequence
    from keras.models import Sequential
    from keras.layers.embeddings import Embedding
    from keras.layers.recurrent import LSTM
    from keras.layers.core import Dense, Dropout,Activation
    from keras.models import model_from_yaml
    import jieba
    import pandas as pd
    
    # set parameters:
    vocab_dim = 256
    maxlen = 150
    batch_size = 32
    n_epoch = 5
    input_length = 150
    cpu_count = multiprocessing.cpu_count()
    
    
    def init_dictionaries(w2v_model):
        gensim_dict = Dictionary()
        gensim_dict.doc2bow(w2v_model.wv.vocab.keys(),
                                allow_update=True)
        w2indx = {v: k+1 for k, v in gensim_dict.items()}
        w2vec = {word: w2v_model[word] for word in w2indx.keys()}
        return w2indx, w2vec
    
    def process_words(w2indx, words):
    
        temp = []
        for word in words:
            try:
                temp.append(w2indx[word])
            except:
                temp.append(0)
        res = sequence.pad_sequences([temp], maxlen = maxlen)
        return res
    
    def input_transform(string, w2index):
        words=jieba.lcut(string)
        return process_words(w2index, words)
    
    
    def load_model():
        print('loading model......')
        with open('lstm_data/lstm.yml', 'r') as f:
            yaml_string = yaml.load(f)
        model = model_from_yaml(yaml_string)
    
        model.load_weights('lstm_data/lstm.h5')
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',metrics=['accuracy'])
    
        w2v_model=Word2Vec.load('../models/word2vec.model')
        return model,w2v_model
    
    def lstm_predict(string, model, w2index):
        data=input_transform(string, w2index)
        data = data.reshape(1, -1)  # pad_sequences already returns shape (1, maxlen)
        result=model.predict_classes(data)
        prob = model.predict_proba(data)
        print(string)
        print("prob:" + str(prob))
        if result[0][0]==1:
            #print(string,' positive')
            return 1
        else:
            #print(string,' negative')
            return -1
    
    if __name__=='__main__':
        model,w2v_model = load_model()
        w2index, _ = init_dictionaries(w2v_model)
        lstm_predict("平安大跌", model, w2index)
    
    

    TensorFlow training

    # -*- coding: utf-8 -*-
    
    from gensim.corpora import Dictionary
    from gensim.models import Word2Vec
    import numpy as np
    from random import randint
    from sklearn.model_selection import train_test_split
    import tensorflow as tf
    import jieba
    
    def read_txt(filename):
      f = open(filename)
      res = []
      for i in f:
        res.append(i.replace("\n",""))
      f.close()
      del(res[0])  # drop the first line (presumably a header)
      return res
    
    
    def loadfile():
        neg = read_txt("../data/bida_neg.txt")
        pos = read_txt('../data/bida_pos.txt')
        combined=np.concatenate((pos, neg))
        y = np.concatenate((np.ones(len(pos),dtype=int),np.zeros(len(neg),dtype=int)))
    
        return combined,y
    
    
    def create_dictionaries(model=None):
    
        if model is not None:
            gensim_dict = Dictionary()
            gensim_dict.doc2bow(model.wv.vocab.keys(),
                                allow_update=True)
            w2index = {v: k+1 for k, v in gensim_dict.items()}
            vectors = np.zeros((len(w2index) + 1, num_dimensions), dtype='float32')
            for k, v in gensim_dict.items():
                vectors[k+1] = model[v]
    
        return w2index, vectors
    
    def get_train_batch(batch_size):
        labels = []
        arr = np.zeros([batch_size, max_seq_length])
        for i in range(batch_size):
            num = randint(0,len(X_train) - 1)
            labels.append(y_train[num])
            arr[i] = X_train[num]
    
        return arr, labels
    
    def get_test_batch(batch_size):
        labels = []
        arr = np.zeros([batch_size, max_seq_length])
        for i in range(batch_size):
            num = randint(0,len(X_test) - 1)
            labels.append(y_test[num])
            arr[i] = X_test[num]
        return arr, labels
    
    def get_all_batches(batch_size = 32, mode = "train"):
        X, y = None, None
        if mode == "train":
            X = X_train
            y = y_train
        elif mode == "test":
            X = X_test
            y = y_test
    
        batches = int(len(y)/batch_size)
        arrs = [X[i*batch_size:i*batch_size + batch_size] for i in range(batches)]
        arrs.append(X[batches*batch_size:len(y)])
        labels = [y[i*batch_size:i*batch_size + batch_size] for i in range(batches)]
        labels.append(y[batches*batch_size:len(y)])
        return arrs, labels
    
    def parse_dataset(sentences, w2index, max_len):
        data=[]
        for sentence in sentences:
            words = jieba.lcut(sentence.replace('\n', ''))
            new_txt = np.zeros((max_len), dtype='int32')
            index = 0
            for word in words:
                try:
                    new_txt[index] = w2index[word]
                except:
                    new_txt[index] = 0
    
                index += 1
                if index >= max_len:
                    break
    
            data.append(new_txt)
        return data
    
    batch_size = 32
    lstm_units = 64
    num_classes = 2
    iterations = 50000
    num_dimensions = 256
    max_seq_len = 150
    max_seq_length = 150
    validation_rate = 0.2
    random_state = 9876
    output_keep_prob = 0.5
    learning_rate = 0.001
    
    combined, y = loadfile()
    
    model = Word2Vec.load("../models/word2vec.model")
    w2index, vectors = create_dictionaries(model)
    
    X = parse_dataset(combined, w2index, max_seq_len)
    y = [[1,0] if yi == 1 else [0,1] for yi in y]
    
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_rate, random_state=random_state)
    
    
    tf.reset_default_graph()
    
    labels = tf.placeholder(tf.float32, [None, num_classes])
    input_data = tf.placeholder(tf.int32, [None, max_seq_length])
    
    # look up the pre-trained word vectors for each index in the batch
    data = tf.nn.embedding_lookup(vectors, input_data)
    
    #bidirectional lstm
    lstm_fw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
    lstm_fw = tf.contrib.rnn.DropoutWrapper(cell=lstm_fw, output_keep_prob=output_keep_prob)
    lstm_bw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
    lstm_bw = tf.contrib.rnn.DropoutWrapper(cell=lstm_bw, output_keep_prob=output_keep_prob)
    (output_fw, output_bw),_ = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw, cell_bw=lstm_bw,inputs = data, dtype=tf.float32)
    
    outputs = tf.concat([output_fw, output_bw], axis=2)
    # Fully connected layer.
    weight = tf.get_variable(name="W", shape=[2 * lstm_units, num_classes],
                    dtype=tf.float32)
    
    bias = tf.get_variable(name="b", shape=[num_classes], dtype=tf.float32,
                    initializer=tf.zeros_initializer())
    
    last = tf.transpose(outputs, [1,0,2])
    last = tf.gather(last, int(last.get_shape()[0]) - 1)
    
    logits = (tf.matmul(last, weight) + bias)
    prediction = tf.nn.softmax(logits)
    
    correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
    accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
    
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
    
    sess = tf.InteractiveSession()
    saver = tf.train.Saver()
    sess.run(tf.global_variables_initializer())
    
    
    cal_iter = 500
    
    
    loss_train, loss_test = 0.0, 0.0
    acc_train, acc_test = 0.0, 0.0
    print("start training...")
    for i in range(iterations):
    
        # next batch of reviews
        next_batch, next_batch_labels = get_train_batch(batch_size)
        sess.run(optimizer, {input_data: next_batch, labels: next_batch_labels})
    
        # every cal_iter iterations, save the network and report train/test metrics
    
        if (i % cal_iter == 0):
            save_path = saver.save(sess, "models/pretrained_lstm.ckpt")
            print("iteration: " + str(i))
            train_acc, train_loss = 0.0, 0.0
            test_acc, test_loss = 0.0, 0.0
            train_arrs, train_labels = get_all_batches(300)
            test_arrs, test_labels = get_all_batches(300, "test")
    
            for k in range(len(train_labels)):
                temp1, temp2 = sess.run([accuracy, loss], {input_data: train_arrs[k], labels : train_labels[k]})
                train_acc += temp1
                train_loss += temp2
            train_acc /= len(train_labels)
            train_loss /= len(train_labels)
    
            for k in range(len(test_labels)):
                temp1, temp2 = sess.run([accuracy, loss], {input_data: test_arrs[k], labels : test_labels[k]})
                test_acc += temp1
                test_loss += temp2
            test_acc /= len(test_labels)
            test_loss /= len(test_labels)
    
            print("train accuracy: " + str(train_acc) + ", train loss: " + str(train_loss))
            print("test accucary: " + str(test_acc) + ", test loss: " + str(test_loss))
    
              
    
    

    Prediction

    import tensorflow as tf
    from gensim.models import Word2Vec
    from gensim.corpora.dictionary import Dictionary
    import numpy as np
    import jieba
    
    def create_dictionaries(model=None):
    
        if model is not None:
            gensim_dict = Dictionary()
            gensim_dict.doc2bow(model.wv.vocab.keys(),
                                allow_update=True)
            w2index = {v: k+1 for k, v in gensim_dict.items()}
            vectors = np.zeros((len(w2index) + 1, num_dimensions), dtype='float32')
            for k, v in gensim_dict.items():
                vectors[k+1] = model[v]
    
        return w2index, vectors
    
    def parse_dataset(sentence, w2index, max_len):
        words = jieba.lcut(sentence.replace('\n', ''))
        new_txt = np.zeros((max_len), dtype='int32')
        index = 0
        for word in words:
                try:
                    new_txt[index] = w2index[word]
                except:
                    new_txt[index] = 0
    
                index += 1
                if index >= max_len:
                    break
    
        return [new_txt]
    
    batch_size = 32
    lstm_units = 64
    num_classes = 2
    iterations = 100000
    num_dimensions = 256
    max_seq_len = 150
    max_seq_length = 150
    validation_rate = 0.2
    random_state = 333
    output_keep_prob = 1.0  # keep dropout disabled at inference time
    
    
    model = Word2Vec.load("../models/word2vec.model")
    w2index, vectors = create_dictionaries(model)
    
    
    
    
    tf.reset_default_graph()
    
    labels = tf.placeholder(tf.float32, [None, num_classes])
    input_data = tf.placeholder(tf.int32, [None, max_seq_length])
    
    # look up the pre-trained word vectors for each index in the batch
    data = tf.nn.embedding_lookup(vectors, input_data)
    
    
    """
    bi-lstm
    """
    #bidirectional lstm
    lstm_fw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
    lstm_fw = tf.contrib.rnn.DropoutWrapper(cell=lstm_fw, output_keep_prob=output_keep_prob)
    lstm_bw = tf.contrib.rnn.BasicLSTMCell(lstm_units)
    lstm_bw = tf.contrib.rnn.DropoutWrapper(cell=lstm_bw, output_keep_prob=output_keep_prob)
    (output_fw, output_bw),_ = tf.nn.bidirectional_dynamic_rnn(cell_fw=lstm_fw, cell_bw=lstm_bw,inputs = data, dtype=tf.float32)
    
    outputs = tf.concat([output_fw, output_bw], axis=2)
    
    # Fully connected layer.
    weight = tf.get_variable(name="W", shape=[2 * lstm_units, num_classes],
                    dtype=tf.float32)
    
    bias = tf.get_variable(name="b", shape=[num_classes], dtype=tf.float32,
                    initializer=tf.zeros_initializer())
    
    #last = tf.reshape(outputs, [-1, 2 * lstm_units])
    last = tf.transpose(outputs, [1,0,2])
    last = tf.gather(last, int(last.get_shape()[0]) - 1)
    
    logits = (tf.matmul(last, weight) + bias)
    prediction = tf.nn.softmax(logits)
    
    
    correctPred = tf.equal(tf.argmax(prediction,1), tf.argmax(labels,1))
    accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))
    
    sess = tf.InteractiveSession()
    saver = tf.train.Saver()
    #saver.restore(sess, 'models/pretrained_lstm.ckpt-27000.data-00000-of-00001')
    saver.restore(sess, tf.train.latest_checkpoint('models'))
    
    
    l = ["平安银行大跌", "平安银行暴跌", "平安银行扭亏为盈","小米将加深与TCL合作",
    "苹果手机现在卖的不如以前了","苹果和三星的糟糕业绩预示着全球商业领域将经历更加严
    峻的考验。"
    ,"这道菜不好吃"]
    for s in l:
        print(s)
        X = parse_dataset(s, w2index, max_seq_len)
        predictedSentiment = sess.run(prediction, {input_data: X})[0]
        print(predictedSentiment[0], predictedSentiment[1])
    
    

    References:
    https://github.com/adeshpande3/LSTM-Sentiment-Analysis/blob/master/Oriole%20LSTM.ipynb
    https://buptldy.github.io/2016/07/20/2016-07-20-sentiment%20analysis/
    http://colah.github.io/posts/2015-08-Understanding-LSTMs/
    http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
    https://arxiv.org/abs/1408.5882
