Sequence Labeling for Chinese NLP with Deep Learning

Author: demobin | Published 2016-05-16 10:13

    Introduction to Deep Learning

    There is plenty of material on deep learning, so it will not be covered in depth here; this article introduces the general approach to sequence labeling for Chinese NLP.

    Machine Learning vs. Deep Learning

    Put simply, machine learning learns a model from samples (i.e. data) and then uses that model to make predictions.
    There are many ML algorithms: Naive Bayes, Decision Tree, Support Vector Machine, Logistic Regression, Conditional Random Field, and so on.
    Deep learning, put simply, is a perceptron with multiple hidden layers.
    DL also comes in many variants, but knowing the Convolutional Neural Network and the Recurrent Neural Network is generally enough to get started (you should eventually learn the rest; these two are simply where to focus early on).
    The main difference: ML is shallow learning and usually relies on hand-engineered features, whereas DL extracts feature representations via pre-training or unsupervised learning and then trains the prediction model with supervised learning (of course, not always).
    Since this article focuses on applying DL to Chinese NLP, it uses keras, the simplest and most convenient DL framework to develop with; keras is a high-level framework built on top of two very popular DL frameworks, theano and tensorflow.

    Introduction to NLP

    Natural Language Processing (NLP) is commonly divided into NLU (natural language understanding) and NLG (natural language generation). Word segmentation, part-of-speech tagging, named entity recognition, and dependency parsing are the foundational tasks of NLP, and each of them can be framed as a sequence labeling task.

    Introduction to Sequence Labeling

    In a sequence labeling task the model assigns one tag to every unit of the input (here, every character of the sentence); the tag sets used for segmentation and POS tagging are described in their respective sections below.

    Introduction to Word Embeddings

    The word embedding approach represents a word with a vector of real numbers and is an improvement over the one-hot representation. Its advantages are low dimensionality and the ability to measure semantic similarity between words conveniently with a mathematical distance; the drawback is that with a large vocabulary the model gets rather big. This led to character embedding methods, which produce much smaller models but lose a lot of semantic information, and in turn to research on character embeddings that incorporate word-segmentation information.
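
    As a toy illustration (a minimal sketch with made-up 4-dimensional vectors, not the Word2Vec model loaded later in the code), an embedding maps each word to a dense real-valued vector, and the cosine of the angle between two vectors serves as the similarity measure mentioned above:

    # -*- coding: utf-8 -*-
    #Toy sketch: dense word vectors and cosine similarity.
    #The vectors below are invented for illustration; real models use 50-300 dimensions.
    import numpy as np

    embeddings = {
        u'猫':   np.array([0.8, 0.1, 0.6, 0.2]),
        u'狗':   np.array([0.7, 0.2, 0.5, 0.3]),
        u'汽车': np.array([0.1, 0.9, 0.2, 0.8]),
    }

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    #cat/dog should score higher than cat/car
    print(cosine(embeddings[u'猫'], embeddings[u'狗']))
    print(cosine(embeddings[u'猫'], embeddings[u'汽车']))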

    Chinese NLP Sequence Labeling: CWS

    Introduction to CWS

    Chinese Word Segmentation (CWS) is the foundation of Chinese NLP. Broadly there are two approaches: dictionary-based methods and ML/DL-based methods (for the history of CWS, see 漫话中文分词). In short, dictionary-based methods are simple to implement and fast, but have no good answer to ambiguity or unknown words, while ML/DL-based methods are more complex and slower, but handle ambiguity and OOV (out-of-vocabulary) words much better.
    The most widely used dictionary-based method is probably forward maximum matching, and the best-performing ML algorithm for CWS is CRF. This article focuses on the DL-based approach, but in practice the two approaches should be combined sensibly; a toy forward-maximum-matching segmenter is sketched below.
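
    For comparison with the DL approach described later, here is a minimal sketch of forward maximum matching with a toy dictionary (the dictionary and example sentence are made up for illustration; this is not production code):

    # -*- coding: utf-8 -*-
    #Minimal sketch of forward maximum matching with a toy dictionary.
    #Scan left to right, always taking the longest dictionary word that matches.
    def fmm_segment(sent, word_dict, max_len = 4):
        words = []
        i = 0
        while i < len(sent):
            #try the longest possible match first; fall back to a single character
            for j in range(min(max_len, len(sent) - i), 0, -1):
                cand = sent[i:i + j]
                if j == 1 or cand in word_dict:
                    words.append(cand)
                    i += j
                    break
        return words

    toy_dict = set([u'研究', u'研究生', u'生命', u'起源'])
    print(u' '.join(fmm_segment(u'研究生命起源', toy_dict)))
    #prints: 研究生 命 起源 -- a classic ambiguity that maximum matching gets wrong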

    Tag Set and Evaluation

    This article uses the B (Begin: first character of a word), M (Middle: interior character of a word), E (End: last character of a word), S (Single-character word) tag set. The training corpus and the evaluation tool follow SIGHAN; for details see my other article, SIGHAN测评中文分词的方法与指标介绍. A small example of the tagging scheme follows.
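
    Under this tag set, a whitespace-segmented sentence is labeled character by character as in the small snippet below (illustration only; doc2vec() in the preprocessing code does the same thing when building the training labels):

    # -*- coding: utf-8 -*-
    #Sketch: convert a whitespace-segmented sentence into per-character B/M/E/S tags.
    def word2tags(word):
        if len(word) == 1:
            return ['S']
        return ['B'] + ['M'] * (len(word) - 2) + ['E']

    sent = u'今天 天气 真 不错'
    pairs = [(c, t) for w in sent.split() for c, t in zip(w, word2tags(w))]
    print(u' '.join(u'%s/%s' % (c, t) for c, t in pairs))
    #prints: 今/B 天/E 天/B 气/E 真/S 不/B 错/E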

    Model

    The idea is to train a bi-directional LSTM model, use it to predict a tag probability distribution for each character of a sentence, and then run the Viterbi algorithm to find the optimal tag sequence. For the segmentation task there is no need to add word vectors, since the improvement is not noticeable.
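
    To see why the Viterbi step matters, note that picking the most probable tag for each character independently can produce illegal transitions such as B followed directly by S, whereas decoding with the tag transition probabilities rules such paths out. A toy illustration (the probabilities are invented, not taken from the model):

    #Toy illustration: greedy per-character argmax vs. sequence-level decoding.
    #Tag order matches corpus_tags = ['S', 'B', 'M', 'E']; the numbers are invented.
    import numpy as np

    tags = ['S', 'B', 'M', 'E']
    #per-character tag probabilities for a two-character word
    emit = np.array([[0.10, 0.55, 0.05, 0.30],   #char 1: greedy pick is B
                     [0.45, 0.05, 0.10, 0.40]])  #char 2: greedy pick is S
    greedy = [tags[i] for i in emit.argmax(axis = 1)]
    print(greedy)  #['B', 'S'] -- illegal: B must be followed by M or E

    #A transition matrix estimated from the corpus gives B->S (near) zero probability,
    #so Viterbi decoding (see viterbi.py below) picks the legal path B -> E instead.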

    Implementation

    Preprocessing

    #!/usr/bin/env python
    #-*- coding: utf-8 -*-
    
    #Thu Mar  3 11:01:05 CST 2016 by Demobin
    
    import json
    import h5py
    import string
    import codecs
    
    corpus_tags = ['S', 'B', 'M', 'E']
    
    def saveCwsInfo(path, cwsInfo):
        '''Save the CWS vocab and the initial/transition probabilities.'''
        print('save cws info to %s'%path)
        fd = open(path, 'w')
        (initProb, tranProb), (vocab, indexVocab) = cwsInfo
        j = json.dumps((initProb, tranProb))
        fd.write(j + '\n')
        for char in vocab:
            fd.write(char.encode('utf-8') + '\t' + str(vocab[char]) + '\n')
        fd.close()
    
    def loadCwsInfo(path):
        '''Load the CWS vocab and the initial/transition probabilities.'''
        print('load cws info from %s'%path)
        fd = open(path, 'r')
        line = fd.readline()
        j = json.loads(line.strip())
        initProb, tranProb = j[0], j[1]
        lines = fd.readlines()
        fd.close()
        vocab = {}
        indexVocab = [0 for i in range(len(lines))]
        for line in lines:
            rst = line.strip().split('\t')
            if len(rst) < 2: continue
            char, index = rst[0].decode('utf-8'), int(rst[1])
            vocab[char] = index
            indexVocab[index] = char
        return (initProb, tranProb), (vocab, indexVocab)
    
    def saveCwsData(path, cwsData):
        '''Save the CWS training samples (X, y).'''
        print('save cws data to %s'%path)
        #hdf5 is the most efficient way to store large matrices
        fd = h5py.File(path,'w')
        (X, y) = cwsData
        fd.create_dataset('X', data = X)
        fd.create_dataset('y', data = y)
        fd.close()
    
    def loadCwsData(path):
        '''Load the CWS training samples (X, y).'''
        print('load cws data from %s'%path)
        fd = h5py.File(path,'r')
        X = fd['X'][:]
        y = fd['y'][:]
        fd.close()
        return (X, y)
    
    def sent2vec2(sent, vocab, ctxWindows = 5):
        
        charVec = []
        for char in sent:
            if char in vocab:
                charVec.append(vocab[char])
            else:
                charVec.append(vocab['retain-unknown'])
        #pad both ends of the sentence
        num = len(charVec)
        pad = int((ctxWindows - 1)/2)
        for i in range(pad):
            charVec.insert(0, vocab['retain-padding'] )
            charVec.append(vocab['retain-padding'] )
        X = []
        for i in range(num):
            X.append(charVec[i:i + ctxWindows])
        return X
    
    def sent2vec(sent, vocab, ctxWindows = 5):
        chars = []
        for char in sent:
            chars.append(char)
        return sent2vec2(chars, vocab, ctxWindows = ctxWindows)
    
    def doc2vec(fname, vocab):
        '''Convert a whole pre-segmented document into training vectors.'''

        #read the whole file at once; mind the memory usage
        fd = codecs.open(fname, 'r', 'utf-8')
        lines = fd.readlines()
        fd.close()
    
        #training samples
        X = []
        y = []
    
        #tag statistics
        tagSize = len(corpus_tags)
        tagCnt = [0 for i in range(tagSize)]
        tagTranCnt = [[0 for i in range(tagSize)] for j in range(tagSize)]
    
        #iterate over lines
        for line in lines:
            #split on whitespace
            words = line.strip('\n').split()
            #characters and tags for this line
            chars = []
            tags = []
            #iterate over words
            for word in words:
                #words with two or more characters
                if len(word) > 1:
                    #first character of the word
                    chars.append(word[0])
                    tags.append(corpus_tags.index('B'))
                    #middle characters of the word
                    for char in word[1:(len(word) - 1)]:
                        chars.append(char)
                        tags.append(corpus_tags.index('M'))
                    #last character of the word
                    chars.append(word[-1])
                    tags.append(corpus_tags.index('E'))
                #single-character word
                else:
                    chars.append(word)
                    tags.append(corpus_tags.index('S'))
    
            #character window vectors for this line
            lineVecX = sent2vec2(chars, vocab, ctxWindows = 7)

            #collect tag statistics
            lineVecY = []
            lastTag = -1
            for tag in tags:
                #label vector
                lineVecY.append(tag)
                #lineVecY.append(corpus_tags[tag])
                #count tag frequency
                tagCnt[tag] += 1
                #count tag transition frequency
                if lastTag != -1:
                    tagTranCnt[lastTag][tag] += 1
                #remember the previous tag
                lastTag = tag

            X.extend(lineVecX)
            y.extend(lineVecY)

        #total character count
        charCnt = sum(tagCnt)
        #total transition count
        tranCnt = sum([sum(tag) for tag in tagTranCnt])
        #initial tag probabilities
        initProb = []
        for i in range(tagSize):
            initProb.append(tagCnt[i]/float(charCnt))
        #tag transition probabilities
        tranProb = []
        for i in range(tagSize):
            p = []
            for j in range(tagSize):
                p.append(tagTranCnt[i][j]/float(tranCnt))
            tranProb.append(p)

        return X, y, initProb, tranProb
    
    def genVocab(fname, delimiters = [' ', '\n']):
        
        #read the whole file at once; mind the memory usage
        fd = codecs.open(fname, 'r', 'utf-8')
        data = fd.read()
        fd.close()
    
        vocab = {}
        indexVocab = []
        #iterate over all characters
        index = 0
        for char in data:
            #delimiters do not go into the vocab
            if char not in delimiters and char not in vocab:
                vocab[char] = index
                indexVocab.append(char)
                index += 1
    
        #add the unknown-character and padding entries
        vocab['retain-unknown'] = len(vocab)
        vocab['retain-padding'] = len(vocab)
        indexVocab.append('retain-unknown')
        indexVocab.append('retain-padding')
        #return the vocab dict and the index->char list
        return vocab, indexVocab
    
    def load(fname):
        print 'train from file', fname
        delims = [' ', '\n']
        vocab, indexVocab = genVocab(fname)
        X, y, initProb, tranProb = doc2vec(fname, vocab)
        print len(X), len(y), len(vocab), len(indexVocab)
        print initProb
        print tranProb
        return (X, y), (initProb, tranProb), (vocab, indexVocab)
    
    if __name__ == '__main__':
        load('~/work/corpus/icwb2/training/msr_training.utf8')
    
    

    Model

    #!/usr/bin/env python
    #-*- coding: utf-8 -*-
    
    #Thu Mar  3 11:01:05 CST 2016 by Demobin
    
    import numpy as np
    import json
    import h5py
    import codecs
    
    from dataset import cws
    from util import viterbi
    
    from sklearn.model_selection import train_test_split
    
    from keras.preprocessing import sequence
    from keras.optimizers import SGD, RMSprop, Adagrad
    from keras.utils import np_utils
    from keras.models import Sequential,Graph, model_from_json
    from keras.layers.core import Dense, Dropout, Activation, TimeDistributedDense
    from keras.layers.embeddings import Embedding
    from keras.layers.recurrent import LSTM, GRU, SimpleRNN
    
    from gensim.models import Word2Vec
    
    def train(cwsInfo, cwsData, modelPath, weightPath):
    
        (initProb, tranProb), (vocab, indexVocab) = cwsInfo
        (X, y) = cwsData
    
        train_X, test_X, train_y, test_y = train_test_split(X, y , train_size=0.9, random_state=1)
    
        train_X = np.array(train_X)
        train_y = np.array(train_y)
        test_X = np.array(test_X)
        test_y = np.array(test_y)
        
        outputDims = len(cws.corpus_tags)
        Y_train = np_utils.to_categorical(train_y, outputDims)
        Y_test = np_utils.to_categorical(test_y, outputDims)
        batchSize = 128
        vocabSize = len(vocab) + 1
        wordDims = 100
        maxlen = 7
        hiddenDims = 100
    
        w2vModel = Word2Vec.load('model/sougou.char.model')
        embeddingDim = w2vModel.vector_size
        embeddingUnknown = [0 for i in range(embeddingDim)]
        embeddingWeights = np.zeros((vocabSize + 1, embeddingDim))
        for word, index in vocab.items():
            if word in w2vModel:
                e = w2vModel[word]
            else:
                e = embeddingUnknown
            embeddingWeights[index, :] = e
        
        #LSTM
        model = Sequential()
        model.add(Embedding(output_dim = embeddingDim, input_dim = vocabSize + 1, 
            input_length = maxlen, mask_zero = True, weights = [embeddingWeights]))
        model.add(LSTM(output_dim = hiddenDims, return_sequences = True))
        model.add(LSTM(output_dim = hiddenDims, return_sequences = False))
        model.add(Dropout(0.5))
        model.add(Dense(outputDims))
        model.add(Activation('softmax'))
        model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
        
        result = model.fit(train_X, Y_train, batch_size = batchSize, 
                        nb_epoch = 20, validation_data = (test_X,Y_test), show_accuracy=True)
        
        j = model.to_json()
        fd = open(modelPath, 'w')
        fd.write(j)
        fd.close()
        
        model.save_weights(weightPath)
    
        return model
    
    def loadModel(modelPath, weightPath):
    
        fd = open(modelPath, 'r')
        j = fd.read()
        fd.close()
        
        model = model_from_json(j)
        
        model.load_weights(weightPath)
    
        return model
    
    
    # Predict the tag sequence for an input sentence
    def cwsSent(sent, model, cwsInfo):
        (initProb, tranProb), (vocab, indexVocab) = cwsInfo
        vec = cws.sent2vec(sent, vocab, ctxWindows = 7)
        vec = np.array(vec)
        probs = model.predict_proba(vec)
        #classes = model.predict_classes(vec)
    
        prob, path = viterbi.viterbi(vec, cws.corpus_tags, initProb, tranProb, probs.transpose())
    
        #assemble the segmented sentence from the decoded tag path
        ss = ''
        word = ''
        for i, t in enumerate(path):
            if cws.corpus_tags[t] == 'S':
                ss += sent[i] + ' '
                word = ''
            elif cws.corpus_tags[t] == 'B':
                word += sent[i]
            elif cws.corpus_tags[t] == 'E':
                word += sent[i]
                ss += word + ' '
                word = ''
            elif cws.corpus_tags[t] == 'M': 
                word += sent[i]
    
        return ss
    
    def cwsFile(fname, dstname, model, cwsInfo):
        fd = codecs.open(fname, 'r', 'utf-8')
        lines = fd.readlines()
        fd.close()
    
        fd = open(dstname, 'w')
        for line in lines:
            rst = cwsSent(line.strip(), model, cwsInfo)
            fd.write(rst.encode('utf-8') + '\n')
        fd.close()
    
    def test():
        print 'Loading vocab...'
        cwsInfo = cws.loadCwsInfo('./model/cws.info')
        cwsData = cws.loadCwsData('./model/cws.data')
        print 'Done!'
        print 'Loading model...'
        #model = train(cwsInfo, cwsData, './model/cws.w2v.model', './model/cws.w2v.model.weights')
        #model = loadModel('./model/cws.w2v.model', './model/cws.w2v.model.weights')
        model = loadModel('./model/cws.model', './model/cws.model.weights')
        print 'Done!'
        print '-------------start predict----------------'
        #s = u'为寂寞的夜空画上一个月亮'
        #print cwsSent(s, model, cwsInfo)
        cwsFile('~/work/corpus/icwb2/testing/msr_test.utf8', './msr_test.utf8.cws', model, cwsInfo)
    
    if __name__ == '__main__':
        test()
    

    The Viterbi Algorithm

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    
    #Thu Jan 28 17:14:03 CST 2016 by Demobin
    
    def _print(hiddenstates, V):
        s = "    " + " ".join(("%7d" % i) for i in range(len(V))) + "\n"
        for i, state in enumerate(hiddenstates):
            s += "%.5s: " % state
            s += " ".join("%.7s" % ("%f" % v[i]) for v in V)
            s += "\n"
        print(s)
    
    #Standard Viterbi algorithm; arguments: observations, hidden states, and a probability triple (initial, transition, emission)
    def viterbi(obs, states, start_p, trans_p, emit_p):
    
        lenObs = len(obs)
        lenStates = len(states)
    
        V = [[0.0 for col in range(lenStates)] for row in range(lenObs)]
        path = [[0 for col in range(lenObs)] for row in range(lenStates)]
    
        #time step t = 0
        for y in range(lenStates):
            #V[0][y] = start_p[y] * emit_p[y][obs[0]]
            V[0][y] = start_p[y] * emit_p[y][0]
            path[y][0] = y
    
        #time steps t >= 1
        for t in range(1, lenObs):
            newpath = [[0.0 for col in range(lenObs)] for row in range(lenStates)]
    
            for y in range(lenStates):
                prob = -1
                state = 0
                for y0 in range(lenStates):
                    #nprob = V[t - 1][y0] * trans_p[y0][y] * emit_p[y][obs[t]]
                    nprob = V[t - 1][y0] * trans_p[y0][y] * emit_p[y][t]
                    if nprob > prob:
                        prob = nprob
                        state = y0
                        #record the max probability so far
                        V[t][y] = prob
                        #record the best path so far
                        newpath[y][:t] = path[state][:t]
                        newpath[y][t] = y
    
            path = newpath
    
        prob = -1
        state = 0
        for y in range(lenStates):
            if V[lenObs - 1][y] > prob:
                prob = V[lenObs - 1][y]
                state = y
    
        #_print(states, V)
        return prob, path[state]
    
    def example():
        #hidden states
        hiddenstates = ('Healthy', 'Fever')
        #observations
        observations = ('normal', 'cold', 'dizzy')
    
        #initial probabilities
        '''
        'Healthy': 0.6, 'Fever': 0.4
        '''
        start_p = [0.6, 0.4]
        #transition probabilities
        '''
        'Healthy': {'Healthy': 0.7, 'Fever': 0.3},
        'Fever':   {'Healthy': 0.4, 'Fever': 0.6}
        '''
        trans_p = [[0.7, 0.3], [0.4, 0.6]]
        #emission (a.k.a. output/observation) probabilities
        '''
        'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
        'Fever':   {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6}
        '''
        emit_p = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
    
        return viterbi(observations,
                       hiddenstates,
                       start_p,
                       trans_p,
                       emit_p)
    
    if __name__ == '__main__':
        print(example())
    

    Chinese NLP Sequence Labeling: POS

    Preprocessing

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    #Thu Mar  3 11:01:05 CST 2016 by Demobin
    
    import h5py
    import json
    import codecs
    
    mappings = {
        #People's Daily tag set -> 863 tag set
                'w':    'wp',
                't':    'nt',
                'nr':   'nh',
                'nx':   'nz',
                'nn':   'n',
                'nzz':  'n',
                'Ng':   'n',
                'f':    'nd',
                's':    'nl',
                'Vg':   'v',
                'vd':   'v',
                'vn':   'v',
                'vnn':  'v',
                'ad':   'a',
                'an':   'a',
                'Ag':   'a',
                'l':    'i',
                'z':    'a',
                'mq':   'm',
                'Mg':   'm',
                'Tg':   'nt',
                'y':    'u',
                'Yg':   'u',
                'Dg':   'd',
                'Rg':   'r',
                'Bg':   'b',
                'pn':   'p',
            }
    
    tags_863 = {
            'a' :    [0, '形容词'],
            'b' :    [1, '区别词'],
            'c' :    [2, '连词'],
            'd' :    [3, '副词'],
            'e' :    [4, '叹词'],
            'g' :    [5, '语素字'],
            'h' :    [6, '前接成分'],
            'i' :    [7, '习用语'],
            'j' :    [8, '简称'],
            'k' :    [9, '后接成分'],
            'm' :    [10, '数词'],
            'n' :    [11, '名词'],
            'nd':    [12, '方位名词'],
            'nh':    [13, '人名'],
            'ni':    [14, '团体、机构、组织的专名'],
            'nl':    [15, '处所名词'],
            'ns':    [16, '地名'],
            'nt':    [17, '时间名词'],
            'nz':    [18, '其它专名'],
            'o' :    [19, '拟声词'],
            'p' :    [20, '介词'],
            'q' :    [21, '量词'],
            'r' :    [22, '代词'],
            'u' :    [23, '助词'],
            'v' :    [24, '动词'],
            'wp':    [25, '标点'],
            'ws':    [26, '字符串'],
            'x' :    [27, '非语素字'],
        }
    
    def genCorpusTags():
        '''Helper used to generate the corpus_tags list below from tags_863 x {b, m, e, s}.'''
        s = ''
        features = ['b', 'm', 'e', 's']
        for tag in tags_863:
            for f in features:
                s += '\'' + tag + '-' + f + '\'' + ','
        print s
    
    corpus_tags = [
            'nh-b','nh-m','nh-e','nh-s','ni-b','ni-m','ni-e','ni-s','nl-b','nl-m','nl-e','nl-s','nd-b','nd-m','nd-e','nd-s','nz-b','nz-m','nz-e','nz-s','ns-b','ns-m','ns-e','ns-s','nt-b','nt-m','nt-e','nt-s','ws-b','ws-m','ws-e','ws-s','wp-b','wp-m','wp-e','wp-s','a-b','a-m','a-e','a-s','c-b','c-m','c-e','c-s','b-b','b-m','b-e','b-s','e-b','e-m','e-e','e-s','d-b','d-m','d-e','d-s','g-b','g-m','g-e','g-s','i-b','i-m','i-e','i-s','h-b','h-m','h-e','h-s','k-b','k-m','k-e','k-s','j-b','j-m','j-e','j-s','m-b','m-m','m-e','m-s','o-b','o-m','o-e','o-s','n-b','n-m','n-e','n-s','q-b','q-m','q-e','q-s','p-b','p-m','p-e','p-s','r-b','r-m','r-e','r-s','u-b','u-m','u-e','u-s','v-b','v-m','v-e','v-s','x-b','x-m','x-e','x-s'
        ]
    
    def savePosInfo(path, posInfo):
        '''Save the POS vocab and the initial/transition probabilities.'''
        print('save pos info to %s'%path)
        fd = open(path, 'w')
        (initProb, tranProb), (vocab, indexVocab) = posInfo
        j = json.dumps((initProb, tranProb))
        fd.write(j + '\n')
        for char in vocab:
            fd.write(char.encode('utf-8') + '\t' + str(vocab[char]) + '\n')
        fd.close()
    
    def loadPosInfo(path):
        '''Load the POS vocab and the initial/transition probabilities.'''
        print('load pos info from %s'%path)
        fd = open(path, 'r')
        line = fd.readline()
        j = json.loads(line.strip())
        initProb, tranProb = j[0], j[1]
        lines = fd.readlines()
        fd.close()
        vocab = {}
        indexVocab = [0 for i in range(len(lines))]
        for line in lines:
            rst = line.strip().split('\t')
            if len(rst) < 2: continue
            char, index = rst[0].decode('utf-8'), int(rst[1])
            vocab[char] = index
            indexVocab[index] = char
        return (initProb, tranProb), (vocab, indexVocab)
    
    def savePosData(path, posData):
        '''Save the POS training samples (X, y).'''
        print('save pos data to %s'%path)
        #hdf5 is the most efficient way to store large matrices
        fd = h5py.File(path,'w')
        (X, y) = posData
        fd.create_dataset('X', data = X)
        fd.create_dataset('y', data = y)
        fd.close()
    
    def loadPosData(path):
        '''Load the POS training samples (X, y).'''
        print('load pos data from %s'%path)
        fd = h5py.File(path,'r')
        X = fd['X'][:]
        y = fd['y'][:]
        fd.close()
        return (X, y)
    
    def sent2vec2(sent, vocab, ctxWindows = 5):
        
        charVec = []
        for char in sent:
            if char in vocab:
                charVec.append(vocab[char])
            else:
                charVec.append(vocab['retain-unknown'])
        #pad both ends of the sentence
        num = len(charVec)
        pad = int((ctxWindows - 1)/2)
        for i in range(pad):
            charVec.insert(0, vocab['retain-padding'] )
            charVec.append(vocab['retain-padding'] )
        X = []
        for i in range(num):
            X.append(charVec[i:i + ctxWindows])
        return X
    
    def sent2vec(sent, vocab, ctxWindows = 5):
        chars = []
        words = sent.split()
        for word in words:
            #words with two or more characters
            if len(word) > 1:
                #first character of the word
                chars.append(word[0] + '_b')
                #middle characters of the word
                for char in word[1:(len(word) - 1)]:
                    chars.append(char + '_m')
                #last character of the word
                chars.append(word[-1] + '_e')
            #single-character word
            else:
                chars.append(word + '_s')
        
        return sent2vec2(chars, vocab, ctxWindows = ctxWindows)
    
    def doc2vec(fname, vocab):
        '''Convert a whole tagged document into training vectors.'''

        #read the whole file at once; mind the memory usage
        fd = codecs.open(fname, 'r', 'utf-8')
        lines = fd.readlines()
        fd.close()
    
        #training samples
        X = []
        y = []
    
        #tag statistics
        tagSize = len(corpus_tags)
        tagCnt = [0 for i in range(tagSize)]
        tagTranCnt = [[0 for i in range(tagSize)] for j in range(tagSize)]
    
        #iterate over lines
        for line in lines:
            #split on whitespace
            words = line.strip('\n').split()
            #characters and tags for this line
            chars = []
            tags = []
            #iterate over words
            for word in words:
                rst = word.split('/')
                #skip tokens that do not carry a '/tag' part
                if len(rst) < 2:
                    print word
                    continue
                word, tag = rst[0], rst[1].decode('utf-8')
                if tag not in tags_863:
                    tag = mappings[tag]
                #words with two or more characters
                if len(word) > 1:
                    #first character of the word
                    chars.append(word[0] + '_b')
                    tags.append(corpus_tags.index(tag + '-' + 'b'))
                    #middle characters of the word
                    for char in word[1:(len(word) - 1)]:
                        chars.append(char + '_m')
                        tags.append(corpus_tags.index(tag + '-' + 'm'))
                    #last character of the word
                    chars.append(word[-1] + '_e')
                    tags.append(corpus_tags.index(tag + '-' + 'e'))
                #single-character word
                else:
                    chars.append(word + '_s')
                    tags.append(corpus_tags.index(tag + '-' + 's'))
    
            #character window vectors for this line
            lineVecX = sent2vec2(chars, vocab, ctxWindows = 7)

            #collect tag statistics
            lineVecY = []
            lastTag = -1
            for tag in tags:
                #label vector
                lineVecY.append(tag)
                #lineVecY.append(corpus_tags[tag])
                #count tag frequency
                tagCnt[tag] += 1
                #count tag transition frequency
                if lastTag != -1:
                    tagTranCnt[lastTag][tag] += 1
                #remember the previous tag
                lastTag = tag

            X.extend(lineVecX)
            y.extend(lineVecY)

        #total character count
        charCnt = sum(tagCnt)
        #total transition count
        tranCnt = sum([sum(tag) for tag in tagTranCnt])
        #initial tag probabilities
        initProb = []
        for i in range(tagSize):
            initProb.append(tagCnt[i]/float(charCnt))
        #tag transition probabilities
        tranProb = []
        for i in range(tagSize):
            p = []
            for j in range(tagSize):
                p.append(tagTranCnt[i][j]/float(tranCnt))
            tranProb.append(p)

        return X, y, initProb, tranProb
    
    def vocabAddChar(vocab, indexVocab, index, char):
        if char not in vocab:
            vocab[char] = index
            indexVocab.append(char)
            index += 1
        return index
    
    def genVocab(fname, delimiters = [' ', '\n']):
        
        #read the whole file at once; mind the memory usage
        fd = codecs.open(fname, 'r', 'utf-8')
        lines = fd.readlines()
        fd.close()
    
        vocab = {}
        indexVocab = []
        #iterate over all lines
        index = 0
        for line in lines:
            words = line.strip().split()
            if len(words) <= 0: continue
            #iterate over all words
            for word in words:
                word, tag = word.split('/')
                #words with two or more characters
                if len(word) > 1:
                    #first character of the word
                    char = word[0] + '_b'
                    index = vocabAddChar(vocab, indexVocab, index, char)
                    #middle characters of the word
                    for char in word[1:(len(word) - 1)]:
                        char = char + '_m'
                        index = vocabAddChar(vocab, indexVocab, index, char)
                    #last character of the word
                    char = word[-1] + '_e'
                    index = vocabAddChar(vocab, indexVocab, index, char)
                #single-character word
                else:
                    char = word + '_s'
                    index = vocabAddChar(vocab, indexVocab, index, char)
    
        #add the unknown-character and padding entries
        vocab['retain-unknown'] = len(vocab)
        vocab['retain-padding'] = len(vocab)
        indexVocab.append('retain-unknown')
        indexVocab.append('retain-padding')
        #return the vocab dict and the index->char list
        return vocab, indexVocab
    
    def load(fname):
        print 'train from file', fname
        delims = [' ', '\n']
        vocab, indexVocab = genVocab(fname)
        X, y, initProb, tranProb = doc2vec(fname, vocab)
        print len(X), len(y), len(vocab), len(indexVocab)
        print initProb
        print tranProb
        return (X, y), (initProb, tranProb), (vocab, indexVocab)
    
    def test():
        load('../data/pos.train')
    
    if __name__ == '__main__':
        test()
    

    Model

    #!/usr/bin/env python
    #-*- coding: utf-8 -*-
    
    #Thu Mar  3 11:01:05 CST 2016 by Demobin
    
    import numpy as np
    import json
    import h5py
    import codecs
    
    from dataset import pos
    from util import viterbi
    
    from sklearn.model_selection import train_test_split
    
    from keras.preprocessing import sequence
    from keras.optimizers import SGD, RMSprop, Adagrad
    from keras.utils import np_utils
    from keras.models import Sequential,Graph, model_from_json
    from keras.layers.core import Dense, Dropout, Activation, TimeDistributedDense
    from keras.layers.embeddings import Embedding
    from keras.layers.recurrent import LSTM, GRU, SimpleRNN
    
    from util import pChar
    
    def train(posInfo, posData, modelPath, weightPath):
    
        (initProb, tranProb), (vocab, indexVocab) = posInfo
        (X, y) = posData
    
        train_X, test_X, train_y, test_y = train_test_split(X, y , train_size=0.9, random_state=1)
    
        train_X = np.array(train_X)
        train_y = np.array(train_y)
        test_X = np.array(test_X)
        test_y = np.array(test_y)
        
        outputDims = len(pos.corpus_tags)
        Y_train = np_utils.to_categorical(train_y, outputDims)
        Y_test = np_utils.to_categorical(test_y, outputDims)
        batchSize = 128
        vocabSize = len(vocab) + 1
        wordDims = 100
        maxlen = 7
        hiddenDims = 100
    
        w2vModel, vectorSize = pChar.load('model/pChar.model')
        embeddingDim = int(vectorSize)
        embeddingUnknown = [0 for i in range(embeddingDim)]
        embeddingWeights = np.zeros((vocabSize + 1, embeddingDim))
        for word, index in vocab.items():
            if word in w2vModel:
                e = w2vModel[word]
            else:
                print word
                e = embeddingUnknown
            embeddingWeights[index, :] = e
        
        #LSTM
        model = Sequential()
        model.add(Embedding(output_dim = embeddingDim, input_dim = vocabSize + 1, 
            input_length = maxlen, mask_zero = True, weights = [embeddingWeights]))
        model.add(LSTM(output_dim = hiddenDims, return_sequences = True))
        model.add(LSTM(output_dim = hiddenDims, return_sequences = False))
        model.add(Dropout(0.5))
        model.add(Dense(outputDims))
        model.add(Activation('softmax'))
        model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
        
        result = model.fit(train_X, Y_train, batch_size = batchSize, 
                        nb_epoch = 20, validation_data = (test_X,Y_test), show_accuracy=True)
        
        j = model.to_json()
        fd = open(modelPath, 'w')
        fd.write(j)
        fd.close()
        
        model.save_weights(weightPath)
    
        return model
        #Bi-directional LSTM
    
    def loadModel(modelPath, weightPath):
    
        fd = open(modelPath, 'r')
        j = fd.read()
        fd.close()
        
        model = model_from_json(j)
        
        model.load_weights(weightPath)
    
        return model
    
    
    # Predict the tag sequence for an input sentence
    def posSent(sent, model, posInfo):
        (initProb, tranProb), (vocab, indexVocab) = posInfo
        vec = pos.sent2vec(sent, vocab, ctxWindows = 7)
        vec = np.array(vec)
        probs = model.predict_proba(vec)
        #classes = model.predict_classes(vec)
    
        prob, path = viterbi.viterbi(vec, pos.corpus_tags, initProb, tranProb, probs.transpose())
    
        ss = ''
        words = sent.split()
        index = -1
        for word in words:
            for char in word:
                index += 1
            ss += word + '/' + pos.tags_863[pos.corpus_tags[path[index]][:-2]][1].decode('utf-8') + ' '
            #ss += word + '/' + pos.corpus_tags[path[index]][:-2] + ' '
    
        return ss[:-1]
    
    def posFile(fname, dstname, model, posInfo):
        fd = codecs.open(fname, 'r', 'utf-8')
        lines = fd.readlines()
        fd.close()
    
        fd = open(dstname, 'w')
        for line in lines:
            rst = posSent(line.strip(), model, posInfo)
            fd.write(rst.encode('utf-8') + '\n')
        fd.close()
    
    def test():
        print 'Loading vocab...'
        #(X, y), (initProb, tranProb), (vocab, indexVocab) = pos.load('data/pos.train')
        #posInfo = ((initProb, tranProb), (vocab, indexVocab))
        #posData = (X, y)
        #pos.savePosInfo('./model/pos.info', posInfo)
        #pos.savePosData('./model/pos.data', posData)
        posInfo = pos.loadPosInfo('./model/pos.info')
        posData = pos.loadPosData('./model/pos.data')
        print 'Done!'
        print 'Loading model...'
        #model = train(posInfo, posData, './model/pos.w2v.model', './model/pos.w2v.model.weights')
        model = loadModel('./model/pos.w2v.model', './model/pos.w2v.model.weights')
        #model = loadModel('./model/pos.model', './model/pos.model.weights')
        print 'Done!'
        print '-------------start predict----------------'
        s = u'为 寂寞 的 夜空 画 上 一个 月亮'
        print posSent(s, model, posInfo)
        #posFile('~/work/corpus/icwb2/testing/msr_test.utf8', './msr_test.utf8.pos', model, posInfo)
    
    if __name__ == '__main__':
        test()
    

    Chinese NLP Sequence Labeling: NER

    Preprocessing

    
    

    Model

    
    

    Chinese NLP Sequence Labeling: DP (Dependency Parsing)

    To be continued...
    PS: Pasting all the code makes this post rather long; I will tidy it up when I find the time.

