
Bio-NLP CRF Mini-Assignment

Author: caokai001 | Published 2019-05-01 15:35

    Introduction to the BIO format
    AGAC Track official site
    Bio-NLP course link

    Data download

    1. Using the json module (reference)

    import json

    # json.dumps: encode Python objects into a JSON string
    data = [{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}]
    json.dumps(data)
    print(json.dumps({'a': 'Runoob', 'b': 7}, sort_keys=True, indent=4, separators=(',', ':')))

    # json.loads: decode a JSON string back into a Python dict
    jsonData = '{"a":1,"b":2,"c":3,"d":4,"e":5}'
    text = json.loads(jsonData)

    1. json.dumps() and json.loads() handle JSON-formatted strings (it helps to think of JSON as a string):
      (1) json.dumps() encodes a Python object (list, dict, ...) into a JSON string (roughly: dict → string).
      (2) json.loads() decodes a JSON string into a Python object (roughly: string → dict).

    2. json.dump() and json.load() are mainly for reading and writing JSON files; see the sketch below.
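
    A minimal sketch of file-based JSON I/O (the file name example.json is made up for illustration):

    import json

    # json.dump writes a Python object straight to a file object
    with open('example.json', 'w', encoding='utf-8') as f:
        json.dump({'a': 1, 'b': 2}, f, indent=4)

    # json.load parses directly from a file object
    with open('example.json', encoding='utf-8') as f:
        data = json.load(f)
    print(data)  # {'a': 1, 'b': 2}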

    2. Using the spaCy module (reference 1, reference 2)

    Installing third-party modules in PyCharm
    Adding conda to the environment variables on Windows

    On Linux:

    • pip install spacy
    • python -m spacy download en
    (screenshots: downloading the en model, and the printed attributes of the first 10 tokens)

    Some spaCy usage examples (see reference 1):

    text = "The sequel, Yes, Prime Minister, ran from 1986 to 1988. In total there were 38 episodes, of which all but one lasted half an hour. Almost all episodes ended with a variation of the title of the series spoken as the answer to a question posed by the same character, Jim Hacker. Several episodes were adapted for BBC Radio, and a stage play was produced in 2010, the latter leading to a new television series on UKTV Gold in 2013."
    text
    import spacy
    nlp = spacy.load("en")
    doc = nlp(text)
    doc
    # Have spaCy list every token in this passage
    for token in doc:
        print('"' + token.text + '"')
    
    # Print the first 10 tokens with text, character offset, lemma,
    # punctuation/whitespace flags, shape, and coarse/fine POS tags
    for token in doc[:10]:
        print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
            token.text,
            token.idx,
            token.lemma_,
            token.is_punct,
            token.is_space,
            token.shape_,
            token.pos_,
            token.tag_
        ))
        
    # Named entities recognized in the text
    for ent in doc.ents:
        print(ent.text, ent.label_)
        
    
    # Visualize the entities inline (in a Jupyter notebook)
    from spacy import displacy
    displacy.render(doc, style='ent', jupyter=True)
    

    3. The glob module (reference): similar to Linux find -name

    glob.glob
    • Returns a list of all matching file paths. It takes a single argument, pathname, which defines the path-matching rule; it may be an absolute or a relative path.
    glob.iglob
    • Returns an iterable from which matching paths are fetched one at a time. The difference from glob.glob(): glob.glob collects every match at once, while glob.iglob yields one match per step (somewhat like DataSet vs. DataReader in .NET). A short sketch follows.
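
    A minimal sketch of both calls, reusing the AGAC paths from later in this post:

    import glob

    # glob.glob: all matches at once, as a list
    files = glob.glob('/home/kcao/test/tmp/AGAC_training/*.json')
    print(len(files))

    # glob.iglob: an iterator, one match at a time
    for path in glob.iglob('/home/kcao/test/tmp/AGAC_training/*.json'):
        print(path)
        break  # just show the first match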



    4. Splitting into training and test sets (70:30)

    import random
    import os
    import shutil

    def random_copyfile(srcPath, dstPath, lastpath, numfiles):
        # Randomly pick numfiles files for the training set (dstPath);
        # everything else goes to the test set (lastpath)
        name_list = list(os.path.join(srcPath, name) for name in os.listdir(srcPath))
        random_name_list = list(random.sample(name_list, numfiles))
        last = [item for item in name_list if item not in random_name_list]
        for d in (dstPath, lastpath):  # create both target dirs if missing
            if not os.path.exists(d):
                os.mkdir(d)
        for oldname in random_name_list:
            shutil.copyfile(oldname, oldname.replace(srcPath, dstPath))
        for file in last:
            shutil.copyfile(file, file.replace(srcPath, lastpath))

    srcPath = '/home/kcao/test/tmp/AGAC_training'
    dstPath = '/home/kcao/test/tmp/kcao_train_data'
    lastpath = '/home/kcao/test/tmp/kcao_test_data'
    random_copyfile(srcPath, dstPath, lastpath, 175)  # 175 files ≈ 70% for training
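
    A quick sanity check of the split (same paths as above):

    import os
    print(len(os.listdir('/home/kcao/test/tmp/kcao_train_data')))  # expect 175
    print(len(os.listdir('/home/kcao/test/tmp/kcao_test_data')))   # the remaining files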
    

    5. Converting JSON to BIO format for the CRF

    Convert the training and test JSON files to BIO format separately; a sketch of the input the script expects comes first.
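    A minimal, hand-made PubAnnotation-style record showing the fields the script reads (text, denotations, span, obj); the values are invented:

    annotations = {
        "text": "TP53 mutations cause cancer.",
        "denotations": [
            {"span": {"begin": 0, "end": 4}, "obj": "Gene"},
            {"span": {"begin": 21, "end": 27}, "obj": "Disease"}
        ]
    }
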
    # -*- coding: utf-8 -*-
    """
    Created on Tue Apr 10 09:35:15 2019
    
    @author: wyx
    """
    
    import json
    from glob import glob
    import spacy
    
    nlp = spacy.load('en')
    
    def json2bio(fpath, output, splitby='s'):
        '''
        Read a JSON file and append BIO lines (token  pmid  label) to output.
        splitby = 's' ---- blank line after each sentence
        splitby = 'a' ---- blank line after each abstract
        '''
        with open(fpath) as f:
            pmid = fpath[-13:-5]  # 8-character PMID taken from the file name
            annotations = json.load(f)
            text = annotations['text'].replace('\n',' ')
            all_words = text.split(' ')
            all_words2 = [token for token in nlp(text)]
            all_label = ['O']*len(all_words)
            for i in annotations['denotations']:
                b_location = i['span']['begin']
                e_location = i['span']['end']
                label = i['obj']
                B_wordloc = text.count(' ',0,b_location)
                I_wordloc = text.count(' ',0,e_location)
                all_label[B_wordloc] = 'B-'+label
                if B_wordloc != I_wordloc:
                    for word in range(B_wordloc+1,I_wordloc+1):
                        all_label[word] = 'I-'+label
            # now: word list split on spaces, plus the matching label list
            for w,_ in enumerate(all_words):
                all_words[w] = nlp(all_words[w])
            # tokenize each space-split word with spaCy
            labelchange = []
            for i,_ in enumerate(all_words):
                token = [token for token in all_words[i]]
                if len(token)==1:
                    labelchange.append(all_label[i])
                else:
                    if all_label[i] == 'O':
                        labelchange.extend(['O']*len(token))
                    if all_label[i] != 'O':
                        labelchange.append(all_label[i])
                        if str(token[-1]) == '.' or str(token[-1]) == ',':
                            labelchange.extend(['I-'+all_label[i][2:]]*(len(token)-2))
                            labelchange.append('O')
                        else:
                            labelchange.extend(['I-'+all_label[i][2:]]*(len(token)-1))
            
            # write to file
            with open(output,'a',encoding='utf-8') as f:
                # blank line after each sentence
                if splitby == 's':
                    for j,_ in enumerate(all_words2):
                        if str(all_words2[j]) == '.' and str(all_words2[j-1]) != 'p':
                            line =str(all_words2[j])+'\t'+pmid+'\t'+labelchange[j]+'\n'
                            f.write(line+'\n')
                        else:
                            line =str(all_words2[j])+'\t'+pmid+'\t'+labelchange[j]+'\n'
                            f.write(line)
                # blank line after each abstract
                if splitby == 'a':
                    for j,_ in enumerate(all_words2):
                        line =str(all_words2[j])+'\t'+pmid+'\t'+labelchange[j]+'\n'
                        f.write(line)
                    f.write('\n')
    
    
    if __name__ == "__main__":
        fpathlist = glob('/home/kcao/test/tmp/kcao_train_data/*.json')
        output = "/home/kcao/test/tmp/kcao_train_data/train.tab"
        for i in fpathlist:
            json2bio(i,output,'s')
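
    For the invented record above (with a made-up 8-digit pmid), the resulting .tab file would look like this: one token per line as token<TAB>pmid<TAB>label, with a blank line closing each sentence:

    TP53	12345678	B-Gene
    mutations	12345678	O
    cause	12345678	O
    cancer	12345678	B-Disease
    .	12345678	O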
    

    6. Training with Wapiti

    The pattern file (.pat) is shown below; %X[r,c] expands to the observation in column c of the row r positions from the current token, and %M matches a regular expression against it (here extracting up-to-4-character prefixes and suffixes):
    *
    
    U:tok:1:-1:%X[-1,0]
    U:tok:1:+0:%X[0,0]
    U:tok:1:+1:%X[1,0]
    
    U:tok:2:-1:%X[-1,0]/%X[0,0]
    U:tok:2:+0:%X[0,0]/%X[1,0]
    
    U:tok:3:-2:%X[-2,0]/%X[-1,0]/%X[0,0]
    U:tok:3:-1:%X[-1,0]/%X[0,0]/%X[1,0]
    U:tok:3:+0:%X[0,0]/%X[1,0]/%X[2,0]
    
    
    U:pre:1:+0:4:%M[0,0,"^.?.?.?.?"]
    
    U:suf:1:+0:4:%M[0,0,".?.?.?.?$"]
    
    test-wapiti.sh is as follows:
    #!/bin/bash
    traininput_dir="kcao_train_data"
    testinput_dir="kcao_test_data"
    output_dir="kcao_output"
    pattern_file="pat/Tok321dis.pat"
    training_options=' -a sgd-l1 -t 3 -i 10 '
    debug=0
    verbose=0
    patname=$(basename $pattern_file .pat)
    corpus_name=$(basename $traininput_dir)
    
    echo "================ Training $corpus_name (this may take some time) ================" 1>&2
    # training: create a MODEL based on PATTERNS and TRAINING-CORPUS
    # wapiti train -p PATTERNS TRAINING-CORPUS MODEL
    echo "wapiti train $training_options -p $pattern_file <(cat $1) $output_dir/$patname-train-$corpus_name-$3.mod" 1>&2
    
    wapiti train $training_options -p $pattern_file <(cat $traininput_dir/*.tab) $output_dir/$patname-train-$corpus_name.mod
    # wapiti train -a bcd -t 2 -i 5 -p t.pat train-bio.tab t-train-bio.mod
    #
    # Note: The default algorithm, l-bfgs, stops early and does not succeed in annotating any token (all O)
    # sgd-l1 works; bcd works
    
    wapiti dump $output_dir/$patname-train-$corpus_name.mod $output_dir/$patname-train-$corpus_name.txt
    
    echo "================ Inference $corpus_name ================" 1>&2
    # inference (labeling): apply the MODEL to label the TEST-CORPUS, put results in TEST-RESULTS
    # wapiti label -m MODEL TEST-CORPUS TEST-RESULTS
    # -c : check (= evaluate)
    # <(COMMAND ARGUMENTS ...) : runs COMMAND on ARGUMENTS ... and provides the results as if in a file
    echo "wapiti label -c -m $output_dir/$patname-train-$corpus_name-$3.mod <(cat $1) $output_dir/$patname-train-test-$corpus_name-$3.tab" 1>&2
    wapiti label -c -m $output_dir/$patname-train-$corpus_name.mod <(cat $testinput_dir/*) $output_dir/$patname-train-test-$corpus_name.tab
    # wapiti label -c -m t-train-bio.mod test-bio.tab t-train-test-bio.tab
    #echo "================ Evaluation with conlleval.pl $corpus_name ================" 1>&2
    echo "Finished!"
    # evaluate the resulting entities
    # $'\t' is a way to obtain a tabulation in bash
    #echo "$BINDIR/conlleval.pl -d $'\t' < $output_dir/$patname-train-test-$corpus_name-$3.tab | tee $output_dir/$patname-train-test-$corpus_name-$3.eval" 1>&2
    perl conlleval.pl -d $'\t' < $output_dir/$patname-train-test-$corpus_name.tab | tee -a $output_dir/$patname-train-test-$corpus_name.eval
    
    Result summary:
    The FB1 score reported is only 12.73, which is still too low; the features fed to the model probably need revising.

    1. Extend the .pat file with a 4-gram feature: U:tok:4:-3:%X[-3,0]/%X[-2,0]/%X[-1,0]/%X[0,0]


