Using CoreNLP with Java (Chinese)

Author: wxrg2012 | Published: 2018-07-18 17:09


1. Create a Maven project in IntelliJ IDEA

2. Additions to pom.xml

<properties>
    <corenlp.version>3.9.1</corenlp.version>
</properties>
<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>${corenlp.version}</version>
</dependency>

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>${corenlp.version}</version>
    <classifier>models</classifier>
</dependency>

<dependency>
    <groupId>edu.stanford.nlp</groupId>
    <artifactId>stanford-corenlp</artifactId>
    <version>${corenlp.version}</version>
    <classifier>models-chinese</classifier>
</dependency>

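Maven then downloads the specified version of the JARs from the Maven repository and stores them in the local repository.
If the download is slow it may be interrupted. In that case, try a mirror in China by editing settings.xml in the conf folder under the Maven installation directory, as follows: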

<mirrors>
   <mirror>
     <id>alimaven</id>
     <name>aliyun maven</name>
     <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
     <mirrorOf>central</mirrorOf>        
   </mirror>
 </mirrors>

Note: if you use the Maven bundled with IDEA, open the file as follows (the open command is macOS-specific):

cd ~/.m2/
open settings.xml

3. Contents of CoreNLP-chinese.properties

# Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)

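# Sets which annotators the pipeline includes (an annotator is a text-analysis tool you need; its result is one or more annotations)
# segment: word segmentation, ssplit: sentence splitting, pos: part-of-speech tagging, lemma: lemmatization (has -> have), ner: named entity recognition, parse: syntactic parsing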
annotators = segment, ssplit, pos, lemma, ner, parse, sentiment, mention, coref
#annotators = segment, ssplit, pos, parse, sentiment 


# segment: word segmentation
customAnnotatorClass.segment = edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
segment.model = edu/stanford/nlp/models/segmenter/chinese/pku.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true

# sentence split
ssplit.boundaryTokenRegex = [.]|[!?]+|[\u3002]|[\uFF01\uFF1F]+

# pos
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger

# ner: sets the language and the (CRF) model used for NER. SUTime currently supports only English, not Chinese, so it is set to false.
ner.language = chinese
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = true
ner.useSUTime = false

#parse
parse.model = edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz

# coref
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.input.type = raw
coref.postprocessing = true
coref.calculateFeatureImportance = false
coref.useConstituencyTree = true
coref.useSemantics = false
coref.md.type = RULE
coref.mode = hybrid
coref.path.word2vec =
coref.language = zh
coref.print.md.log = false
coref.defaultPronounAgreement = true
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
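
The examples in section 5 each change the annotators line of this file by hand. As an alternative, the same override can be done in code with java.util.Properties. The sketch below uses only the JDK; the class PipelineProps and its methods are hypothetical helpers, not part of CoreNLP. Handing the result to the StanfordCoreNLP constructor that takes a Properties object is shown only as a comment, because it needs the CoreNLP jars on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: override the annotators list without editing the file.
class PipelineProps {
    /** Parses properties text (same syntax as CoreNLP-chinese.properties). */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory reader
        }
        return props;
    }

    /** Returns a copy of base with the annotators list replaced. */
    static Properties withAnnotators(Properties base, String... annotators) {
        Properties copy = new Properties();
        copy.putAll(base);
        copy.setProperty("annotators", String.join(", ", annotators));
        return copy;
    }

    /** Splits the comma-separated annotators value into trimmed names. */
    static List<String> annotators(Properties props) {
        List<String> names = new ArrayList<>();
        for (String name : props.getProperty("annotators", "").split(",")) {
            if (!name.trim().isEmpty()) {
                names.add(name.trim());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Properties full = parse("annotators = segment, ssplit, pos, lemma, ner\n"
                + "ner.useSUTime = false\n");
        Properties segOnly = withAnnotators(full, "segment", "ssplit");
        System.out.println(annotators(segOnly)); // [segment, ssplit]
        // With CoreNLP on the classpath: new StanfordCoreNLP(segOnly)
    }
}
```

All other keys (models, regexes) are copied unchanged, so only the annotator list differs between tasks.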

4. Main features

1) Glossary

English term	Annotator name and notes
Summary	Feature overview
Annotator dependencies	Dependencies between annotators
Tokenization	tokenize
Sentence Splitting	ssplit
Lemmatization	lemma (e.g. has => have)
Parts of Speech	pos
Named Entity Recognition	ner
RegexNER (Named Entity Recognition)	regexner (often used to supplement ner)
Constituency Parsing	parse (organizes a sentence into phrases)
Dependency Parsing	depparse
Coreference Resolution	coref (e.g. 美国总统 <=> 特朗普)
Natural Logic	natlog
Open Information Extraction	openie
Sentiment	sentiment
Relation Extraction	relation
Quote Annotator	quote (extracts quoted content)
CleanXML Annotator	cleanxml (strips XML tags from the input)

2) Dependencies of each annotator

Annotator	Annotator class	Depends on
tokenize	TokenizerAnnotator	None
cleanxml	CleanXmlAnnotator	tokenize/segment
ssplit	WordsToSentencesAnnotator	tokenize/segment
pos	POSTaggerAnnotator	tokenize/segment, ssplit
lemma	MorphaAnnotator	tokenize/segment, ssplit, pos
ner	NERClassifierCombiner	tokenize/segment, ssplit, pos, lemma
regexner	RegexNERAnnotator	tokenize/segment, ssplit, pos, lemma, ner
sentiment	SentimentAnnotator	tokenize/segment, ssplit, parse
parse	ParserAnnotator	tokenize/segment, ssplit, pos, lemma
depparse	DependencyParseAnnotator	tokenize, ssplit, pos
coref	CorefAnnotator	tokenize/segment, ssplit, pos, lemma, ner, parse
relation	RelationExtractorAnnotator	tokenize, ssplit, pos, lemma, ner, depparse
natlog	NaturalLogicAnnotator	tokenize, ssplit, pos, lemma, depparse/parse
quote	QuoteAnnotator	None
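
The table above can be encoded as a map and used to sanity-check an annotators list before a pipeline is built. The sketch below does this with the JDK only. The class AnnotatorDeps is hypothetical, it covers just a subset of the rows, and it writes segment where the table says tokenize/segment. It reflects the table as written here and is separate from whatever checks CoreNLP itself performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical checker built from a subset of the dependency table.
class AnnotatorDeps {
    static final Map<String, List<String>> REQUIRES = new LinkedHashMap<>();
    static {
        REQUIRES.put("segment", Collections.<String>emptyList());
        REQUIRES.put("ssplit", Arrays.asList("segment"));
        REQUIRES.put("pos", Arrays.asList("segment", "ssplit"));
        REQUIRES.put("lemma", Arrays.asList("segment", "ssplit", "pos"));
        REQUIRES.put("ner", Arrays.asList("segment", "ssplit", "pos", "lemma"));
        REQUIRES.put("parse", Arrays.asList("segment", "ssplit", "pos", "lemma"));
        REQUIRES.put("sentiment", Arrays.asList("segment", "ssplit", "parse"));
    }

    /** Lists every dependency that is missing or appears too late in the list. */
    static List<String> problems(List<String> annotators) {
        List<String> problems = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String name : annotators) {
            List<String> deps = REQUIRES.get(name);
            if (deps == null) {
                problems.add(name + ": not in the table");
            } else {
                for (String dep : deps) {
                    if (!seen.contains(dep)) {
                        problems.add(name + " needs " + dep + " earlier in the list");
                    }
                }
            }
            seen.add(name);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(problems(Arrays.asList("segment", "ssplit", "pos")));
        // []
        System.out.println(problems(Arrays.asList("segment", "ssplit", "pos", "ner")));
        // [ner needs lemma earlier in the list]
    }
}
```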

5. Examples

1) Loading the models

Create a class that loads the models (shared by all of the modules below)

import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Singleton that loads the models once and shares a single pipeline.
public class CoreNLPHel {
    private static final CoreNLPHel instance = new CoreNLPHel();
    private final StanfordCoreNLP pipeline;

    private CoreNLPHel() {
        // The configuration file from step 3, placed under main/resources
        String props = "CoreNLP-chinese.properties";
        pipeline = new StanfordCoreNLP(props);
    }

    public static CoreNLPHel getInstance() {
        return instance;
    }

    public StanfordCoreNLP getPipeline() {
        return pipeline;
    }
}
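
The class above shares one pipeline because loading models is expensive, but the examples that follow each need a different annotators list. One possible extension, sketched here with the JDK only, is to keep one instance per annotator list and build it on first use. PipelineCache is a hypothetical class; the factory in main returns a plain string as a stand-in, where a real factory would build a StanfordCoreNLP from Properties.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache: one expensive object per annotator list.
class PipelineCache<T> {
    private final Map<String, T> cache = new ConcurrentHashMap<>();
    private final Function<String, T> factory;

    PipelineCache(Function<String, T> factory) {
        this.factory = factory;
    }

    /** Returns the cached object for this annotator list, building it on first use. */
    T get(String annotators) {
        return cache.computeIfAbsent(annotators, factory);
    }

    public static void main(String[] args) {
        // Stand-in factory; a real one would construct a pipeline here.
        PipelineCache<String> cache =
                new PipelineCache<String>(a -> "pipeline[" + a + "]");
        String first = cache.get("segment, ssplit");
        String second = cache.get("segment, ssplit");
        System.out.println(first == second); // true: built once, then reused
    }
}
```

The key is the raw annotators string, so "segment,ssplit" and "segment, ssplit" would count as different entries; normalizing the key is left out to keep the sketch short.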

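2) Chinese word segmentation

Edit the configuration file
Only the sentence-splitting and word-segmentation modules are needed; comment out the others.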

annotators = segment, ssplit, pos, lemma, ner, parse, sentiment, mention, coref

Change it to:

annotators = segment, ssplit

Create the class used for Chinese word segmentation

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.List;

public class Segmentation {
   private String segtext="";

   public String getSegtext() {
       return segtext;
   }
   public Segmentation(String text){
       CoreNLPHel coreNLPHel = CoreNLPHel.getInstance();
       StanfordCoreNLP pipeline = coreNLPHel.getPipeline();
       Annotation annotation = new Annotation(text);
       pipeline.annotate(annotation);
       List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
       StringBuilder sb = new StringBuilder();
       for (CoreMap sentence:sentences){
           for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)){
               String word = token.get(CoreAnnotations.TextAnnotation.class);
               sb.append(word);
               sb.append(" ");
           }
       }
       segtext = sb.toString().trim();
   }

}

Chinese word segmentation example

public class Test {
    public static void main(String[] args) {
        System.out.println(new Segmentation("这家酒店很好，我很喜欢。").getSegtext());
        System.out.println(new Segmentation("他和我在学校里常打桌球;毕业了").getSegtext());
        System.out.println(new Segmentation("貌似实际用的不是这几篇").getSegtext());
        System.out.println(new Segmentation("硕士研究生产？批判官员的尺度").getSegtext());
        System.out.println(new Segmentation("我是中国人。").getSegtext());
    }
}
Output:
这 家 酒店 很 好 ， 我 很 喜欢 。
他 和 我 在 学校 里 常 打 桌球 ; 毕业 了
貌似 实际 用 的 不 是 这 几 篇
硕士 研究生 产 ？ 批判 官员 的 尺度
我 是 中国人 。

Custom dictionary
Edit the configuration file CoreNLP-chinese.properties

segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz

Change it to:

segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz,path/to/your/cedict.txt

In my own test this had no effect; to be completed (it is the only method I could find).

3) Chinese sentence splitting

Edit the configuration file
Keep the sentence-splitting part and comment out the rest.

annotators = segment, ssplit, pos, lemma, ner, parse, sentiment, mention, coref

Change it to:

annotators = tokenize, ssplit

You can also change the sentence-boundary pattern:

ssplit.boundaryTokenRegex = [.。；;]|[!?！？]+
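
To see which tokens a boundary pattern like the one above would treat as sentence ends, it can be tried directly with java.util.regex. This sketch assumes the pattern is matched against each whole token (that is my understanding of how ssplit applies it, not something shown in this article). BoundaryRegexDemo is a hypothetical class name.

```java
import java.util.regex.Pattern;

// Hypothetical demo: test tokens against the boundary pattern.
class BoundaryRegexDemo {
    // Same pattern as the ssplit.boundaryTokenRegex value above.
    static final Pattern BOUNDARY = Pattern.compile("[.。；;]|[!?！？]+");

    /** True if the whole token matches the boundary pattern. */
    static boolean endsSentence(String token) {
        return BOUNDARY.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String token : new String[] {"。", "；", "?!", "，", "总统"}) {
            System.out.println(token + " -> " + endsSentence(token));
        }
    }
}
```

Under that assumption, a comma or an ordinary word never ends a sentence, while a run such as ?! counts as a single boundary token.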

Create the class used for sentence splitting

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.ArrayList;
import java.util.List;

public class SenSplit {

    private ArrayList<String> sensRes = new ArrayList<String>();

    public ArrayList<String> getSensRes() {
        return sensRes;   // the list of sentences (an ArrayList)
    }
    public SenSplit(String text){
        CoreNLPHel coreNLPHel = CoreNLPHel.getInstance();
        StanfordCoreNLP pipeline = coreNLPHel.getPipeline();
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences){
            sensRes.add(sentence.get(CoreAnnotations.TextAnnotation.class));
        }
    }
}

Chinese sentence splitting example

import java.util.ArrayList;
public class Test {
    public static void main(String []args){
        String text = "巴拉克·奥巴马是美国总统。他在2008年当选?今年的美国总统是特朗普？普京的粉丝";
        ArrayList<String>sensRes = new SenSplit(text).getSensRes();
        for(String str:sensRes){
            System.out.println(str);
        }
    }
}
Output:
巴拉克·奥巴马是美国总统。
他在2008年当选?
今年的美国总统是特朗普？
普京的粉丝

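4) Part-of-speech tagging

Edit the configuration file
Keep the word-segmentation, sentence-splitting, and POS-tagging modules; comment out the others.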

annotators = segment, ssplit, pos, lemma, ner, parse, sentiment, mention, coref

Change it to:

annotators = segment, ssplit, pos

Create the class used for POS tagging

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.List;

public class PosTag {

    private String postext = "";

    public String getPostext() {
        return postext;
    }
    public PosTag(String text){

        CoreNLPHel coreNLPHel = CoreNLPHel.getInstance();
        StanfordCoreNLP pipeline = coreNLPHel.getPipeline();
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
        StringBuilder sb = new StringBuilder();
        for (CoreMap sentence:sentences){
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)){
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                sb.append(word);
                sb.append("/");
                sb.append(pos);
                sb.append(" ");
            }
        }
        postext = sb.toString().trim();
    }
}

POS tagging example

public class Test {
    public static void main(String []args){
        String text = "巴拉克·奥巴马是美国总统。他在2008年当选?今年的美国总统是特朗普？普京的粉丝";
        System.out.println(new PosTag(text).getPostext());
    }
}
Output:
巴拉克·奥巴马/NR 是/VC 美国/NR 总统/NN 。/PU 他/PN 在/P 2008年/NT 当选/VV ?/PU 今年/NT 的/DEG 美国/NR 总统/NN 是/VC 特朗普/NN ？/PU 普京/NR 的/DEG 粉丝/NN
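
The word/TAG string above is convenient to read but awkward to process further. A minimal sketch of turning it back into pairs, using only the JDK, is shown below; TaggedToken is a hypothetical class. It splits each item on the last slash, on the assumption that a tag never contains a slash while a word might.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value class for one word/TAG pair.
class TaggedToken {
    final String word;
    final String tag;

    TaggedToken(String word, String tag) {
        this.word = word;
        this.tag = tag;
    }

    /** Splits "word/TAG word/TAG ..." on whitespace, then on the last '/' of each item. */
    static List<TaggedToken> parse(String tagged) {
        List<TaggedToken> tokens = new ArrayList<>();
        for (String item : tagged.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash > 0) { // skip items without a word/tag separator
                tokens.add(new TaggedToken(item.substring(0, slash),
                        item.substring(slash + 1)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print only the nominal tags (NR, NT, NN) from part of the output above.
        for (TaggedToken t : parse("普京/NR 的/DEG 粉丝/NN")) {
            if (t.tag.startsWith("N")) {
                System.out.println(t.word + "\t" + t.tag);
            }
        }
    }
}
```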

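Meaning of the POS tags

Tag	Meaning	Examples / notes
VA	Predicative adjective	A predicate that takes no object and can be modified by 很
VC	Copula	Links two noun phrases or a subject: 他是/VC 学生
VE	有 as the main verb	有, 没{有}, and 无 when used as the main verb
VV	Other verbs	Modal verbs, raising predicates, control verbs, action verbs, psych verbs, etc.
NR	Proper noun	Names of people, politically or geographically defined places, organizations
NT	Temporal noun	一月, 汉朝, 当今, 何时, 今后
NN	Other nouns	Ethnicities, nationalities, job titles, diseases
LC	Localizer	Indicates direction or position: 前, 后, 里, 外, 之间, 以北
PN	Pronoun	我, 你, 这, 那, 如其
DT	Determiner	Demonstratives (such as 这, 那, 该) and words such as 每, 各, 前, 后
CD	Cardinal number	好些, 若干, 半, 许多, 很多
OD	Ordinal number	第 + CD is treated as one word and tagged OD: 第一百
M	Measure word	个, 群, 公里, 升
AD	Adverb	仍然, 很, 最, 大大, 又, 约
P	Preposition	从, 对
CC	Coordinating conjunction	与, 和, 或, 或者, 还是 (or)
CS	Subordinating conjunction	如果/CS, …… 就/AD ……
DEC	的 as a complementizer or nominalizer	拿来吃的/DEC
DEG	的 as an associative or genitive marker	普京/NR 的/DEG 粉丝/NN
DER	得 in complement constructions	In V-得-R and V-得 structures, 得 is tagged DER
DEV	Manner 地	地 when it appears in "地 VP": 很大程度地完成
AS	Aspect marker	Only 着, 了, 过, 的
SP	Sentence-final particle	他好吧[SP]？
ETC	Used to tag 等	None
MSP	Other particles	所, 以, 来, 而 are tagged MSP when they appear before a VP
IJ	Interjection	Interjections at the start of a sentence, e.g. 啊
ON	Onomatopoeia	雨哗哗, 砰[ON]的/DEG一声！, 砰砰[ON]！
LB	Long 被 construction	Only 被, 叫, 给, 为 (colloquial); 他叫/VV你去
SB	Short 被 construction	NP0 + SB + VP: 他被/SB 训了/AS一顿/M
BA	把 construction	他把/BA你骗了/AS
JJ	Other noun modifiers	共同/JJ的/DEG目标/NN
FW	Foreign word	Words whose POS tag is unclear from the context
PU	Punctuation	。？！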

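5) Named entity recognition

Edit the configuration file
Keep the word-segmentation, sentence-splitting, POS-tagging, and NER modules; comment out the others.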

annotators = segment, ssplit, pos, lemma, ner, parse, sentiment, mention, coref

Change it to:

annotators = segment, ssplit, pos, ner

Create the class used for named entity recognition

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.List;

public class NamedEntity {

    private String nertext = "";

    public String getNertext() {
        return nertext;
    }

    public NamedEntity(String text){
        CoreNLPHel coreNLPHel = CoreNLPHel.getInstance();
        StanfordCoreNLP pipeline = coreNLPHel.getPipeline();
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
        StringBuilder sb = new StringBuilder();
        for (CoreMap sentence:sentences){
            // Get the sentence's tokens (the words produced by segmentation)
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)){
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                String ner = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                sb.append(word);
                sb.append("/");
                sb.append(ner);
                sb.append(" ");
            }
        }
        nertext = sb.toString().trim();
    }
}

Named entity recognition example

public class Test {
    public static void main(String []args){
        String text = "巴拉克·奥巴马是美国总统。他在2008年当选?今年的美国总统是特朗普？普京的粉丝";
        System.out.println(new NamedEntity(text).getNertext());
    }
}
Output:
巴拉克·奥巴马/PERSON 是/O 美国/GPE 总统/O 。/O 他/O 在/O 2008年/DATE 当选/O ?/O 今年/DATE 的/O 美国/GPE 总统/O 是/O 特朗普/O ？/O 普京/PERSON 的/O 粉丝/O
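
In the output above every token carries its own tag, and most are O (not an entity). A common next step is to keep only the entities and to merge neighboring tokens that share a tag. The sketch below does this on the word/TAG string with the JDK only. EntityGrouper is a hypothetical class, and merging by identical adjacent tags is a simplification: two different entities of the same type that sit side by side would be glued together.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processor for the word/TAG output string.
class EntityGrouper {
    /** Merges runs of adjacent tokens that share a non-O tag into "text/TAG" strings. */
    static List<String> entities(String nerText) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (String item : nerText.trim().split("\\s+")) {
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                continue; // not a word/TAG pair
            }
            String word = item.substring(0, slash);
            String tag = item.substring(slash + 1);
            if (!tag.equals(currentTag)) {
                flush(result, current, currentTag); // the previous run has ended
                currentTag = tag;
            }
            if (!tag.equals("O")) {
                current.append(word); // Chinese text: join without spaces
            }
        }
        flush(result, current, currentTag);
        return result;
    }

    private static void flush(List<String> result, StringBuilder current, String tag) {
        if (current.length() > 0) {
            result.add(current + "/" + tag);
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(entities(
                "巴拉克·奥巴马/PERSON 是/O 美国/GPE 总统/O 他/O 在/O 2008年/DATE"));
        // [巴拉克·奥巴马/PERSON, 美国/GPE, 2008年/DATE]
    }
}
```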

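6) Constituency parse trees and dependency parsing

Edit the configuration file
Keep the word-segmentation, sentence-splitting, POS-tagging, lemma, and parse modules; comment out the others.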

annotators = segment, ssplit, pos, lemma, ner, parse, sentiment, mention, coref

Change it to:

annotators = segment, ssplit, pos, lemma, parse

Create the class used for constituency parsing and dependency parsing

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.List;

public class SPTree {
    List<CoreMap>sentences;

    public SPTree(String text){
        CoreNLPHel coreNLPHel = CoreNLPHel.getInstance();
        StanfordCoreNLP pipeline = coreNLPHel.getPipeline();
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
    }

    // Dependency graph of each sentence (dependency parsing)
    public String getDepprasetext() {
        StringBuilder sb2 = new StringBuilder();
        for (CoreMap sentence:sentences){
            String sentext = sentence.get(CoreAnnotations.TextAnnotation.class);
            SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
            sb2.append(sentext);
            sb2.append("\n");
            sb2.append(graph.toString(SemanticGraph.OutputFormat.LIST));
            sb2.append("\n");
        }
        return sb2.toString().trim();
    }
    // Constituency parse tree of each sentence
    public String getPrasetext() {
        StringBuilder sb1 = new StringBuilder();
        for (CoreMap sentence:sentences){
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            String sentext = sentence.get(CoreAnnotations.TextAnnotation.class);
            sb1.append(sentext);
            sb1.append("/");
            sb1.append(tree.toString());
            sb1.append("\n");
        }
        return sb1.toString().trim();
    }
}

Constituency parsing example

public class Test {
    public static void main(String []args){
        String text = "巴拉克·奥巴马是美国总统。他在2008年当选?今年的美国总统是特朗普？普京的粉丝";
        SPTree spTree = new SPTree(text);
        System.out.println(spTree.getPrasetext());
    }
}
Output:
巴拉克·奥巴马是美国总统。/(ROOT (IP (NP (NR 巴拉克·奥巴马)) (VP (VC 是) (NP (NP (NR 美国)) (NP (NN 总统)))) (PU 。)))
他在2008年当选?/(ROOT (IP (NP (PN 他)) (VP (PP (P 在) (NP (NT 2008年))) (VP (VV 当选))) (PU ?)))
今年的美国总统是特朗普？/(ROOT (IP (NP (DNP (NP (NT 今年)) (DEG 的)) (NP (NR 美国)) (NP (NN 总统))) (VP (VC 是) (NP (NN 特朗普))) (PU ？)))
普京的粉丝/(ROOT (NP (DNP (NP (NR 普京)) (DEG 的)) (NP (NN 粉丝))))
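
In the bracketed trees above, the innermost groups have the form (TAG word), so the tagged words can be recovered from the printed string with a regular expression. The sketch below is a JDK-only illustration of that reading of the format, not a general tree parser; TreeLeaves is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pull (TAG word) pairs out of a bracketed parse string.
class TreeLeaves {
    // A preterminal is "(TAG word)" with no nested parentheses inside it.
    private static final Pattern PRETERMINAL =
            Pattern.compile("\\(([^\\s()]+) ([^\\s()]+)\\)");

    /** Returns "word/TAG" for every preterminal, in left-to-right order. */
    static List<String> leaves(String bracketed) {
        List<String> out = new ArrayList<>();
        Matcher m = PRETERMINAL.matcher(bracketed);
        while (m.find()) {
            out.add(m.group(2) + "/" + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String tree = "(ROOT (NP (DNP (NP (NR 普京)) (DEG 的)) (NP (NN 粉丝))))";
        System.out.println(leaves(tree)); // [普京/NR, 的/DEG, 粉丝/NN]
    }
}
```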

Dependency parsing example

public class Test {
    public static void main(String []args){
        String text = "巴拉克·奥巴马是美国总统。他在2008年当选?今年的美国总统是特朗普？普京的粉丝";
        SPTree spTree = new SPTree(text);
        System.out.println(spTree.getDepprasetext());
    }
}
Output:
巴拉克·奥巴马是美国总统。:
root(ROOT-0, 总统-4)
nsubj(总统-4, 巴拉克·奥巴马-1)
cop(总统-4, 是-2)
nmod:assmod(总统-4, 美国-3)
punct(总统-4, 。-5)

他在2008年当选?:
root(ROOT-0, 当选-4)
nsubj(当选-4, 他-1)
case(2008年-3, 在-2)
nmod:prep(当选-4, 2008年-3)
punct(当选-4, ?-5)

今年的美国总统是特朗普？:
root(ROOT-0, 特朗普-6)
nmod(总统-4, 今年-1)
case(今年-1, 的-2)
nmod:assmod(总统-4, 美国-3)
nsubj(特朗普-6, 总统-4)
cop(特朗普-6, 是-5)
punct(特朗普-6, ？-7)

普京的粉丝:
root(ROOT-0, 粉丝-3)
nmod:assmod(粉丝-3, 普京-1)
case(普京-1, 的-2)
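
Each dependency line above has the shape relation(governor-index, dependent-index), where index 0 is the artificial ROOT. A minimal JDK-only sketch for reading such lines back into fields follows; DepEdge is a hypothetical class, and the pattern is written against the output format shown here rather than taken from any CoreNLP specification.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical value class for one printed dependency edge.
class DepEdge {
    // Words may contain '-', so each word is anchored on its trailing "-digits".
    private static final Pattern LINE =
            Pattern.compile("^([^()\\s]+)\\((.+)-(\\d+), (.+)-(\\d+)\\)$");

    final String relation;
    final String governor;
    final int governorIndex;
    final String dependent;
    final int dependentIndex;

    DepEdge(String relation, String governor, int governorIndex,
            String dependent, int dependentIndex) {
        this.relation = relation;
        this.governor = governor;
        this.governorIndex = governorIndex;
        this.dependent = dependent;
        this.dependentIndex = dependentIndex;
    }

    /** Parses one line such as "nsubj(总统-4, 巴拉克·奥巴马-1)"; returns null if it does not match. */
    static DepEdge parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null; // e.g. the sentence header lines or blank lines
        }
        return new DepEdge(m.group(1), m.group(2), Integer.parseInt(m.group(3)),
                m.group(4), Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        String[] lines = {"普京的粉丝:", "root(ROOT-0, 粉丝-3)", "nmod:assmod(粉丝-3, 普京-1)"};
        for (String line : lines) {
            DepEdge e = parse(line);
            if (e != null) {
                System.out.println(e.dependent + " --" + e.relation + "--> " + e.governor);
            }
        }
    }
}
```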

Explanation of the Stanford Parser dependency relations
