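NLP Steps, Broken Down
Original article:
Overview of Artificial Intelligence and Role of Natural Language Processing in Big Data
by Jagreet Kaur
Comment: The original explanation is somewhat scattered, so I have reorganized it according to my own understanding.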
Step 1. Syntax Analysis
1.1 Sentence Segmentation
1.2 Tokenization
1.3 Stemming / Lemmatization
1.4 Part-of-Speech Tagging
1.5 Parsing
1.6 Named Entity Recognition
Step 2. Semantic Analysis / Natural Language Understanding
2.1 Word Sense Understanding
2.2 Ambiguity Resolution
2.2.1 Lexical Ambiguity
2.2.2 Syntactic Ambiguity
2.2.3 Semantic Ambiguity
2.2.4 Anaphoric Ambiguity
Step 3. Pragmatic Analysis (Intent Understanding)
Step 4. Natural Language Generation
4.1 Text Planning
4.2 Sentence Planning
4.3 Realization
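To make the first few syntax-analysis steps concrete, here is a minimal sketch of sentence segmentation, tokenization, and stemming using only the Python standard library. The function names, the regular expressions, and the suffix list are all illustrative assumptions; real toolkits use far more careful rules or trained models.

```python
import re

def split_sentences(text):
    """Sentence segmentation: naive split after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Tokenization: alphanumeric words (optionally with an apostrophe part)
    or single punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", sentence)

def stem(token):
    """Stemming: toy suffix stripping, not a real algorithm such as Porter's."""
    word = token.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

text = "The cats are playing. Dogs barked loudly!"
for sentence in split_sentences(text):
    print([stem(t) for t in tokenize(sentence)])
```

Abbreviations such as "Dr." would break this splitter, which is one reason production systems treat segmentation as a learned task rather than a single regular expression.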
A Brief Look at How Major NLP Service Providers Define Their Products
1. Google Cloud Platform
1.1 Natural Language API
Notes: Audio input is supported only in combination with the Speech API.
- Syntax Analysis
Definition: Extract tokens and sentences, identify parts of speech (POS) and create dependency parse trees for each sentence.
That is, syntax analysis; it roughly covers steps 1.1 to 1.5 described above.
- Entity Analysis
Definition: Inspects the given text for known entities (Proper nouns such as public figures, landmarks, and so on. Common nouns such as restaurant, stadium, and so on.) and returns information about those entities.
That is, entity recognition.
- Sentiment Analysis
Definition: Understand the overall sentiment expressed in a block of text. Identify the prevailing emotional opinion within the text, especially to determine a writer's attitude as positive, negative, or neutral.
That is, sentiment analysis.
- Entity Sentiment Analysis
Definition: Understand the sentiment for each mention of an entity within a block of text.
That is, sentiment analysis targeted at individual entities.
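As a rough illustration of what "positive, negative, or neutral" scoring means, the sketch below uses a tiny hand-made word list. This is not how the Natural Language API works internally; the lexicon, the `sentiment` function, and the scoring formula are assumptions made purely for illustration.

```python
import re

# Tiny hand-made lexicon, purely illustrative.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def sentiment(text):
    """Return (score, label), where score lies in [-1, 1]."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    # No opinion words at all counts as neutral.
    score = 0.0 if total == 0 else (pos - neg) / total
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label

print(sentiment("I love this great phone"))
print(sentiment("The food was terrible"))
```

Entity-level sentiment would apply the same idea to the words surrounding each entity mention instead of the whole text.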
1.2 Cloud Translation API
- Text Translation
- Language Detection
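Comment: Machine translation is really a practical application of NLP. It is listed here briefly to show how each vendor has positioned its offerings.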
2. Microsoft Azure
2.1 Language Understanding Intelligent Service
Definition: Enable developers to build smart applications that can understand human language and react accordingly to user request. Extract intents and entities that correspond to activities in client application's logic.
That is, intent plus entity analysis.
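The idea of mapping an utterance to an intent plus entities can be sketched with simple keyword rules. The intents, the `City` entity type, and the `understand` function below are hypothetical and belong to an imaginary travel app; the actual service relies on trained models rather than rules like these.

```python
import re

# Hypothetical intents and trigger keywords for an imaginary travel app.
INTENT_KEYWORDS = {
    "BookFlight": ("book", "flight"),
    "CheckWeather": ("weather",),
}

# Hypothetical entity rule: a capitalized word after "to" or "in" is a city.
CITY_PATTERN = re.compile(r"\b(?:to|in)\s+([A-Z][a-z]+)")

def understand(utterance):
    """Return the first matching intent and any city entities found."""
    lowered = utterance.lower()
    intent = "None"
    for name, keywords in INTENT_KEYWORDS.items():
        if all(k in lowered for k in keywords):
            intent = name
            break
    entities = [{"type": "City", "value": m} for m in CITY_PATTERN.findall(utterance)]
    return {"intent": intent, "entities": entities}

print(understand("Book a flight to Paris"))
```

The client application would then dispatch on the returned intent and use the entities as parameters of that action.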
2.2 Text Analytics API
- Sentiment Analysis
Definition: Extract features from POS tags and embedded words of the text, then use classification techniques to get a score that reflects people's attitude.
That is, sentiment analysis.
- Key Phrase Extraction
Definition: Extract key phrases to quickly identify the main points.
Notes: This technology comes from the NLP toolkit of Microsoft Office.
- Language Detection
Definition: The API returns the detected language and a numeric score to indicate the certainty. 120 languages are supported.
Reference:
https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/
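To show what "a detected language and a numeric score" could look like, here is a toy detector that counts stopword hits per language. The stopword sets, the `detect_language` function, and the scoring rule are assumptions for illustration only and cover just three languages; they say nothing about how the Text Analytics API computes its score.

```python
import re

# Very small, illustrative stopword sets.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "de", "les"},
    "es": {"el", "la", "y", "es", "de", "los"},
}

def detect_language(text):
    """Return (language_code, score), with score in [0, 1]."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    hits = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    # Iterate in sorted order so ties are broken alphabetically.
    best = max(sorted(hits), key=hits.get)
    if hits[best] == 0:
        return "unknown", 0.0
    # Score: share of all stopword hits that belong to the winning language.
    return best, hits[best] / sum(hits.values())

print(detect_language("The cat is in the garden"))
print(detect_language("Le chat est dans le jardin"))
```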
2.3 Linguistic Analysis API
- Sentence separation and tokenization
Definition: Break the text into sentences and tokens.
That is, sentence segmentation and tokenization.
- Part-of-Speech Tagging
That is, part-of-speech tagging.
- Constituency Parsing
Definition: Identify the phrases in the text. A phrase is a sequence of words. It can be moved together or replaced as a whole, and the sentence should remain fluent and grammatical.
That is, syntactic parsing.
Reference:
https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/
2.4 Bing Spell Check API
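Definition: Help users correct spelling errors, recognize the difference among names, brand names, and slang, as well as understand homophones as they're typing.
Notes: Unlike the conventional spell checker in Microsoft Word, Bing uses a third-generation system. Rather than depending on dictionaries and the people who maintain them, it is updated and expanded by training its algorithms with machine learning and statistical machine translation on large volumes of web searches and documents.
The API has two modes, Proof and Spell. Proof catches more grammatical errors but supports only American English.
Reference:
https://azure.microsoft.com/en-us/services/cognitive-services/spell-check/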
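The general idea of learning corrections from text statistics instead of a curated dictionary can be sketched with the classic edit-distance-plus-frequency approach. This is a swapped-in textbook technique, not Bing's system; the toy corpus, `edits1`, and `correct` are assumptions for illustration.

```python
from collections import Counter
import re

# Toy corpus standing in for "large volumes of web text" (illustrative only).
CORPUS = "the quick brown fox jumps over the lazy dog the dog barks"
WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS))

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, else the input."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word
    # Break frequency ties alphabetically so the result is deterministic.
    return max(candidates, key=lambda w: (WORD_COUNTS[w], w))

print(correct("teh"), correct("dgo"))
```

Because the word counts come from observed text, new names and slang become "known" as soon as they appear often enough, without anyone editing a dictionary.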
2.5 Microsoft Translator API
Notes: At present, this API is still based on statistical machine translation (SMT). SMT's performance gains have plateaued, so breakthroughs in translation quality are hard to come by. Translation based on deep neural networks (DNN) is waiting in the wings, but as of August 27 it is available only to users of the Microsoft Translator Speech API. Currently, Skype Translator uses the DNN translation engine and Bing Translator uses the SMT translation engine.
- Text Translation API
- Speech Translation API
Definition: Transcribe conversational speech from one language into text of another language. The API also integrates text-to-speech capabilities to speak the translated text back.
Notes: The translation process includes using ASR to recognize the corresponding text from the source-language audio. On top of ASR, Microsoft applies a new technology called TrueText to refine the recognized text. TrueText can filter out filler words, coughs, and profanity, and can also correct punctuation and capitalization.
- Collaborative Translation Framework Reporting API
Definition: Allow users to recommend alternative translations to those provided by Translator's automatic translation engine.
- *Microsoft Translator Hub
Definition: Let developers customize a language pair for a specific domain (area of terminology and style) or to build automatic translation for a language that is not yet covered by Microsoft Translation API.
It is an extension of the Microsoft Translator API and service.
Reference:
https://azure.microsoft.com/en-us/services/cognitive-services/translator-text-api/
2.6 Web Language Model API
Definition: Automate a variety of standard natural language processing tasks.
- Word Breaking
- Joint Probabilities
Definition: Calculate how often a particular sequence of words appears together.
- Conditional Probabilities
Definition: Given a sequence of words, calculate how often a particular word tends to follow.
- Next word completions
Definition: Given a sequence of words, get the list of words most likely to follow.
Reference:
https://azure.microsoft.com/en-us/services/cognitive-services/web-language-model/
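The joint probability, conditional probability, and next-word operations listed above can be illustrated with a tiny bigram model built from counts. The toy corpus and the function names are assumptions for illustration; the actual service is trained on web-scale data and may use longer n-grams and smoothing, neither of which is modeled here.

```python
from collections import Counter

# Toy corpus (illustrative only); "." is treated as an ordinary token.
CORPUS = "i like green tea . i like black tea . i drink green tea ."
TOKENS = CORPUS.split()
UNIGRAMS = Counter(TOKENS)
BIGRAMS = Counter(zip(TOKENS, TOKENS[1:]))
# How often each word occurs as the first word of a bigram.
CONTEXTS = Counter(TOKENS[:-1])

def conditional_probability(prev, word):
    """Maximum-likelihood estimate of P(word | prev), without smoothing."""
    if CONTEXTS[prev] == 0:
        return 0.0
    return BIGRAMS[(prev, word)] / CONTEXTS[prev]

def joint_probability(words):
    """P(w1..wn) under the bigram model: P(w1) times each P(wi | wi-1)."""
    if not words:
        return 0.0
    p = UNIGRAMS[words[0]] / len(TOKENS)
    for prev, word in zip(words, words[1:]):
        p *= conditional_probability(prev, word)
    return p

def next_words(prev, k=3):
    """The k most frequent followers of prev; ties broken alphabetically."""
    followers = [(w, c) for (p, w), c in BIGRAMS.items() if p == prev]
    followers.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in followers[:k]]

print(conditional_probability("i", "like"))
print(next_words("i"))
```

Without smoothing, any sequence containing an unseen bigram gets probability zero, which is acceptable for a sketch but not for real use.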