Neural Dependency Parsing


Author: 一梦换须臾_ | Published 2018-10-11 11:09

Preface

  1. There are currently two mainstream approaches to parsing: Constituency Parsing and Dependency Parsing. This article focuses on Dependency Parsing.
  2. For background, see 阿衡学姐's notes: Dependency Parsing and Treebank
  3. Stanford CS224N lecture notes: Dependency Parsing
  4. Basic paper: A Fast and Accurate Dependency Parser using Neural Networks, which uses greedy transition-based parsing and combines word embeddings with neural networks
  5. Advanced papers:
    Globally Normalized Transition-Based Neural Networks
    Universal Dependencies: A cross-linguistic typology
    Incrementality in Deterministic Dependency Parsing
    Main improvements: deeper neural networks, handling non-projectivity, and graph-based parsing
  6. Base project: Neural Dependency Parsing

Neural transition-based Dependency Parsing

Conventional transition-based Dependency Parsing

  1. Structure: a stack s, a buffer b, and a set A that records the dependency arcs found so far
  2. At each step, a discriminative classifier (such as an SVM) decides which transition to apply next (see the sketch after the figure below)
(Figure: transition-based parsing)
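A minimal sketch of the greedy arc-standard transition loop, to make the stack/buffer/arc-set mechanics concrete. `predict_transition` is a hypothetical stand-in for the classifier and is assumed to propose only legal transitions:

```python
def parse(sentence, predict_transition):
    """Greedy arc-standard parsing.
    sentence: list of tokens (1-based indices refer into it);
    predict_transition(stack, buffer, arcs) -> 'SHIFT', 'LEFT-ARC', or 'RIGHT-ARC'."""
    stack = [0]                                   # index 0 stands for the ROOT token
    buffer = list(range(1, len(sentence) + 1))    # remaining input words
    arcs = []                                     # A: (head, dependent) pairs

    while buffer or len(stack) > 1:
        action = predict_transition(stack, buffer, arcs)
        if action == 'SHIFT' and buffer:
            stack.append(buffer.pop(0))           # move next buffer word onto the stack
        elif action == 'LEFT-ARC' and len(stack) >= 2:
            dependent = stack.pop(-2)             # second-to-top takes the top as its head
            arcs.append((stack[-1], dependent))
        elif action == 'RIGHT-ARC' and len(stack) >= 2:
            dependent = stack.pop()               # top takes the second-to-top as its head
            arcs.append((stack[-1], dependent))
    return arcs
```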

Feature Extraction

  1. Conventionally, words, POS tags, and arc labels are represented as sparse one-hot indicator features, which are expensive to compute
  2. In the neural transition-based dependency parsing model, words, POS tags, and arc labels are instead represented as dense pre-trained embeddings (see the sketch after the figure below)
(Figure: feature extraction)
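A rough sketch of the dense input representation, under assumed vocabulary sizes and embedding tables (all names here are illustrative, not the paper's code). Each selected word / POS tag / arc label is looked up in an embedding table and the vectors are concatenated into one dense input:

```python
import numpy as np

d = 50                                   # embedding dimension used in the paper
word_emb  = np.random.randn(10000, d)    # pre-trained word embeddings (placeholder values)
tag_emb   = np.random.randn(50, d)       # POS-tag embeddings (learned)
label_emb = np.random.randn(40, d)       # arc-label embeddings (learned)

def input_layer(word_ids, tag_ids, label_ids):
    """word_ids / tag_ids / label_ids: indices of the selected feature positions."""
    # Concatenate all looked-up embeddings into a single dense vector,
    # replacing the sparse one-hot indicator features of conventional parsers.
    return np.concatenate([word_emb[word_ids].ravel(),
                           tag_emb[tag_ids].ravel(),
                           label_emb[label_ids].ravel()])
```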

Choice of features

The choice of Sw, St, Sl
Following (Zhang and Nivre, 2011), we pick a rich set of elements for our final parser. In detail, Sw contains nw = 18 elements: (1) the top 3 words on the stack and buffer: s1, s2, s3, b1, b2, b3; (2) the first and second leftmost / rightmost children of the top two words on the stack: lc1(si), rc1(si), lc2(si), rc2(si), i = 1, 2; (3) the leftmost of leftmost / rightmost of rightmost children of the top two words on the stack: lc1(lc1(si)), rc1(rc1(si)), i = 1, 2.
We use the corresponding POS tags for St (nt = 18), and the corresponding arc labels of words excluding those 6 words on the stack/buffer for Sl (nl = 12). A good advantage of our parser is that we can add a rich set of elements cheaply, instead of hand-crafting many more indicator features.
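As a rough illustration of the template above, the following sketch assembles the 18 word positions of Sw from the stack, buffer, and arc set. `kth_child` and `NULL` are hypothetical helpers introduced only for this example:

```python
NULL = -1   # padding index for missing positions

def kth_child(arcs, head, k, leftmost=True):
    # k-th leftmost / rightmost child of `head`, according to the arcs found so far
    children = sorted(d for h, d in arcs if h == head)
    if leftmost:
        return children[k - 1] if len(children) >= k else NULL
    return children[-k] if len(children) >= k else NULL

def word_features(stack, buffer, arcs):
    s = lambda i: stack[-i] if len(stack) >= i else NULL        # i-th stack item from the top
    b = lambda i: buffer[i - 1] if len(buffer) >= i else NULL   # i-th buffer item
    feats = [s(1), s(2), s(3), b(1), b(2), b(3)]                # (1) top 3 of stack and buffer
    for i in (1, 2):
        # (2) first and second leftmost / rightmost children of the top two stack words
        feats += [kth_child(arcs, s(i), 1, True),  kth_child(arcs, s(i), 1, False),
                  kth_child(arcs, s(i), 2, True),  kth_child(arcs, s(i), 2, False)]
        # (3) leftmost of leftmost / rightmost of rightmost children
        feats += [kth_child(arcs, kth_child(arcs, s(i), 1, True), 1, True),
                  kth_child(arcs, kth_child(arcs, s(i), 1, False), 1, False)]
    return feats                                                # 18 word positions in total
```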

Neural Networks

After the embedding lookup, the input is fed to a hidden layer with a novel cube activation function (f(x) = x^3); a softmax layer then predicts the next transition (a sketch follows the figure below).


(Figure: neural model)
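A minimal sketch of the scoring network under assumed dimensions (weights are random placeholders here, and only 3 unlabeled transitions are scored; the actual parser scores labeled transitions):

```python
import numpy as np

input_dim, hidden_dim, num_transitions = 2400, 200, 3   # 48 features * 50 dims; SHIFT/LEFT/RIGHT

W1 = np.random.randn(hidden_dim, input_dim) * 0.01
b1 = np.zeros(hidden_dim)
W2 = np.random.randn(num_transitions, hidden_dim) * 0.01

def predict_probs(x):
    h = (W1 @ x + b1) ** 3                  # cube activation f(x) = x^3
    scores = W2 @ h
    exp = np.exp(scores - scores.max())     # numerically stable softmax
    return exp / exp.sum()                  # probability of each next transition
```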
