Enhancing Pre-trained Chinese Character Representation with Word-aligned Attention

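1. Background

Keywords: character-level representation

The basic semantic unit of Chinese is the word, yet most Chinese pre-trained language models produce character-level representations: each character is represented from the context of individual characters, which discards part of the word-level semantics. Meanwhile, related studies show that incorporating word segmentation information helps language understanding.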

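2. Approach

Keywords: expand the character-level attention mechanism

Extend the character-level attention mechanism of a Chinese pre-trained model so that the model also picks up some word-level information, moving from character level to word level.

Word-level information for Chinese has been studied before. ERNIE (basic-level masking on word pieces + phrase-level masking in WWM style + entity-level masking) and Chinese-BERT-wwm (whole word masking) both change the masking strategy used during pre-training and produce an entirely new pre-trained model. That process is heavyweight and expensive.

The authors of this paper leave the pre-training stage untouched and instead redesign fine-tuning, integrating word segmentation information into the fine-tuning process to improve performance.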

3. Design

For the fine-tuning stage, the paper proposes Word-aligned Attention.

Word-aligned Attention

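3.1 Character-level Pre-trained Encoder. The paper adopts BERT and its updated variants (ERNIE, BERT-wwm) as the basic encoder, and the outputs of the encoder's last layer are treated as the character-level enriched contextual representations H. In other words, the existing pre-trained models are used as-is, with no structural changes.
3.2 Word-aligned Attention.
- H from 3.1 is not fed directly to the downstream task layers for fine-tuning. Instead, one more self-attention pass is run over it to obtain the attention score matrix A_c. This corresponds to the F operation in Figure 1.
- Next, a word segmentation tool segments the input text; the result is called the partition π. Applying π to the ordinary attention weight matrix A_c groups the character-level attention weights by word (word-based). This corresponds to the Tokenizer, Gain Partition, and Apply Partition operations in Figure 1.
- For each word-based group of weights, two things should be considered at once: (1) the semantic representation of all words in the sentence, and (2) the semantic representation of the most important word in the sentence. Mix-pooling therefore blends mean-pooling and max-pooling:
  
  MixPooling = λ · MeanPooling + (1 − λ) · MaxPooling
  
  This corresponds to the f MixPooling operation in Figure 1 and yields the aligned attention matrix Â_c. This is where the character-level weights are aligned to the word level.
- V = H, and Ĥ, obtained by applying Â_c to V, is the representation actually fed to the downstream task layers. It is called the enhanced character representation.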
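The steps of 3.2 can be sketched in plain Python on a toy input. This is a minimal illustration under stated assumptions, not the paper's implementation: the learned query/key projections and any output projection are omitted, the partition is assumed to group the rows of A_c (one row per character), and the pooled row is copied back to every character of the word. The function names and the `word_lengths` encoding of the partition are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(H):
    """A_c = softmax(H H^T / sqrt(d)). Assumption: no learned projections."""
    d = len(H[0])
    return [softmax([sum(a * b for a, b in zip(hi, hj)) / math.sqrt(d) for hj in H])
            for hi in H]

def mix_pooling(rows, lam):
    """lam * MeanPooling + (1 - lam) * MaxPooling, element-wise over a group of rows."""
    n = len(rows)
    return [lam * sum(col) / n + (1 - lam) * max(col) for col in zip(*rows)]

def word_aligned_attention(A_c, word_lengths, lam=0.5):
    """Pool the rows of A_c inside each word, then copy the pooled row
    back to every character of that word (assumed alignment scheme)."""
    assert sum(word_lengths) == len(A_c)
    aligned, start = [], 0
    for length in word_lengths:
        pooled = mix_pooling(A_c[start:start + length], lam)
        aligned.extend(list(pooled) for _ in range(length))
        start += length
    return aligned

def enhance(H, word_lengths, lam=0.5):
    """H_hat = A_hat_c V with V = H."""
    A_hat = word_aligned_attention(attention_scores(H), word_lengths, lam)
    d = len(H[0])
    return [[sum(w * H[j][k] for j, w in enumerate(row)) for k in range(d)]
            for row in A_hat]

# Toy input: 4 characters with 2-dim representations,
# segmented into words of 2, 1 and 1 characters.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
H_hat = enhance(H, [2, 1, 1], lam=0.5)
```

In this sketch the two characters of the first word end up with the same attention row and therefore the same enhanced vector, while single-character words keep their original attention row.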
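3.3 Multi-head Word-aligned Attention. Following the idea of multi-head attention, K different Â_c can be computed, giving K different Ĥ. These are concatenated and projected down in dimension to form the enhanced character representation of multi-head Word-aligned Attention, which is fed to the downstream task layers for fine-tuning.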
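The concatenate-then-reduce step can be illustrated with a small sketch. It assumes each head has already produced an enhanced representation of shape (n, d) and that the dimension reduction is a single linear map W of shape (K·d, d) with no bias; `concat_heads`, `project`, and the averaging matrix W below are hypothetical choices made only for the example.

```python
def concat_heads(heads):
    """Concatenate K matrices of shape (n, d) along the feature axis -> (n, K*d)."""
    n = len(heads[0])
    return [[x for head in heads for x in head[i]] for i in range(n)]

def project(X, W):
    """Plain matrix product: X of shape (n, p) times W of shape (p, q)."""
    return [[sum(x * W[k][j] for k, x in enumerate(row)) for j in range(len(W[0]))]
            for row in X]

# Two heads (K = 2), two characters (n = 2), two dimensions (d = 2).
heads = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 6.0], [5.0, 8.0]],
]
# Hypothetical W that simply averages the two heads; a trained W would be learned.
W = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
H_multi = project(concat_heads(heads), W)
```

With this particular W the output is the element-wise mean of the two heads; a learned W can weight the heads and dimensions differently.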
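3.4 Multi-source Word-aligned Attention. There are many Chinese word segmenters, and their outputs differ. With multiple segmenters, the Ĥ obtained from each segmentation is passed through a linear transformation and a tanh activation, and the results are summed. The tanh activation mainly adds non-linearity and bounds every dimension of each term to (−1, 1); summing the Ĥ directly could otherwise produce overly large values.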
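A minimal sketch of the fusion step follows. It assumes one weight matrix W shared by all segmenters and no bias term, which the notes do not specify; `fuse_sources` is a hypothetical name. The example also shows the bounding effect: with M segmenters, every fused value stays within [−M, M] no matter how large the inputs are.

```python
import math

def fuse_sources(H_hats, W):
    """Sum over segmenters of tanh(H_hat_m W), element-wise.
    Assumption: a single shared W of shape (d, d) and no bias."""
    n, d_out = len(H_hats[0]), len(W[0])
    fused = [[0.0] * d_out for _ in range(n)]
    for H_hat in H_hats:
        for i, row in enumerate(H_hat):
            for j in range(d_out):
                z = sum(x * W[k][j] for k, x in enumerate(row))
                fused[i][j] += math.tanh(z)
    return fused

# Two segmenters (M = 2), one character, two dimensions; identity W for readability.
H_hats = [
    [[100.0, -100.0]],
    [[50.0, 0.0]],
]
W = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse_sources(H_hats, W)
```

Even though the raw inputs reach 100 in magnitude, each fused entry here has magnitude at most 2, because every tanh term is bounded by 1.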