Paper Reading: "tBERT: Topic Models and BER

Author: 掉了西红柿皮_Kee | Published 2021-09-05 15:53


Peinelt N, Nguyen D, Liakata M. tBERT: Topic models and BERT joining forces for semantic similarity detection[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 7047-7055.

Abstract skim

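Semantic similarity detection is a fundamental task in natural language understanding. Adding topic information has proved useful for earlier feature-engineered semantic similarity models as well as for neural models on other tasks. There is currently no standard way of combining topics with pretrained contextual representations such as BERT. This paper proposes a novel topic-informed BERT-based architecture for pairwise semantic similarity detection, and experiments show that the model improves over strong neural baselines on a variety of English datasets. The authors find that adding topics to BERT helps particularly with resolving domain-specific cases.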

When a proposed method does not cover the whole research area well, a statement like the following can be used: We find that the addition of < topics to BERT > helps particularly with resolving domain-specific cases.

We, therefore, introduce a novel architecture for semantic similarity detection which incorporates topic models and BERT. More specifically, we make the following contributions: We propose tBERT — a simple architecture combining topics with BERT for semantic similarity prediction (section 3). We show in our error analysis that tBERT’s gains are prominent on domain-specific cases, such as those encountered in < CQA > (section 5).

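tBERT (topic-informed BERT-based model) architecture

Overall structure
The paper investigates whether topic models can further improve BERT's performance on semantic similarity detection. The overall model structure is shown below: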

Architecture of tBERT with word topics.

about BERT

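For a sentence pair to be checked, where $S_1$ has length N and $S_2$ has length M, the two sentences are fed to $BERT_{BASE}$ as text_a and text_b. The final-layer output of the CLS token is taken as the representation of the sentence pair, formalized as:

$$C = \text{BERT}(S_1, S_2) \in \mathbb{R}^d$$

For $BERT_{BASE}$, $d=768$ is the dimension of the internal hidden layer.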
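As a rough illustration of how text_a and text_b end up in a single BERT input, the sketch below builds the `[CLS] S1 [SEP] S2 [SEP]` token sequence together with the segment ids that mark which sentence each position belongs to. The function name `pack_pair` and the whitespace tokenizer are assumptions made for this example; real BERT uses WordPiece tokenization.

```python
def pack_pair(text_a, text_b):
    """Build a BERT-style pair input: tokens and segment (token type) ids."""
    # Assumption: whitespace tokenization stands in for WordPiece.
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], S1 and the first [SEP];
    # segment 1 covers S2 and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair("How do I reset my router", "Router reset steps")
print(tokens)
print(segment_ids)
```

The final-layer hidden vector at position 0 (the `[CLS]` slot) is what the text above takes as the sentence-pair representation.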

about Topic Model

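The experiments use two popular topic models, LDA and GSDMM.
The paper describes two strategies for using a topic model:
(1) Document (sentence-level) topic representation:
For document topics $D_1$ and $D_2$, all tokens of a sentence are passed to the topic model to infer one topic distribution per sentence.

$$D_1 = \text{TopicModel}([T_1, \dots, T_N]) \in \mathbb{R}^t$$

$$D_2 = \text{TopicModel}([T'_1, \dots, T'_M]) \in \mathbb{R}^t$$

Here, t is the number of topics.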
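(2) Word topic representation:
For word topics $W_1$ and $W_2$, a topic distribution $w_i$ is inferred for each token $T_i$ in the sentence (compare this with the topic-word distribution matrix in LDA, in which every word has a corresponding entry):

$$w_i = \text{TopicModel}(T_i) \in \mathbb{R}^t$$

The per-token distributions are then averaged to obtain a fixed-length topic representation at the sentence level:

$$W_1 = \frac{\sum_{i=1}^{N} w_i}{N} \in \mathbb{R}^t, \quad W_2 = \frac{\sum_{i=1}^{M} w'_i}{M} \in \mathbb{R}^t$$

This is the same operation as mean pooling over BERT token representations.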
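The averaging step for word topics can be sketched in a few lines. This is a minimal illustration, not the authors' code: the lookup table `word_topics`, the function name `average_word_topics`, and the uniform fallback for unseen tokens are all assumptions; in practice the per-token distributions would come from a trained LDA or GSDMM model.

```python
def average_word_topics(tokens, word_topics, num_topics):
    """Average per-token topic distributions into one sentence-level vector."""
    # Assumption: tokens missing from the table get a uniform distribution.
    uniform = [1.0 / num_topics] * num_topics
    rows = [word_topics.get(tok, uniform) for tok in tokens]
    if not rows:
        # An empty sentence carries no topic evidence.
        return uniform
    n = len(rows)
    return [sum(row[k] for row in rows) / n for k in range(num_topics)]


# Hypothetical word-topic table with t = 2 topics.
word_topics = {
    "router": [0.9, 0.1],
    "reset": [0.7, 0.3],
    "recipe": [0.1, 0.9],
}
W1 = average_word_topics(["reset", "router"], word_topics, num_topics=2)
print(W1)
```

Because each row sums to one, the averaged vector is again a distribution over the t topics, and its length does not depend on the sentence length.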

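This gives two ways of combining < the sentence-pair vector with the sentence-level topic representations >:

(1) for document topics

$$F_D = [C; D_1; D_2] \in \mathbb{R}^{d+2t}$$

(2) for word topics

$$F_W = [C; W_1; W_2] \in \mathbb{R}^{d+2t}$$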

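The concatenated representation is passed through a hidden layer that adjusts the relevant weights, followed by a softmax layer for classification. The loss function is simply the cross-entropy loss.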
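The classification head described here can be sketched as a forward pass in plain Python. Everything concrete below is an assumption for illustration: the toy sizes (d = 4 rather than 768), the tanh activation, the random weights, and the names `classify` and `params`; the note only states that there is one hidden layer, a softmax layer, and a cross-entropy loss.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Compute weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def classify(C, T1, T2, params):
    """Concatenate [C; T1; T2], apply one hidden layer, then softmax."""
    F = C + T1 + T2  # list concatenation, length d + 2t
    # Assumption: tanh is used as the hidden activation.
    hidden = [math.tanh(h) for h in linear(F, params["W_h"], params["b_h"])]
    return softmax(linear(hidden, params["W_o"], params["b_o"]))


def cross_entropy(probs, label):
    """Negative log-likelihood of the gold label."""
    return -math.log(probs[label])


# Toy sizes (assumptions): d = 4, t = 2 topics, 3 hidden units, 2 classes.
rng = random.Random(0)
d, t, h, classes = 4, 2, 3, 2
params = {
    "W_h": [[rng.uniform(-0.5, 0.5) for _ in range(d + 2 * t)] for _ in range(h)],
    "b_h": [0.0] * h,
    "W_o": [[rng.uniform(-0.5, 0.5) for _ in range(h)] for _ in range(classes)],
    "b_o": [0.0] * classes,
}
C = [0.1, -0.2, 0.3, 0.05]       # stand-in for the BERT [CLS] vector
W1, W2 = [0.8, 0.2], [0.6, 0.4]  # stand-ins for the two topic vectors
probs = classify(C, W1, W2, params)
print(probs, cross_entropy(probs, label=1))
```

The same head works for document topics or word topics, since both variants produce two length-t vectors that are concatenated with the pair representation.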

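Choice of topic model
The number of topics and the alpha value are important topic model hyperparameters, and both depend on the dataset. To handle texts of different lengths, the paper uses LDA (the most popular and widely used topic model, though not well suited to short texts) and GSDMM, a topic model designed for short texts.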

about conclusion
In this work, we proposed a flexible framework for combining topic models with BERT. We demonstrated that adding LDA topics to BERT consistently improved performance across a range of semantic similarity prediction datasets. In our qualitative analysis, we showed that these improvements were mainly achieved on examples involving domain-specific words. Future work may focus on how to directly induce topic information into BERT without corrupting pretrained information and whether combining topics with other pretrained contextual models can lead to similar gains.

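The paper is clearly written and presents a simple way of combining topic models with a pretrained language model. Judging from the experiments, the proposed tBERT performs well. However, the topics and the text representation are fused outside the two models, so the model is not yet well unified. The authors address this in the conclusion, suggesting that topic information could be induced directly into BERT to guide semantic similarity detection, which is well worth looking forward to.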