Paper Reading: “SimCTC: A Simple Contrast L

Author: 掉了西红柿皮_Kee | Published: 2022-10-17 19:02

Li, Chen, et al. "SimCTC: A Simple Contrast Learning Method of Text Clustering (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. No. 11. 2022.

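Abstract Overview

This paper proposes a simple contrastive learning method (SimCTC) that substantially improves on state-of-the-art text clustering models. In SimCTC, a pre-trained BERT model first maps the input sequences into a representation space, which is then trained with three different loss functions, one from each of three heads: a clustering head, an instance-contrastive learning head, and a cluster-contrastive learning head.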

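Model Overview

The model consists of four main components:

A BERT model, which encodes each input sentence into an embedding representation;
The Bertbase and Roberta augmenters from the nlpaug library are used to generate two augmented sentences, $x^{aug1}$ and $x^{aug2}$, from the original input $x$.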

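# Core code snippet
import nlpaug.augmenter.word as naw

# Illustrative values only (assumptions): the original leaves these constants undefined.
ACTION = "substitute"  # nlpaug also supports "insert"
TOP_K = 15             # number of candidate tokens to sample from
AUG_P = 0.2            # fraction of words to augment

aug_bert = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased',
    action=ACTION,
    top_k=TOP_K,
    aug_p=AUG_P,
)
text = "Come into town with me today to buy food!"
augmented_text = aug_bert.augment(text, n=3)  # n: number of augmented outputs
print(augmented_text)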

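Next, $x$, $x^{aug1}$ and $x^{aug2}$ are fed into a BERT-like model $M$ to obtain the corresponding sentence representations $e$, $e^{aug1}$ and $e^{aug2}$. The text clustering result is then obtained by jointly optimizing the three losses described in the following components.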
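To make the encoding step concrete, here is a minimal sketch of turning token-level encoder outputs into one sentence vector per input. The note does not say how $M$ pools tokens into $e$, so the masked mean pooling below is an assumption, and `token_states` is a hypothetical random stand-in for the real output of $M$.

```python
import numpy as np

def mean_pool(token_states, attention_mask):
    """Average token vectors over non-padding positions.

    token_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(token_states.dtype)
    summed = (token_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Hypothetical stand-in for the token-level output of M on a batch of 2 sentences.
rng = np.random.default_rng(0)
token_states = rng.normal(size=(2, 4, 8))
attention_mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])

e = mean_pool(token_states, attention_mask)
print(e.shape)  # (2, 8)
```

The same function would be applied to the outputs for $x$, $x^{aug1}$ and $x^{aug2}$, since all three views share the encoder $M$.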

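The clustering head, which tries to group sentence representations of the same semantic category together;
Here the KL-divergence objective from the deep clustering model DEC is used to compute cluster assignments for the original text representation $e$.
In this objective, $q_{jk}$ is the probability that $e_j$ is assigned to the $k$-th cluster $\mu_k$, and $p_{jk}$ is the target distribution corresponding to $q_{jk}$.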
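For reference, the standard DEC formulation uses a Student's t kernel for the soft assignment, $q_{jk} = \frac{(1+\|e_j-\mu_k\|_2^2/\alpha)^{-\frac{\alpha+1}{2}}}{\sum_{k'}(1+\|e_j-\mu_{k'}\|_2^2/\alpha)^{-\frac{\alpha+1}{2}}}$, a sharpened target $p_{jk} = \frac{q_{jk}^2/f_k}{\sum_{k'} q_{jk'}^2/f_{k'}}$ with $f_k=\sum_j q_{jk}$, and the loss $\mathrm{KL}(P\,\|\,Q)=\sum_j\sum_k p_{jk}\log\frac{p_{jk}}{q_{jk}}$. The sketch below implements these in NumPy. The value $\alpha=1$, the batch averaging, and the random inputs are assumptions for illustration, and SimCTC's exact notation may differ.

```python
import numpy as np

def soft_assign(e, mu, alpha=1.0):
    """Student's t soft assignment q[j, k] of embedding e[j] to centroid mu[k]."""
    sq_dist = ((e[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target p[j, k], proportional to q[j, k]^2 / f[k] with f[k] = sum_j q[j, k]."""
    weight = q ** 2 / q.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)

def clustering_loss(p, q, eps=1e-12):
    """KL(P || Q), summed over clusters and averaged over the batch (DEC writes a plain sum)."""
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum(axis=1).mean())

# Toy data: 6 sentence embeddings of dimension 4, and 3 cluster centroids.
rng = np.random.default_rng(0)
e = rng.normal(size=(6, 4))
mu = rng.normal(size=(3, 4))

q = soft_assign(e, mu)
p = target_distribution(q)
loss = clustering_loss(p, q)
print(q.shape, loss)
```

In training, $p$ is treated as a fixed target while the gradient flows through $q$ into the encoder and the centroids.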
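The instance contrastive learning head, which applies contrastive learning at the instance level;
This module operates on $e^{aug1}$ and $e^{aug2}$. With $M$ samples per batch, a given representation $e_i^{aug1}$ can form $2M-1$ pairs with the other representations: the positive pair is $\{e_i^{aug1}, e_i^{aug2}\}$, and the remaining $2M-2$ pairs are negatives. An instance-level projection $g_I(\cdot)$ maps each representation to $\tilde{y}_i^{aug1}=g_I(e_i^{aug1}) \in \mathcal{R}^{1 \times H}, i \in [1, \cdots, M]$, and the contrastive loss for a given $e_i^{aug1}$ is defined over these pairs.
In that loss, $s(\cdot)$ is the cosine similarity and $\tau$ is the temperature parameter.
The module's overall loss then aggregates these per-sample losses.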
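The description above (one positive and $2M-2$ negatives per sample, cosine similarity, temperature $\tau$) matches the widely used NT-Xent form, $\ell_i^{aug1} = -\log\frac{\exp(s(\tilde{y}_i^{aug1},\tilde{y}_i^{aug2})/\tau)}{\sum_{z \neq \tilde{y}_i^{aug1}}\exp(s(\tilde{y}_i^{aug1},z)/\tau)}$, where $z$ ranges over the other $2M-1$ projected representations. The sketch below is written under that assumption; averaging over all $2M$ samples and the value of `tau` are also assumptions, not details confirmed by the note.

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """Average NT-Xent loss over the 2M rows of two views z1, z2 (each M x H)."""
    m = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit vectors: dot product = cosine
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                    # exclude each sample's pair with itself
    pos = np.concatenate([np.arange(m, 2 * m), np.arange(m)])  # row index of each positive
    row_max = sim.max(axis=1, keepdims=True)          # for a numerically stable log-sum-exp
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float((log_denominator - sim[np.arange(2 * m), pos]).mean())

# Toy check: the second view is either a slightly perturbed copy or a mismatched shuffle.
rng = np.random.default_rng(0)
y1 = rng.normal(size=(4, 16))
loss_aligned = nt_xent(y1, y1 + 0.01 * rng.normal(size=y1.shape), tau=0.1)
loss_shuffled = nt_xent(y1, np.roll(y1, 1, axis=0), tau=0.1)
print(loss_aligned, loss_shuffled)
```

When the two views of each sentence agree, the loss is small; pairing each sentence with the wrong partner drives it up, which is the signal the head trains on.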
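The cluster contrastive learning head, which applies contrastive learning at the cluster level;
Like the instance contrastive learning head, this module has its own projection head $g_c(\cdot)$ that transforms $e^{aug1}$ and $e^{aug2}$, for example ${y_i}^{aug1}=g_c(e_i^{aug1}) \in \mathcal{R}^{1 \times C}, i \in [1, \cdots, M]$, where $C$ is the true number of clusters. The remaining steps are the same as in the instance contrastive learning head.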
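One common way to realize cluster-level contrast, used in the Contrastive Clustering line of work, is to stack the $M$ outputs of $g_c$ into an $M \times C$ matrix and contrast its columns, so that each column acts as the representation of one cluster. The note does not spell this out, so the column-wise reading, the softmax inside the hypothetical linear `g_c`, and `tau=1.0` below are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

def contrast(a, b, tau=1.0):
    """NT-Xent over the rows of a and b; row i of a is paired with row i of b."""
    n = a.shape[0]
    z = np.concatenate([a, b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)  # exclude self-pairs
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_denominator = np.log(np.exp(sim).sum(axis=1))
    return float((log_denominator - sim[np.arange(2 * n), pos]).mean())

rng = np.random.default_rng(0)
M, H, C = 8, 16, 3
W_c = rng.normal(size=(H, C))                     # hypothetical linear g_c
e_aug1 = rng.normal(size=(M, H))                  # toy stand-ins for encoder outputs
e_aug2 = e_aug1 + 0.05 * rng.normal(size=(M, H))

y_aug1 = softmax(e_aug1 @ W_c)                    # (M, C): one soft assignment per sample
y_aug2 = softmax(e_aug2 @ W_c)
cluster_loss = contrast(y_aug1.T, y_aug2.T)       # transpose: rows are now the C clusters
print(y_aug1.shape, cluster_loss)
```

Under this reading, the positive pair for cluster $k$ is its column in the two augmented views, and the other $2C-2$ columns serve as negatives.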