Motivation
Better representations of words have come at the cost of huge memory footprints, which has made deploying NLP models on edge-devices challenging due to memory limitations.
Goal: reduce the number of parameters in the embedding layer to ease the memory constraints.
Abstract
Distilled Embedding is an (input/output) embedding compression method based on low-rank matrix decomposition with an added non-linearity. First, the weights of the decomposition are initialized by learning to reconstruct the full word embedding, and then fine-tuned on the downstream task while employing knowledge distillation on the factorized embedding.
Distilled Embedding: a low-rank matrix factorization with an added non-linear layer. The factors are first trained to reconstruct the original embedding, then fine-tuned on the downstream task.
Methodology
Funneling Decomposition and Embedding Distillation
Conventional SVD dimensionality reduction: $E = U \Sigma V^{\top}$, with $E \in \mathbb{R}^{|V| \times d}$,
where $|V|$ is the vocabulary size and $d$ the embedding dimension, respectively.
After reduction to rank $r$, only two factors are stored: $U_r \Sigma_r \in \mathbb{R}^{|V| \times r}$ and $V_r^{\top} \in \mathbb{R}^{r \times d}$.
Parameter count drops from $|V| \cdot d$ to $r(|V| + d)$.
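As a concrete illustration of the truncation and the parameter count above, here is a minimal NumPy sketch (the vocabulary size, embedding dimension, and rank are illustrative, not the paper's settings):

```python
import numpy as np

V, d, r = 10000, 300, 64            # illustrative sizes: |V|, d, rank r
E = np.random.randn(V, d)           # stand-in for a trained embedding matrix

# Rank-r SVD truncation: keep only the top-r singular directions.
U, S, Vt = np.linalg.svd(E, full_matrices=False)
A = U[:, :r] * S[:r]                # |V| x r factor (U_r * Sigma_r)
B = Vt[:r, :]                       # r x d factor (V_r^T)
E_approx = A @ B                    # rank-r approximation of E

print("original params:  ", V * d)          # |V| * d
print("factorized params:", r * (V + d))    # r * (|V| + d)
```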
The method proposed in this paper adds a non-linear layer (ReLU) to this two-factor decomposition; a sketch follows below.
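A minimal PyTorch sketch of such a factorized embedding layer. Where exactly the ReLU sits (here: between the lookup factor and the projection factor) is my assumption based on the description above, not a verbatim copy of the paper's funneling decomposition:

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Approximates a |V| x d embedding table as ReLU(A) @ B (the placement of
    the non-linearity is assumed; see the note above)."""

    def __init__(self, vocab_size: int, emb_dim: int, rank: int):
        super().__init__()
        self.A = nn.Embedding(vocab_size, rank)         # |V| x r lookup factor
        self.B = nn.Linear(rank, emb_dim, bias=False)   # r x d projection factor

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Low-rank lookup, added non-linearity, then projection back to d dims.
        return self.B(torch.relu(self.A(token_ids)))

emb = FactorizedEmbedding(vocab_size=30000, emb_dim=512, rank=64)
vectors = emb(torch.tensor([[1, 5, 42]]))   # shape: (1, 3, 512)
```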
Two losses (I gave up typing the formulas by hand, orz):
- Cross-entropy loss of the seq2seq model. The paper's note on this: "the output embedding is the transpose of the input embedding matrix" (i.e., weight tying).
- Reconstruction loss: the MSE between each reconstructed word vector and the original one.
The final objective is a weighted average of the two, i.e. something like $\mathcal{L} = \lambda\,\mathcal{L}_{\text{CE}} + (1-\lambda)\,\mathcal{L}_{\text{recon}}$.
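A hedged sketch of the combined objective. The single scalar weight `lmbda`, the function names, and the exact way the tied output projection is computed are assumptions filled in from the description above, not the paper's formulation:

```python
import torch
import torch.nn.functional as F

def tied_logits(decoder_hidden: torch.Tensor, emb_matrix: torch.Tensor) -> torch.Tensor:
    # Weight tying: the output embedding is the transpose of the input
    # embedding matrix, so logits are a product with emb_matrix^T.
    return decoder_hidden @ emb_matrix.t()

def combined_loss(decoder_hidden, targets, reconstructed_emb, original_emb, lmbda=0.5):
    """Weighted average of the seq2seq cross-entropy and the embedding
    reconstruction MSE. lmbda = 0.5 is illustrative, not the paper's value."""
    logits = tied_logits(decoder_hidden, reconstructed_emb)        # (B, T, |V|)
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    mse = F.mse_loss(reconstructed_emb, original_emb)              # per-vector MSE, averaged
    return lmbda * ce + (1.0 - lmbda) * mse
```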
Related Work
To be added.
Experimental Results
To be added.
Comments
Isn't this reconstruction loss essentially just an autoencoder?
What desirable mathematical properties does SVD have?