
[ICLR2020]Distilled embedding no

Author: Ber66666 | Published 2020-05-29 00:32

Motivation

Better representations of words have come at the cost of huge memory footprints, which has made deploying NLP models on edge-devices challenging due to memory limitations.

Goal: reduce the number of parameters in the embedding layer to ease the memory constraint.

Abstract

Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition with an added non-linearity. First, we initialize the weights of our decomposition by learning to reconstruct the full word-embedding and then fine-tune on the downstream task employing knowledge distillation on the factorized embedding.

Distilled Embedding: low-rank matrix factorization with an added non-linearity. The factorization is first trained to reconstruct the original embedding weights, then fine-tuned on the downstream task (with knowledge distillation on the factorized embedding).

Methodology

Funneling Decomposition and Embedding Distillation

The classical SVD-based dimensionality reduction:
E=U_{|V|\times|V|}\Sigma_{|V|\times d}V_{d\times d}^T
where |V| and d are the vocabulary size and the embedding dimension, respectively.

After truncating to rank r:
\tilde{E}=U_{|V|\times r}\Sigma_{r\times r}V_{r\times d}^T=\hat{U}_{|V|\times r}V_{r\times d}^T
The parameter count drops:
|V|\times d\rightarrow r\times (|V|+d)
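A minimal numpy sketch of this truncated-SVD view of the embedding matrix (the sizes |V|=30000, d=512, r=64 are illustrative, not the paper's settings):

```python
import numpy as np

V, d, r = 30000, 512, 64            # vocab size, embedding dim, bottleneck rank (illustrative)
E = np.random.randn(V, d)           # stand-in for a trained embedding matrix

# Truncated SVD: keep only the top-r singular directions.
U, S, Vt = np.linalg.svd(E, full_matrices=False)   # U: (V, d), S: (d,), Vt: (d, d)
U_hat = U[:, :r] * S[:r]            # absorb Sigma into U -> (V, r)
V_r = Vt[:r, :]                     # (r, d)
E_tilde = U_hat @ V_r               # rank-r approximation of E

# Parameter count: |V| * d  ->  r * (|V| + d)
print(V * d, "->", r * (V + d))     # 15360000 -> 1952768
```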
The method proposed in the paper adds a non-linearity f (ReLU):
\tilde{E}=f(\hat{U}_{|V|\times r})V_{r\times d}^T
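A possible PyTorch sketch of this funneling embedding, i.e. f(U)V^T with f = ReLU (my own reading of the formula; names and details are not from the paper's code):

```python
import torch
import torch.nn as nn

class FunnelEmbedding(nn.Module):
    """Factorized embedding: rows of f(U_hat) V^T, with f = ReLU."""
    def __init__(self, vocab_size, emb_dim, rank):
        super().__init__()
        self.U = nn.Embedding(vocab_size, rank)           # the |V| x r factor
        self.proj = nn.Linear(rank, emb_dim, bias=False)  # the r x d factor (V^T)

    def forward(self, token_ids):
        # Look up the r-dim rows, apply the non-linearity, project up to d dims.
        return self.proj(torch.relu(self.U(token_ids)))

emb = FunnelEmbedding(vocab_size=30000, emb_dim=512, rank=64)
vectors = emb(torch.tensor([[1, 2, 3]]))                  # shape (1, 3, 512)
```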
Two loss terms (I gave up typing out the formulas by hand):

  1. The seq2seq cross-entropy loss

    The paper's note on the seq2seq setup: the output embedding is the transpose of the input embedding matrix (weight tying; see the sketch after this list).

  2. The reconstruction loss

    The MSE between each reconstructed word vector and the original one.

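For the weight-tying remark in the first item, a generic sketch of "output embedding = transpose of the input embedding" (illustrative only; not lifted from the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d = 1000, 64
embedding = nn.Embedding(vocab_size, d)      # input embedding (here it would be the factorized one)

hidden = torch.randn(8, d)                   # stand-in decoder states, one per target position
logits = hidden @ embedding.weight.T         # output projection reuses the input embedding, transposed
targets = torch.randint(0, vocab_size, (8,))
loss = F.cross_entropy(logits, targets)      # the seq2seq cross-entropy term
```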

Finally, the two losses are combined with a weighted average:

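The post only shows these formulas as images; a plausible reconstruction from the surrounding description (the symbols \lambda, L_{CE}, L_{recon} are my notation, not necessarily the paper's):
L_{recon}=\frac{1}{|V|}\sum_{i=1}^{|V|}\|\tilde{E}_i-E_i\|_2^2
L=\lambda L_{CE}+(1-\lambda)L_{recon}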

Related Work

To be added.

Experimental Results

To be added.

Comments

Isn't this reconstruction loss just an autoencoder?
What mathematically nice properties does SVD bring here?
