Motivation
Better representations of words have come at the cost of huge memory footprints, which has made deploying NLP models on edge-devices challenging due to memory limitations.
Goal: reduce the number of parameters in the embedding layer to ease the memory constraints.
Abstract
Distilled Embedding is an (input/output) embedding compression method based on low-rank matrix decomposition with an added non-linearity. First, the weights of the decomposition are initialized by learning to reconstruct the full word embedding, and then fine-tuned on the downstream task while employing knowledge distillation on the factorized embedding.
Distilled Embedding: a low-rank matrix factorization with an added non-linear layer. The factors are first trained to reconstruct the original embedding, then fine-tuned on the downstream task.
Methodology
Funneling Decomposition and Embedding Distillation
Conventional SVD dimensionality reduction: $E = U \Sigma V^{\top}$, with $E \in \mathbb{R}^{|V| \times d}$,
where $|V|$ is the vocabulary size and $d$ the embedding dimension, respectively.
After reduction to rank $r$, only two factors are stored: $U_r \Sigma_r \in \mathbb{R}^{|V| \times r}$ and $V_r^{\top} \in \mathbb{R}^{r \times d}$.
Parameter count drops from $|V| \cdot d$ to $r(|V| + d)$.
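As a concrete illustration of the truncation and the parameter count above, here is a minimal NumPy sketch (the vocabulary size, embedding dimension, and rank are illustrative, not the paper's settings):

```python
import numpy as np

V, d, r = 10000, 300, 64            # illustrative sizes: |V|, d, rank r
E = np.random.randn(V, d)           # stand-in for a trained embedding matrix

# Rank-r SVD truncation: keep only the top-r singular directions.
U, S, Vt = np.linalg.svd(E, full_matrices=False)
A = U[:, :r] * S[:r]                # |V| x r factor (U_r * Sigma_r)
B = Vt[:r, :]                       # r x d factor (V_r^T)
E_approx = A @ B                    # rank-r approximation of E

print("original params:  ", V * d)          # |V| * d
print("factorized params:", r * (V + d))    # r * (|V| + d)
```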
The method proposed in this paper adds a non-linear layer (ReLU) to this two-factor decomposition; a sketch follows below.
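A minimal PyTorch sketch of such a factorized embedding layer. Where exactly the ReLU sits (here: between the lookup factor and the projection factor) is my assumption based on the description above, not a verbatim copy of the paper's funneling decomposition:

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Approximates a |V| x d embedding table as ReLU(A) @ B (the placement of
    the non-linearity is assumed; see the note above)."""

    def __init__(self, vocab_size: int, emb_dim: int, rank: int):
        super().__init__()
        self.A = nn.Embedding(vocab_size, rank)         # |V| x r lookup factor
        self.B = nn.Linear(rank, emb_dim, bias=False)   # r x d projection factor

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Low-rank lookup, added non-linearity, then projection back to d dims.
        return self.B(torch.relu(self.A(token_ids)))

emb = FactorizedEmbedding(vocab_size=30000, emb_dim=512, rank=64)
vectors = emb(torch.tensor([[1, 5, 42]]))   # shape: (1, 3, 512)
```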
Two losses (I gave up typing the formulas by hand, orz):
- Cross-entropy loss of the seq2seq model. The paper's note on this: "the output embedding is the transpose of the input embedding matrix" (i.e., weight tying).
- Reconstruction loss: the MSE between each reconstructed word vector and the original one.
The final objective is a weighted average of the two, i.e. something like $\mathcal{L} = \lambda\,\mathcal{L}_{\text{CE}} + (1-\lambda)\,\mathcal{L}_{\text{recon}}$.
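A hedged sketch of the combined objective. The single scalar weight `lmbda`, the function names, and the exact way the tied output projection is computed are assumptions filled in from the description above, not the paper's formulation:

```python
import torch
import torch.nn.functional as F

def tied_logits(decoder_hidden: torch.Tensor, emb_matrix: torch.Tensor) -> torch.Tensor:
    # Weight tying: the output embedding is the transpose of the input
    # embedding matrix, so logits are a product with emb_matrix^T.
    return decoder_hidden @ emb_matrix.t()

def combined_loss(decoder_hidden, targets, reconstructed_emb, original_emb, lmbda=0.5):
    """Weighted average of the seq2seq cross-entropy and the embedding
    reconstruction MSE. lmbda = 0.5 is illustrative, not the paper's value."""
    logits = tied_logits(decoder_hidden, reconstructed_emb)        # (B, T, |V|)
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    mse = F.mse_loss(reconstructed_emb, original_emb)              # per-vector MSE, averaged
    return lmbda * ce + (1.0 - lmbda) * mse
```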
Related Work
To be added.
Experimental Results
To be added.
Comments
Isn't this reconstruction loss essentially just an autoencoder?
What desirable mathematical properties does SVD have?