[ICLR2020]Distilled embedding no

Author: Ber66666 | Published 2020-05-29 00:32

    Motivation

    Better representations of words have come at the cost of huge memory footprints, which has made deploying NLP models on edge-devices challenging due to memory limitations.

    Reduce the number of parameters in the embedding layer to ease the memory constraints of edge devices.

    Abstract

    Distilled Embedding, an (input/output) embedding compression method based on low-rank matrix decomposition with an added non-linearity. First, we initialize the weights of our decomposition by learning to reconstruct the full word-embedding and then fine-tune on the downstream task employing knowledge distillation on the factorized embedding.

    Distilled Embedding is a low-rank matrix factorization of the (input/output) embedding with an added non-linearity. The factorization is first trained to reconstruct the full embedding matrix, and is then fine-tuned on the downstream task with knowledge distillation.

    Methodology

    Funneling Decomposition and Embedding Distillation

    Conventional SVD factorization of the embedding matrix:
    E=U_{|V|\times|V|}\Sigma_{|V|\times d}V_{d\times d}^T
    where |V| and d are the vocabulary size and the embedding dimension, respectively.

    Truncating to the top r singular values:
    \tilde{E}=U_{|V|\times r}\Sigma_{r\times r}V_{r\times d}^T=\mathbb{U}_{|V|\times r}V_{r\times d}^T
    where \mathbb{U}=U\Sigma. The parameter count drops accordingly:
    |V|\times d \rightarrow r\times (|V|+d)
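    For a rough sense of the savings (the sizes below are illustrative assumptions, not numbers from the paper), take |V|=50000, d=1024, r=256:
    |V|\times d = 50000\times 1024 \approx 51.2\mathrm{M}
    r\times (|V|+d) = 256\times 51024 \approx 13.1\mathrm{M}
    i.e. roughly a 3.9\times reduction in embedding parameters.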
    The method proposed in the paper inserts a non-linearity f (ReLU) after the low-rank lookup:
    \tilde{E}=f(\mathbb{U}_{|V|\times r})V_{r\times d}^T
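    A minimal PyTorch sketch of this funneling decomposition, reconstructed from the formula above (the class name, layer choices, and sizes are my assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Sketch of E~ = f(U_{|V| x r}) V_{r x d}^T with f = ReLU."""

    def __init__(self, vocab_size: int, embed_dim: int, rank: int):
        super().__init__()
        self.U = nn.Embedding(vocab_size, rank)          # |V| x r low-rank lookup
        self.V = nn.Linear(rank, embed_dim, bias=False)  # projects r -> d
        self.act = nn.ReLU()

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # f(U[ids]) V^T: reconstruct d-dimensional embeddings on the fly
        return self.V(self.act(self.U(token_ids)))

# Shape check with illustrative sizes
emb = FactorizedEmbedding(vocab_size=50_000, embed_dim=1024, rank=256)
ids = torch.randint(0, 50_000, (2, 7))   # (batch, seq_len)
print(emb(ids).shape)                    # torch.Size([2, 7, 1024])
```

    Initializing U and V from the truncated-SVD factors (U\Sigma and V^T) would match the reconstruction-first training described in the abstract.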
    Two losses are used:

    1. Cross-entropy loss of the seq2seq task

      The paper's note on the seq2seq setup: the output embedding is the transpose of the input embedding matrix, i.e. the input and output embeddings are tied (see the sketch after the loss description below).

    2. Reconstruction loss

      The MSE between each reconstructed word vector and its original counterpart.


    Finally, the two losses are combined as a weighted average:

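    A hedged sketch of how the two losses could be combined in one training step, reusing the FactorizedEmbedding above (the tied output projection, the full-matrix reconstruction target E_full, and the weight alpha are illustrative assumptions, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def distilled_embedding_loss(hidden, target_ids, emb, E_full, alpha=0.5):
    """hidden: (batch, seq, d) decoder states; target_ids: (batch, seq);
    E_full: the original |V| x d embedding matrix used as reconstruction target."""
    vocab_size = E_full.size(0)

    # Rebuild the reconstructed embedding table E~ = f(U) V^T
    all_ids = torch.arange(vocab_size, device=E_full.device)
    E_tilde = emb(all_ids)                          # (|V|, d)

    # 1) seq2seq cross-entropy with the tied output embedding: logits = hidden @ E~^T
    logits = hidden @ E_tilde.T                     # (batch, seq, |V|)
    ce = F.cross_entropy(logits.reshape(-1, vocab_size), target_ids.reshape(-1))

    # 2) reconstruction loss: MSE between reconstructed and original word vectors
    rec = F.mse_loss(E_tilde, E_full)

    # Weighted combination of the two losses
    return alpha * ce + (1.0 - alpha) * rec
```

    In practice one would probably not rebuild the full |V| x d table at every step for a large vocabulary; the sketch only makes the two loss terms and their weighting explicit.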

    Related Work

    To be added.

    Experimental Results

    To be added.

    Comments

    Isn't this reconstruction loss essentially an autoencoder?
    What desirable mathematical properties does SVD bring here?
