word2vec/lstm on mxnet with NCE

Author: xlvector | Published 2016-07-18 18:27

    Softmax is the usual loss function for multi-class classification. But when there are very many classes, its efficiency becomes a problem. In word2vec, for example, every word is a class, so there can be on the order of a million classes, and computing, for every sample, a probability for each of a million classes is extremely slow.

    To solve this, word2vec introduced the hierarchical softmax (HS), built on Huffman coding. HS is still rather complex structurally, so the sampling-based NCE was proposed later (strictly speaking, NCE and negative sampling were introduced in two different papers and differ in form, but in my view they are essentially the same idea). Either HS or NCE can therefore serve as the loss layer of a many-class classification problem.
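
    In formulas (standard word2vec background, not from the original post): the full softmax has to normalize over all V classes for every sample, whereas negative sampling only scores the true class plus k sampled negatives, so the per-sample cost drops from O(V) to O(k+1):

        % full softmax: O(V) per sample
        P(w \mid c) = \frac{\exp(u_w^\top v_c)}{\sum_{w'=1}^{V} \exp(u_{w'}^\top v_c)}

        % negative-sampling objective (maximized): O(k+1) per sample
        \log \sigma(u_w^\top v_c) + \sum_{i=1}^{k} \log \sigma(-u_{w_i}^\top v_c),
        \quad w_i \sim P_n(w)

    Here v_c is the context vector, u_w the output vector of word w, and P_n the noise distribution the negatives are drawn from.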

    All of the code is at https://github.com/xlvector/learning-dl/tree/master/mxnet/nce-loss

    To get a feel for the speed difference between softmax and NCE, there are two examples, toy_softmax.py and toy_nce.py. They solve a synthetic multi-class problem whose samples are constructed as follows:

    def mock_sample(self):
        # Draw 3 distinct feature indices and build a 3-hot input vector.
        ret = np.zeros(self.feature_size)
        rn = set()
        while len(rn) < 3:
            rn.add(random.randint(0, self.feature_size - 1))
        # Fold the chosen indices into a base-feature_size number, so the
        # class label is a deterministic function of the active features.
        s = 0
        for k in rn:
            ret[k] = 1.0
            s *= self.feature_size
            s += k
        return ret, s % self.vocab_size
    

    Here feature_size is the dimensionality of the input features, and vocab_size is the number of classes.

    toy_softmax.py attacks the problem with an ordinary softmax; its network structure is:

    def get_net(vocab_size):
        data = mx.sym.Variable('data')
        label = mx.sym.Variable('label')
        pred = mx.sym.FullyConnected(data = data, num_hidden = 100)
        pred = mx.sym.FullyConnected(data = pred, num_hidden = vocab_size)
        sm = mx.sym.SoftmaxOutput(data = pred, label = label)
        return sm
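
    To reproduce the throughput numbers, something like the following works (a minimal sketch of mine, not code from the repo: the mock generator is inlined from mock_sample above, and all the sizes are arbitrary). mx.callback.Speedometer prints samples/sec during training:

    import random
    import numpy as np
    import mxnet as mx

    feature_size, vocab_size, batch_size, n = 100, 10000, 128, 50000

    def mock_sample():
        ret = np.zeros(feature_size, dtype='float32')
        rn = set()
        while len(rn) < 3:
            rn.add(random.randint(0, feature_size - 1))
        s = 0
        for k in rn:
            ret[k] = 1.0
            s = s * feature_size + k
        return ret, s % vocab_size

    data = np.zeros((n, feature_size), dtype='float32')
    label = np.zeros(n, dtype='float32')
    for i in range(n):
        data[i], label[i] = mock_sample()

    train = mx.io.NDArrayIter(data={'data': data}, label={'label': label},
                              batch_size=batch_size)
    mod = mx.mod.Module(symbol=get_net(vocab_size),
                        data_names=['data'], label_names=['label'])
    mod.fit(train, num_epoch=1,
            optimizer='sgd', optimizer_params={'learning_rate': 0.1},
            batch_end_callback=mx.callback.Speedometer(batch_size, 50))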
    

    The relationship between training speed and the number of classes:

    Number of classes    Samples/sec
    100                  40000
    1000                 30000
    10000                10000
    100000               1000

    As you can see, raising the number of classes from 10000 to 100000 cuts the speed to a tenth of what it was.

    Now look at toy_nce.py, whose network structure is:

    def get_net(vocab_size, num_label):
        data = mx.sym.Variable('data')
        label = mx.sym.Variable('label')
        label_weight = mx.sym.Variable('label_weight')
        embed_weight = mx.sym.Variable('embed_weight')
        pred = mx.sym.FullyConnected(data = data, num_hidden = 100)
        return nce_loss(data = pred,
                        label = label,
                        label_weight = label_weight,
                        embed_weight = embed_weight,
                        vocab_size = vocab_size,
                        num_hidden = 100,
                        num_label = num_label)
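
    Note that this network takes two label inputs, so binding it differs slightly from the softmax case. A sketch of mine (shapes inferred from the code above; the values are placeholders that only show the layout): each sample carries num_label candidate classes, the first being the true one:

    import numpy as np
    import mxnet as mx

    vocab_size, num_label, batch_size, n = 10000, 6, 128, 50000

    data = np.random.rand(n, 100).astype('float32')  # placeholder features
    # label[i]        = [true_class, neg_1, ..., neg_{num_label-1}]
    # label_weight[i] = [1, 0, ..., 0]
    label = np.random.randint(0, vocab_size, (n, num_label)).astype('float32')
    label_weight = np.zeros((n, num_label), dtype='float32')
    label_weight[:, 0] = 1.0

    train = mx.io.NDArrayIter(data={'data': data},
                              label={'label': label,
                                     'label_weight': label_weight},
                              batch_size=batch_size)
    mod = mx.mod.Module(symbol=get_net(vocab_size, num_label),
                        data_names=['data'],
                        label_names=['label', 'label_weight'])

    From here, training proceeds as in toy_nce.py.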
    

    The nce_loss it calls is defined as follows:

    def nce_loss(data, label, label_weight, embed_weight, vocab_size, num_hidden, num_label):
        # Embed the num_label candidate labels (1 positive + negatives)
        # with the shared embedding matrix.
        label_embed = mx.sym.Embedding(data = label, input_dim = vocab_size,
                                       weight = embed_weight,
                                       output_dim = num_hidden, name = 'label_embed')
        # Split the embedded labels and their 0/1 targets into num_label slices.
        label_embed = mx.sym.SliceChannel(data = label_embed,
                                          num_outputs = num_label,
                                          squeeze_axis = 1, name = 'label_slice')
        label_weight = mx.sym.SliceChannel(data = label_weight,
                                           num_outputs = num_label,
                                           squeeze_axis = 1)
        probs = []
        for i in range(num_label):
            vec = label_embed[i]
            vec = vec * data                 # elementwise product ...
            vec = mx.sym.sum(vec, axis = 1)  # ... then sum: one dot product per sample
            # Binary logistic loss per candidate: target 1 for the true label,
            # 0 for the sampled negatives.
            sm = mx.sym.LogisticRegressionOutput(data = vec,
                                                 label = label_weight[i])
            probs.append(sm)
        return mx.sym.Group(probs)
    

    The main idea of NCE: for each sample, besides its own label, sample N other labels; then we only need to compute the sample's probability on these N+1 labels rather than on every label, and each per-label probability is trained with a logistic loss. A concrete way to build the candidate labels is sketched below.
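
    For concreteness, the per-sample label rows could be built like this (a sketch of mine with plain uniform negatives; word2vec classically samples negatives from the unigram distribution raised to the 3/4 power):

    import random

    def sample_labels(pos, vocab_size, num_label):
        # Slot 0 holds the true class (target 1); the remaining slots hold
        # num_label - 1 sampled negatives (target 0).
        labels = [pos]
        while len(labels) < num_label:
            neg = random.randint(0, vocab_size - 1)
            if neg != pos:
                labels.append(neg)
        weights = [1.0] + [0.0] * (num_label - 1)
        return labels, weights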

    Now compare NCE's speed against the number of classes:

    Number of classes    Samples/sec
    100                  30000
    1000                 30000
    10000                30000
    100000               20000

    As you can see, NCE's speed is essentially insensitive to the number of classes.

    With the NCE loss in place, we can train word2vec with mxnet. word2vec's CBOW model predicts a word from the N words around it, which leads to the following network:

    def get_net(vocab_size, num_input, num_label):
        data = mx.sym.Variable('data')
        label = mx.sym.Variable('label')
        label_weight = mx.sym.Variable('label_weight')
        embed_weight = mx.sym.Variable('embed_weight')
        data_embed = mx.sym.Embedding(data = data, input_dim = vocab_size,
                                      weight = embed_weight,
                                      output_dim = 100, name = 'data_embed')
        datavec = mx.sym.SliceChannel(data = data_embed,
                                         num_outputs = num_input,
                                         squeeze_axis = 1, name = 'data_slice')
        # CBOW: the hidden vector is the sum of the context-word embeddings.
        pred = datavec[0]
        for i in range(1, num_input):
            pred = pred + datavec[i]
        return nce_loss(data = pred,
                        label = label,
                        label_weight = label_weight,
                        embed_weight = embed_weight,
                        vocab_size = vocab_size,
                        num_hidden = 100,
                        num_label = num_label)
    

    In this structure the input is num_input words and the output is num_label words, of which one is the positive sample and the rest are negatives. Note that the input embedding and the label embedding share the same weight matrix embed_weight.

    Run wordvec.py (text8 must be placed under ./data/) to see the training results.
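
    Once training is done, the word vectors are simply the rows of the shared embed_weight matrix. A small sketch of mine for nearest-neighbor queries (mod is the trained Module; vocab, a word-to-index dict, is assumed to exist):

    import numpy as np

    arg_params, aux_params = mod.get_params()
    embed = arg_params['embed_weight'].asnumpy()   # shape: (vocab_size, 100)
    embed = embed / np.linalg.norm(embed, axis=1, keepdims=True)

    def nearest(word, topk=5):
        v = embed[vocab[word]]
        sims = embed.dot(v)                   # cosine similarity to every word
        return np.argsort(-sims)[1:topk + 1]  # indices, skipping the word itself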

    Continuing the word2vec line of thought, we can also put the NCE loss on top of an LSTM. The network structure:

    def get_net(vocab_size, seq_len, num_label, num_lstm_layer, num_hidden):
        param_cells = []
        last_states = []
        for i in range(num_lstm_layer):
            param_cells.append(LSTMParam(i2h_weight=mx.sym.Variable("l%d_i2h_weight" % i),
                                         i2h_bias=mx.sym.Variable("l%d_i2h_bias" % i),
                                         h2h_weight=mx.sym.Variable("l%d_h2h_weight" % i),
                                         h2h_bias=mx.sym.Variable("l%d_h2h_bias" % i)))
            state = LSTMState(c=mx.sym.Variable("l%d_init_c" % i),
                              h=mx.sym.Variable("l%d_init_h" % i))
            last_states.append(state)
            
        data = mx.sym.Variable('data')
        label = mx.sym.Variable('label')
        label_weight = mx.sym.Variable('label_weight')
        embed_weight = mx.sym.Variable('embed_weight')
        label_embed_weight = mx.sym.Variable('label_embed_weight')
        data_embed = mx.sym.Embedding(data = data, input_dim = vocab_size,
                                      weight = embed_weight,
                                      output_dim = 100, name = 'data_embed')
        datavec = mx.sym.SliceChannel(data = data_embed,
                                      num_outputs = seq_len,
                                      squeeze_axis = True, name = 'data_slice')
        labelvec = mx.sym.SliceChannel(data = label,
                                       num_outputs = seq_len,
                                       squeeze_axis = True, name = 'label_slice')
        labelweightvec = mx.sym.SliceChannel(data = label_weight,
                                             num_outputs = seq_len,
                                             squeeze_axis = True, name = 'label_weight_slice')
        probs = []
        for seqidx in range(seq_len):
            # Feed the embedding of time step seqidx through the LSTM stack.
            hidden = datavec[seqidx]
            
            for i in range(num_lstm_layer):
                next_state = lstm(num_hidden, indata = hidden,
                                  prev_state = last_states[i],
                                  param = param_cells[i],
                                  seqidx = seqidx, layeridx = i)
                hidden = next_state.h
                last_states[i] = next_state
                
            # Attach an NCE loss at every time step; the outputs of the
            # returned group are collected into probs.
            probs += nce_loss(data = hidden,
                              label = labelvec[seqidx],
                              label_weight = labelweightvec[seqidx],
                              embed_weight = label_embed_weight,
                              vocab_size = vocab_size,
                              num_hidden = 100,
                              num_label = num_label)
        return mx.sym.Group(probs)
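
    Since the unrolled graph takes the initial LSTM states as extra inputs, they must be declared when binding the module. A sketch of mine, following the shape convention of the standard MXNet LSTM examples:

    batch_size, num_lstm_layer, num_hidden = 128, 2, 100

    # (name, shape) pairs appended to the data iterator's provide_data;
    # zero arrays are fed for them at every batch.
    init_c = [('l%d_init_c' % i, (batch_size, num_hidden))
              for i in range(num_lstm_layer)]
    init_h = [('l%d_init_h' % i, (batch_size, num_hidden))
              for i in range(num_lstm_layer)]
    init_states = init_c + init_h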
    

