Train item-sequence embeddings with word2vec-style negative sampling.
Dataset
We use the open MovieLens 1M Dataset: the watch sequences of roughly 6,000 users over about 4,000 movies, close to 1 million ratings in total. Negative sampling over these watch sequences produces the training set for the word2vec model.
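Before sampling, each user's ratings have to be turned into a chronological watch sequence. A minimal sketch of that grouping step (parsing `ratings.dat`, whose lines have the form `UserID::MovieID::Rating::Timestamp`, is assumed to have happened already; here the input is a list of (uid, movie_id, timestamp) tuples, and `build_sequences` is our helper name):

```python
from collections import defaultdict

def build_sequences(ratings):
    """Group ratings by user and sort by timestamp to get watch sequences.

    `ratings` is an iterable of (uid, movie_id, timestamp) tuples.
    Returns {uid: [movie_id, ...]} in chronological order.
    """
    per_user = defaultdict(list)
    for uid, movie_id, ts in ratings:
        per_user[uid].append((ts, movie_id))
    # sort each user's movies by timestamp, then drop the timestamps
    return {uid: [m for _, m in sorted(ls)] for uid, ls in per_user.items()}

# toy example
ratings = [(1, 10, 300), (1, 20, 100), (1, 30, 200), (2, 10, 50)]
group = build_sequences(ratings)
# group == {1: [20, 30, 10], 2: [10]}
```

The resulting dict is the `group` argument consumed by `sample()` below.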
Generating the training set
Concretely, fix a context window length; each (center item, context item) pair inside the window forms a positive sample, while the center item paired with a randomly sampled item forms a negative sample. Reference code:
def sample(group, context_length):
    """Slide a window of size context_length over each user's watch
    sequence and write the generated center|item|label triples to disk."""
    f = open("word2VecTrainData.txt", "w+")
    for uid, movieLs in group.items():
        i, j = 0, context_length
        total = 0
        while j <= len(movieLs):  # <= so the last full window is included
            sample = context_sample(movieLs[i:j])
            i += 1
            j += 1
            total += len(sample)
            for s in sample:
                f.write("%s\n" % s)
        print("user {}'s samples are done, {} samples".format(uid, total))
    f.close()
from random import choice  # movieIds: sorted list of all movie ids

def context_sample(sequence):
    # the window length should be odd so the center item is well defined
    positive = []
    center = sequence[len(sequence) // 2]
    for i in sequence:
        if i == center: continue
        positive.append("{}|{}|{}".format(center, i, 1))
    negative = []
    n = len(sequence) - 1
    while len(negative) < n * 3:  # 3 negatives per positive
        # sample from the actual id list: MovieLens movie ids are not contiguous
        t = choice(movieIds)
        if t in sequence: continue
        negative.append("{}|{}|{}".format(center, t, 0))
    return positive + negative
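A quick sanity check on sample volume: each full window of size w yields w − 1 positive pairs and 3(w − 1) negative pairs, so the per-user total can be estimated up front. A small sketch (`expected_sample_count` is our helper name, not from the original code):

```python
def expected_sample_count(n_movies, window, neg_ratio=3):
    """Samples one user produces under the sliding-window scheme:
    each full window of size `window` yields (window - 1) positives
    plus neg_ratio * (window - 1) negatives."""
    n_windows = max(0, n_movies - window + 1)
    return n_windows * (window - 1) * (1 + neg_ratio)

# a user with 100 watched movies and a window of 5:
expected_sample_count(100, 5)  # 96 windows * 4 positives * 4 = 1536
```

This also shows why long watch sequences dominate the training file: sample count grows linearly with sequence length.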
Model
def word2vec1():
    # input is a (center, context) id pair
    input = keras.Input(shape=(2,), dtype=tf.int32, name="center")
    x1 = layers.Embedding(output_dim=embedding_dim,  # embedding_dim = 10
                          input_dim=movieLens,       # vocabulary size
                          input_length=2)(input)
    x = layers.Flatten()(x1)
    x = layers.Dense(10, activation="relu")(x)
    output = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs=input, outputs=output)
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    model.summary()
    return model
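To train, the triples written by `sample()` are read back into arrays of (center, context) pairs and 0/1 labels. A minimal loader sketch (the `fit()` hyperparameters in the commented usage are assumptions, not taken from the original run):

```python
import numpy as np

def load_samples(path="word2VecTrainData.txt"):
    """Parse the 'center|context|label' lines written by sample() above."""
    pairs, labels = [], []
    with open(path) as f:
        for line in f:
            center, context, label = line.strip().split("|")
            pairs.append((int(center), int(context)))
            labels.append(int(label))
    return np.array(pairs, dtype="int32"), np.array(labels, dtype="float32")

# usage (hyperparameters are illustrative):
# X, y = load_samples()
# model = word2vec1()
# model.fit(X, y, epochs=5, validation_split=0.1, verbose=2)
```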
Training results:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
center (InputLayer) [(None, 2)] 0
_________________________________________________________________
embedding (Embedding) (None, 2, 10) 40000
_________________________________________________________________
flatten (Flatten) (None, 20) 0
_________________________________________________________________
dense (Dense) (None, 10) 210
_________________________________________________________________
dense_1 (Dense) (None, 1) 11
=================================================================
Total params: 40,221
Trainable params: 40,221
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
152677/152677 - 180s - loss: 0.3486 - accuracy: 0.8411 - val_loss: 0.3680 - val_accuracy: 0.8368
Epoch 2/5
152677/152677 - 206s - loss: 0.3285 - accuracy: 0.8551 - val_loss: 0.3559 - val_accuracy: 0.8434
Epoch 3/5
152677/152677 - 228s - loss: 0.3249 - accuracy: 0.8575 - val_loss: 0.3548 - val_accuracy: 0.8441
Epoch 4/5
152677/152677 - 225s - loss: 0.3230 - accuracy: 0.8586 - val_loss: 0.3542 - val_accuracy: 0.8448
Epoch 5/5
152677/152677 - 217s - loss: 0.3222 - accuracy: 0.8590 - val_loss: 0.3518 - val_accuracy: 0.8466
Final validation accuracy: 0.846561
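After training, the item vectors live in the Embedding layer's weight matrix, and similar movies can be retrieved by cosine similarity. A sketch assuming the layer keeps Keras's default name "embedding" as in the summary above (`most_similar` is our helper, and the embedding row index is the movie id):

```python
import numpy as np

def most_similar(embeddings, movie_id, k=5):
    """Top-k rows of `embeddings` by cosine similarity to row `movie_id`.

    `embeddings` would be the trained weight matrix, e.g.
    model.get_layer("embedding").get_weights()[0].
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-8, None)  # L2-normalize rows
    sims = unit @ unit[movie_id]                    # cosine similarities
    sims[movie_id] = -np.inf                        # exclude the query itself
    return np.argsort(-sims)[:k]
```

On the toy matrix below, row 1 points almost the same way as row 0 while row 2 is orthogonal, so row 1 ranks first.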
The full code is available on GitHub.