Train item-sequence embeddings with word2vec-style negative sampling.
Dataset
We use the open MovieLens 1M Dataset: the watch sequences of roughly 6,000 users over about 4,000 movies, close to 1 million ratings in total. Negative sampling over these watch sequences produces the training set for the word2vec model.
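Before sampling, each user's ratings have to be turned into a chronological watch sequence. A minimal sketch of that grouping step (parsing `ratings.dat`, whose lines have the form `UserID::MovieID::Rating::Timestamp`, is assumed to have happened already; here the input is a list of (uid, movie_id, timestamp) tuples, and `build_sequences` is our helper name):

```python
from collections import defaultdict

def build_sequences(ratings):
    """Group ratings by user and sort by timestamp to get watch sequences.

    `ratings` is an iterable of (uid, movie_id, timestamp) tuples.
    Returns {uid: [movie_id, ...]} in chronological order.
    """
    per_user = defaultdict(list)
    for uid, movie_id, ts in ratings:
        per_user[uid].append((ts, movie_id))
    # sort each user's movies by timestamp, then drop the timestamps
    return {uid: [m for _, m in sorted(ls)] for uid, ls in per_user.items()}

# toy example
ratings = [(1, 10, 300), (1, 20, 100), (1, 30, 200), (2, 10, 50)]
group = build_sequences(ratings)
# group == {1: [20, 30, 10], 2: [10]}
```

The resulting dict is the `group` argument consumed by `sample()` below.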
Generating the training set
Concretely, fix a context window length; each (center item, context item) pair inside the window forms a positive sample, while the center item paired with a randomly sampled item forms a negative sample. Reference code:
def sample(group, context_length):
    """Slide a window of size context_length over each user's watch
    sequence and write the generated center|item|label triples to disk."""
    f = open("word2VecTrainData.txt", "w+")
    for uid, movieLs in group.items():
        i, j = 0, context_length
        total = 0
        while j <= len(movieLs):  # <= so the last full window is included
            sample = context_sample(movieLs[i:j])
            i += 1
            j += 1
            total += len(sample)
            for s in sample:
                f.write("%s\n" % s)
        print("user {}'s samples are done, {} samples".format(uid, total))
    f.close()
from random import choice  # movieIds: sorted list of all movie ids

def context_sample(sequence):
    # the window length should be odd so the center item is well defined
    positive = []
    center = sequence[len(sequence) // 2]
    for i in sequence:
        if i == center: continue
        positive.append("{}|{}|{}".format(center, i, 1))
    negative = []
    n = len(sequence) - 1
    while len(negative) < n * 3:  # 3 negatives per positive
        # sample from the actual id list: MovieLens movie ids are not contiguous
        t = choice(movieIds)
        if t in sequence: continue
        negative.append("{}|{}|{}".format(center, t, 0))
    return positive + negative
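A quick sanity check on sample volume: each full window of size w yields w − 1 positive pairs and 3(w − 1) negative pairs, so the per-user total can be estimated up front. A small sketch (`expected_sample_count` is our helper name, not from the original code):

```python
def expected_sample_count(n_movies, window, neg_ratio=3):
    """Samples one user produces under the sliding-window scheme:
    each full window of size `window` yields (window - 1) positives
    plus neg_ratio * (window - 1) negatives."""
    n_windows = max(0, n_movies - window + 1)
    return n_windows * (window - 1) * (1 + neg_ratio)

# a user with 100 watched movies and a window of 5:
expected_sample_count(100, 5)  # 96 windows * 4 positives * 4 = 1536
```

This also shows why long watch sequences dominate the training file: sample count grows linearly with sequence length.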
Model
def word2vec1():
    # input is a (center, context) id pair
    input = keras.Input(shape=(2,), dtype=tf.int32, name="center")
    x1 = layers.Embedding(output_dim=embedding_dim,  # embedding_dim = 10
                          input_dim=movieLens,       # vocabulary size
                          input_length=2)(input)
    x = layers.Flatten()(x1)
    x = layers.Dense(10, activation="relu")(x)
    output = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs=input, outputs=output)
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    model.summary()
    return model
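To train, the triples written by `sample()` are read back into arrays of (center, context) pairs and 0/1 labels. A minimal loader sketch (the `fit()` hyperparameters in the commented usage are assumptions, not taken from the original run):

```python
import numpy as np

def load_samples(path="word2VecTrainData.txt"):
    """Parse the 'center|context|label' lines written by sample() above."""
    pairs, labels = [], []
    with open(path) as f:
        for line in f:
            center, context, label = line.strip().split("|")
            pairs.append((int(center), int(context)))
            labels.append(int(label))
    return np.array(pairs, dtype="int32"), np.array(labels, dtype="float32")

# usage (hyperparameters are illustrative):
# X, y = load_samples()
# model = word2vec1()
# model.fit(X, y, epochs=5, validation_split=0.1, verbose=2)
```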
Training results:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
center (InputLayer) [(None, 2)] 0
_________________________________________________________________
embedding (Embedding) (None, 2, 10) 40000
_________________________________________________________________
flatten (Flatten) (None, 20) 0
_________________________________________________________________
dense (Dense) (None, 10) 210
_________________________________________________________________
dense_1 (Dense) (None, 1) 11
=================================================================
Total params: 40,221
Trainable params: 40,221
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
152677/152677 - 180s - loss: 0.3486 - accuracy: 0.8411 - val_loss: 0.3680 - val_accuracy: 0.8368
Epoch 2/5
152677/152677 - 206s - loss: 0.3285 - accuracy: 0.8551 - val_loss: 0.3559 - val_accuracy: 0.8434
Epoch 3/5
152677/152677 - 228s - loss: 0.3249 - accuracy: 0.8575 - val_loss: 0.3548 - val_accuracy: 0.8441
Epoch 4/5
152677/152677 - 225s - loss: 0.3230 - accuracy: 0.8586 - val_loss: 0.3542 - val_accuracy: 0.8448
Epoch 5/5
152677/152677 - 217s - loss: 0.3222 - accuracy: 0.8590 - val_loss: 0.3518 - val_accuracy: 0.8466
Final validation accuracy: 0.846561
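After training, the item vectors live in the Embedding layer's weight matrix, and similar movies can be retrieved by cosine similarity. A sketch assuming the layer keeps Keras's default name "embedding" as in the summary above (`most_similar` is our helper, and the embedding row index is the movie id):

```python
import numpy as np

def most_similar(embeddings, movie_id, k=5):
    """Top-k rows of `embeddings` by cosine similarity to row `movie_id`.

    `embeddings` would be the trained weight matrix, e.g.
    model.get_layer("embedding").get_weights()[0].
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-8, None)  # L2-normalize rows
    sims = unit @ unit[movie_id]                    # cosine similarities
    sims[movie_id] = -np.inf                        # exclude the query itself
    return np.argsort(-sims)[:k]
```

On the toy matrix below, row 1 points almost the same way as row 0 while row 2 is orthogonal, so row 1 ranks first.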
The full code is available on GitHub.