Background

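I have recently been studying the Skip-Gram word-vector model, and this is a brief note on coding it up. The coding work for skip-gram breaks down into the following main parts: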

Data preprocessing
Model construction

Data preprocessing

The detailed code can be found at: https://github.com/NELSONZHAO/zhihu/blob/master/skip_gram/Skip-Gram-English-Corpus.ipynb
A few points are worth noting:

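In the punctuation conversion step, note that ' <PERIOD> ' has a space before and after it. After the replacement, split() separates the punctuation token from the neighbouring words, so punctuation marks get word2vec vectors as well.
In the subsampling step, where the sampling probability is computed, the code does not actually sample. It simply filters out the high-frequency words. A real sampling step could be added here, for example by generating a random number and comparing it with the threshold.
Batch construction is slightly more involved. The input word predicts its context words, and the context words used as y need to be deduplicated. For example, take the sentence 今天是个好天气 with x = 个 and a window size of 2: the pairs are (个, 天), (个, 是), (个, 好), (个, 天), so (个, 天) appears twice and has to be deduplicated. This also shows how crude skip-gram is, since it ignores position information. I wrote the batch code for this part by hand; see get_targets and get_batches below.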
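The first two notes above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the function names tokenize and subsample, the <COMMA> token, and the threshold t are assumptions made for the example, and the drop probability 1 - sqrt(t / f(w)) is the form commonly used for word2vec subsampling.

```python
import random
from collections import Counter


def tokenize(text):
    """Replace punctuation with named tokens, then split on whitespace."""
    text = text.lower()
    # The spaces around each token matter: they let split() separate
    # the punctuation token from the word next to it.
    text = text.replace(".", " <PERIOD> ")
    text = text.replace(",", " <COMMA> ")  # <COMMA> is an illustrative token name
    return text.split()


def subsample(words, t=1e-5, rng=None):
    """Randomly drop frequent words instead of filtering them all out.

    A word w is dropped with probability max(0, 1 - sqrt(t / f(w))),
    where f(w) is its relative frequency. t is an assumed threshold.
    """
    if rng is None:
        rng = random.Random()
    counts = Counter(words)
    total = len(words)
    kept = []
    for w in words:
        freq = counts[w] / total
        p_drop = max(0.0, 1.0 - (t / freq) ** 0.5)
        # Compare a uniform random number with the drop probability
        if rng.random() >= p_drop:
            kept.append(w)
    return kept
```

With this approach a frequent word is thinned out rather than removed entirely, so it can still appear as context for rarer words.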

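import numpy as np

def get_targets(words, idx, window_size=5):
    # Draw the actual window radius uniformly from [1, window_size]
    w_s = np.random.randint(1, window_size + 1)
    # Clamp the left edge at 0 so a negative start index does not wrap around
    start = max(idx - w_s, 0)
    # Context = words to the left and right of idx, excluding words[idx] itself
    targets = words[start:idx] + words[(idx + 1):(idx + w_s + 1)]
    # Deduplicate the context words
    return list(set(targets))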

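def get_batches(words, batch_size, window_size):
    # Keep only full batches; trailing words that do not fill a batch are dropped
    n_batches = len(words) // batch_size
    words = words[:n_batches * batch_size]
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:(idx + batch_size)]
        for i in range(len(batch)):
            _y = get_targets(batch, i, window_size)
            # Repeat the input word once per context word so x and y line up
            x.extend([batch[i]] * len(_y))
            y.extend(_y)
        yield x, y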
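
To check the 今天是个好天气 example by hand, here is a small deterministic sketch. It assumes a fixed window instead of the randomly drawn one used in get_targets, and context_pairs is a hypothetical helper name introduced only for this illustration.

```python
def context_pairs(words, idx, window_size):
    """Return (input, context) pairs for words[idx] with duplicates removed."""
    start = max(idx - window_size, 0)
    context = words[start:idx] + words[idx + 1:idx + window_size + 1]
    # dict.fromkeys removes duplicates while keeping first-seen order
    unique = list(dict.fromkeys(context))
    return [(words[idx], c) for c in unique]


sentence = list("今天是个好天气")
pairs = context_pairs(sentence, 3, 2)
print(pairs)  # [('个', '天'), ('个', '是'), ('个', '好')]
```

The raw context for 个 is 天, 是, 好, 天; after deduplication only three pairs remain, which also makes it visible that the model cannot tell the left 天 from the right one.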