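Concepts

word2vec is an NLP tool released by Google in 2013.
Features
1. Represents words as vectors, so relationships between words can be measured quantitatively and connections between them uncovered
2. Has a distance property: similarity in the vector space can stand in for semantic similarity in the text
3. Preserves context information, so each word carries richer information
4. Gives text data a deeper feature representation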
5. Supports vector arithmetic: United States - Boston + London ≈ United Kingdom
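
To make the distance and arithmetic properties concrete, here is a minimal sketch using hand-made toy vectors. The vector values, the word list, and the helper names `cosine` and `analogy` are all assumptions for illustration; real word2vec vectors are learned and have far more dimensions.

```python
import math

# Hand-made toy vectors (assumed values, NOT trained embeddings).
# Dimensions loosely read as: [country-ness, city-ness, region]
vectors = {
    "usa":     [1.0, 0.0, -0.5],
    "uk":      [1.0, 0.0, 0.5],
    "canada":  [1.0, 0.0, -0.3],
    "boston":  [0.0, 1.0, -0.5],
    "london":  [0.0, 1.0, 0.5],
    "chicago": [0.1, 0.9, -0.5],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(base, minus, plus, vocab):
    """Return the word closest to base - minus + plus, excluding the three inputs."""
    target = [b - m + p for b, m, p in zip(vocab[base], vocab[minus], vocab[plus])]
    candidates = [w for w in vocab if w not in (base, minus, plus)]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

# Countries sit closer to each other than to cities in this toy space.
country_sim = cosine(vectors["usa"], vectors["uk"])
city_sim = cosine(vectors["usa"], vectors["boston"])

# country - its city + another city lands on the other city's country.
result = analogy(base="usa", minus="boston", plus="london", vocab=vectors)
```

With trained vectors the match is only approximate, which is why the nearest neighbour of the computed point is taken rather than an exact equality.
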
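Applications: sequence data with strong local correlation
1. Clustering, finding synonyms, part-of-speech analysis
2. Text sequences: neighbouring words are strongly correlated, so the target word can be predicted from its context (fill in the blank)
3. Social networks: generate sequences with random walks, then train a vector for each node with word2vec.
4. Recommender systems and advertising (app download sequences: word2vec + similarity = aggr to )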
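
The social-network case can be sketched as follows: walk the graph at random and treat each walk as a "sentence" of node ids. This is only an illustration under stated assumptions: the toy graph, the uniform choice of the next neighbour, and the names `random_walk` and `build_corpus` are all hypothetical, not taken from any particular library.

```python
import random

# Toy undirected friendship graph as an adjacency list (assumed data).
graph = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "carol", "dave"],
    "carol": ["alice", "bob"],
    "dave": ["bob"],
}

def random_walk(graph, start, length, rng):
    """Walk `length` nodes from `start`, picking each next node uniformly at random."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph[walk[-1]]
        if not neighbours:  # dead end: stop early
            break
        walk.append(rng.choice(neighbours))
    return walk

def build_corpus(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node; each walk is one 'sentence'."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)  # vary the order in which start nodes are visited
        for node in nodes:
            corpus.append(random_walk(graph, node, length, rng))
    return corpus

corpus = build_corpus(graph, walks_per_node=2, length=5)
```

The resulting `corpus` is a list of token lists, which is the same shape of input a word2vec trainer expects for ordinary text.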

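Word vector basics

one-hot representation
1. Sparse, discrete
2. Curse of dimensionality (the vector length equals the vocabulary size)
distributed representation
1. Dense, continuous
2. Learned by training, which maps each word into a much smaller space
3. Training method: a neural network (3 layers: input, hidden, output; the output layer is a softmax, used to pick the neuron with the highest probability)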
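
The link between the two representations can be shown in a few lines: multiplying a one-hot vector by the input weight matrix just selects one row, and that row is the word's dense vector. The vocabulary, the weight values, and the names `one_hot`, `W_in` and `hidden_layer` below are made up for illustration.

```python
vocab = ["cat", "dog", "mat", "sat"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: all zeros except a 1 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec

# Assumed input-to-hidden weights: vocabulary size (4) x hidden size (2).
# In a trained model these values are learned; here they are arbitrary.
W_in = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.5, 0.5],
]

def hidden_layer(x, W):
    """Compute x^T W for a row vector x and matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

sparse = one_hot("dog")               # length 4, one non-zero entry
dense = hidden_layer(sparse, W_in)    # length 2, equal to row 1 of W_in
```

This is why implementations skip the multiplication and index the matrix directly.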

cbow vs skip-gram

CBOW and skip-gram are simply two ways of labeling the training data; both turn word-vector learning into a supervised learning problem.

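CBOW	Skip-gram
Context -> word	Word -> context
Needs a larger training set	Works with a smaller training set
Faster to train, more accurate on frequent words	Better on infrequent words
Needs more context to condition on	Produces more samples from limited data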
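
A small sketch of how the same sentence is labeled under each scheme. It assumes a fixed symmetric window; the helper names `context_window`, `cbow_samples` and `skipgram_samples` are hypothetical, and real implementations add details such as subsampling that are left out here.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_samples(tokens, window):
    """CBOW: one sample per position, (context words) -> target word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_samples(tokens, window):
    """Skip-gram: one sample per (target word, context word) pair."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
cbow = cbow_samples(tokens, window=2)
skipgram = skipgram_samples(tokens, window=2)

# Same sentence, different number of labeled samples.
n_cbow, n_skipgram = len(cbow), len(skipgram)
```

Here the six-word sentence gives 6 CBOW samples but 18 skip-gram samples, which matches the last row of the table: skip-gram gets more training samples out of the same data.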

hierarchical softmax vs negative sampling

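Training the plain NN is very slow: every step has to compute the softmax probability of every word in the vocabulary and then find the largest one, which is computationally expensive.

Hence two improved algorithms: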

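hierarchical softmax (based on a Huffman tree)
- Changes
  - Input layer to hidden layer: the context vectors are simply summed and averaged
  - Hidden layer to softmax layer: replaced by a Huffman tree
- Much faster than the full softmax (roughly O(log V) instead of O(V) per prediction); frequent words sit near the root, so their paths are shortest. For vector quality, Mikolov's notes describe it as better for infrequent words (see the stats.stackexchange link in the references)
negative sampling
- Change: sample a few negative examples to cut the computation (somewhat in the spirit of mini-batch gradient descent)
- The cost per update is constant, which avoids the long Huffman paths that low-frequency words get; for vector quality, Mikolov's notes describe it as better for frequent words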
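
Both ideas can be illustrated on a toy vocabulary. The sketch below builds a Huffman tree from word counts and reports each word's depth (the number of binary decisions hierarchical softmax makes for it), then draws negatives for one target word. The counts and the function names are assumptions; the 0.75 exponent on the unigram counts follows the original word2vec paper and is not stated in these notes.

```python
import heapq
import itertools
import random

# Toy word counts (assumed data).
counts = {"the": 50, "cat": 20, "sat": 15, "mat": 10, "zebra": 3, "quokka": 2}

def huffman_code_lengths(counts):
    """Return {word: depth of its leaf in a Huffman tree built from counts}."""
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(c, next(tie), {w: 0}) for w, c in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every leaf in them gets one level deeper.
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        merged = {w: depth + 1 for w, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, next(tie), merged))
    return heap[0][2]

def draw_negatives(counts, target, k, rng, power=0.75):
    """Draw k negative words with probability proportional to count**power, never the target."""
    words = [w for w in counts if w != target]
    weights = [counts[w] ** power for w in words]
    return rng.choices(words, weights=weights, k=k)

depths = huffman_code_lengths(counts)
negatives = draw_negatives(counts, target="cat", k=5, rng=random.Random(0))
```

In this toy tree "the" ends up at depth 1 while "zebra" and "quokka" end up at depth 5, whereas negative sampling always touches the target plus `k` sampled words regardless of how rare the target is.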

gensim with word2vec

gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=())

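gensim defaults to CBOW + negative sampling (sg=0, hs=0, negative=5).

A size of at most 300 is enough. (gensim 4.0 and later rename size to vector_size and iter to epochs.)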

References:

How word2vec works

The past and present of word2vec

What are the applications of word2vec?

https://www.quora.com/What-are-the-continuous-bag-of-words-and-skip-gram-architectures

https://www.quora.com/How-does-word2vec-work-Can-someone-walk-through-a-specific-example

https://stackoverflow.com/questions/39224236/word2vec-cbow-skip-gram-performance-wrt-training-dataset-size

https://stats.stackexchange.com/questions/180076/why-is-hierarchical-softmax-better-for-infrequent-words-while-negative-sampling

https://stackoverflow.com/questions/26569299/word2vec-number-of-dimensions

update 2018-5-19

According to Mikolov:

Skip-gram: works well with small amount of the training data, represents well even rare words or phrases.

CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words

This can get even a bit more complicated if you consider that there are two different ways how to train the models: the normalized hierarchical softmax, and the un-normalized negative sampling. Both work quite differently.

which makes sense since with skip gram, you can create a lot more training instances from limited amount of data, and for CBOW, you will need more since you are conditioning on context, which can get exponentially huge.

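skip-gram: suited to small training sets, good on low-frequency words

cbow: faster than skip-gram, good on high-frequency words

skip-gram can produce more training samples from limited data, while cbow needs more context to condition on

A plain-language explanation of what word2vec is actually doing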

CBOW v.s. skip-gram: why invert context and target words?

http://qr.ae/TUTgNG