
[Word2Vec Series][EMNLP 2014] GloVe

Author: 宇小宸请加油 | Published 2017-04-13 16:51

    https://pdfs.semanticscholar.org/b397/ed9a08ca46566aa8c35be51e6b466643e5fb.pdf

    Intro

    The two main model families for learning word vectors are: 1) global matrix factorization methods, such as latent semantic analysis (LSA) (Deerwester et al., 1990) and 2) local context window methods, such as the skip-gram model of Mikolov et al. (2013c)
    There are two main families of methods for learning word vectors: global matrix factorization (e.g. LSA) and local context window methods (e.g. skip-gram).
    While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.
    Both families of methods have drawbacks.
    GloVe absorbs the advantages of both to some extent.

    Related

    In the skip-gram and ivLBL models, the objective is to predict a word’s context given the word itself, whereas the objective in the CBOW and vLBL models is to predict a word given its context.
    These are the two Word2Vec algorithms: one (skip-gram) predicts the context from the center word, the other (CBOW) predicts the center word from its context.
    suffer from the disadvantage that they do not operate directly on the co-occurrence statistics of the corpus.
    Their drawback is that they do not use the counts in the co-occurrence matrix directly, so a large amount of repeated statistical information goes unexploited.

    Model

    X: the word-word co-occurrence matrix
    X_ij: the number of times word j occurs in the context of word i
    X_i = sum_k X_ik: the number of times any word appears in the context of word i
    P_ij = P(j|i) = X_ij / X_i: the probability that word j appears in the context of word i
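    To make these definitions concrete, here is a minimal sketch (not code from the paper) that builds X with a symmetric context window and derives P from it; the toy corpus, window size, and function name are invented for the example, and the paper's 1/distance weighting of co-occurrences is omitted.

```python
import numpy as np

def cooccurrence(corpus, window=2):
    """corpus: list of tokenized sentences.
    Returns (vocab, X) where X[i, j] counts how often word j occurs
    within `window` tokens of word i."""
    vocab = {w: idx for idx, w in enumerate(sorted({w for s in corpus for w in s}))}
    X = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for pos, word in enumerate(sent):
            i = vocab[word]
            # context: up to `window` tokens to the left and right of the center word
            for ctx in sent[max(0, pos - window):pos] + sent[pos + 1:pos + 1 + window]:
                X[i, vocab[ctx]] += 1.0
    return vocab, X

corpus = [["ice", "is", "solid"], ["steam", "is", "gas"]]
vocab, X = cooccurrence(corpus)
X_i = X.sum(axis=1, keepdims=True)        # X_i = sum_k X_ik
P = X / np.where(X_i == 0, 1.0, X_i)      # P_ij = X_ij / X_i
```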

    An interesting observation: for P_ik and P_jk with a given probe word k, the ratio P_ik / P_jk can distinguish the words that are related to i (or to j).
    Compared to the raw probabilities, the ratio is better able to distinguish relevant words from irrelevant words and it is also better able to discriminate between the two relevant words.
    For example, take i = ice and j = steam:
    solid is related to ice but not to steam, so P_ik / P_jk is large;
    conversely, gas is unrelated to ice but related to steam, so P_ik / P_jk is small;
    water is related to both ice and steam, while fashion is related to neither, so in both cases P_ik / P_jk is close to 1.

    (The paper tabulates these P_ik / P_jk ratios for ice and steam.)
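    Continuing the toy sketch above, the probe-word test can be computed directly from P. On a real corpus the ratio for solid should be large, the one for gas small, and the ones for water and fashion close to 1; the toy corpus here is far too small to reproduce the paper's numbers, so the values degenerate.

```python
def probability_ratio(P, vocab, i_word, j_word, k_word):
    """Ratio P_ik / P_jk for center words i, j and probe word k."""
    i, j, k = vocab[i_word], vocab[j_word], vocab[k_word]
    return P[i, k] / P[j, k] if P[j, k] > 0 else float("inf")

print(probability_ratio(P, vocab, "ice", "steam", "solid"))  # large (inf on the toy corpus)
print(probability_ratio(P, vocab, "ice", "steam", "gas"))    # small (0 on the toy corpus)
```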

    Define
    $$F(w_i, w_j, \widetilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$
    where the w are word vectors and \widetilde{w}_k is a context word vector; this expresses the ratio in the vector space.
    The left-hand side operates on vectors, while the right-hand side is a scalar.
    Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences
    Manipulating a bit, take the difference of the two word vectors:
    $$F(w_i - w_j, \widetilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$
    While F could be taken to be a complicated function parameterized by, e.g., a neural network, doing so would obfuscate the linear structure we are trying to capture. To avoid this issue, we can first take the dot product of the arguments
    Manipulating again, turn the arguments into a dot product (so that the vector dimensions are not mixed up; the argument of F becomes a single scalar):
    $$F\big((w_i - w_j)^T \widetilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}$$

    <font color=gray>I do not fully understand these manipulation steps</font>

    A word and a context word can in principle exchange roles, and the co-occurrence matrix should satisfy X^T = X, but the equation above does not in general respect this symmetry.
    However, if F is required to be a homomorphism, i.e.
    $$F\big((w_i - w_j)^T \widetilde{w}_k\big) = \frac{F(w_i^T \widetilde{w}_k)}{F(w_j^T \widetilde{w}_k)}$$
    then the symmetry can be restored.
    <font color=gray>I do not fully understand why this is sufficient</font>
    In that case,
    $$F(w_i^T \widetilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}$$
    The solution is F = exp; taking the log of both sides,
    $$w_i^T \widetilde{w}_k = \log(P_{ik}) = \log(X_{ik}) - \log(X_i)$$
    Since $\log(X_i)$ is independent of k, it can be absorbed into a bias term $b_i$ for $w_i$; an analogous bias $\widetilde{b}_k$ is added for $\widetilde{w}_k$.
    The final equation, which restores the symmetry above, is
    $$w_i^T \widetilde{w}_k + b_i + \widetilde{b}_k = \log(X_{ik})$$
    The log can run into log(0); one fix is to replace $\log(X_{ik})$ with $\log(1 + X_{ik})$.

    A main drawback to this model is that it weighs all co-occurrences equally, even those that happen rarely or never.
    The drawback is that rare co-occurrences receive the same weight as common ones.

    The final loss function:
    $$J = \sum_{i,j=1}^{V} f(X_{ij}) \left[ w_i^T \widetilde{w}_j + b_i + \widetilde{b}_j - \log(X_{ij}) \right]^2$$

    $f(X_{ij})$ is the weighting function
    $V$ is the size of the vocabulary
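    As a minimal numpy sketch (an assumed implementation for illustration, not the authors' code), the objective is a weighted least-squares fit of the log co-occurrence counts; `f` is the weighting function defined below, and the loop runs only over non-zero entries since f(0) = 0.

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, f):
    """W, W_ctx: (V, d) word / context-word vectors; b, b_ctx: (V,) biases;
    X: (V, V) co-occurrence counts; f: weighting function."""
    J = 0.0
    rows, cols = np.nonzero(X)  # f(0) = 0, so only X_ij > 0 contributes
    for i, j in zip(rows, cols):
        err = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
        J += f(X[i, j]) * err ** 2
    return J
```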

    f must satisfy three properties:

    1. f(0) = 0, or, viewed as a continuous function, f(x) should tend to 0 as x → 0.
    2. f(x) should be non-decreasing so that rare co-occurrences are not overweighted.
    3. f(x) should be relatively small for large values of x, so that frequent co-occurrences are not overweighted.

    One function that satisfies these properties:
    $$f(x) = \begin{cases} (x/x_{max})^\alpha & x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

    (Plot of f(·), excerpted from the 52nlp blog.)
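    In code the weighting function is a one-liner; the cutoff x_max = 100 and exponent α = 3/4 below are the values reported in the paper. This `f` can be passed directly to the `glove_loss` sketch above.

```python
def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: grows as (x / x_max)^alpha below the cutoff, then saturates at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```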
