
Chatbots: word2vec

Author: 魏鹏飞 | Published 2019-10-24 11:12

1. Why use word2vec?

Distributed representations: each word is mapped to a dense, low-dimensional vector, so that semantically related words end up close together, unlike sparse one-hot representations.

2. How To Learn Word2Vec?

We are working on NLP project, it is interesting.

The intuition: words that appear near each other tend to be more similar in meaning. (For problems such as coreference, this assumption may not hold.)

3. CBOW Model (less commonly used: the surrounding words predict the center word)

We are working on NLP project, it is interesting
We are _ on NLP project, it is interesting
We are working _ NLP project, it is interesting
We are working on _ project, it is interesting
We are working on NLP _, it is interesting
We are working on NLP project, _ is interesting
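
The fill-in-the-blank view above amounts to building (context, center) training pairs. A minimal Python sketch (not from the original post), assuming whitespace tokenization and a hypothetical window of two words per side:

    # Build (context words, center word) training pairs for CBOW.
    def cbow_pairs(tokens, window=2):
        """Yield (context_words, center_word) pairs with a fixed window per side."""
        for i, center in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            yield context, center

    tokens = "We are working on NLP project , it is interesting".split()
    for context, center in cbow_pairs(tokens):
        print(context, "->", center)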

4. Skip-Gram Model (commonly used: the center word predicts the surrounding words)

Text:We are working on NLP project, it is interesting
_ _ working _ _ project, it is interesting
We _ _ on _ _, it is interesting
We are _ _ NLP _, _ is interesting
We are working _ _ project, _ _ interesting
We are working on _ _, it _ _
Objective: in mathematical form, we maximize the following product.
P(We|working)P(are|working)P(on|working)P(NLP|working)\\P(are|on)P(working|on)P(NLP|on)P(project|on)\\P(working|NLP)P(on|NLP)P(project|NLP)P(it|NLP)...

arg\max_\theta\prod_{w\in Text}\prod_{c\in context(w)}P(c|w;\theta)\tag{4.01}
arg\max_\theta\sum_{w\in Text}\sum_{c\in context(w)}logP(c|w;\theta)\tag{4.02}
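
A minimal sketch of enumerating the (w, c) pairs behind Eqs. (4.01)-(4.02), assuming a hypothetical window of two words per side, which reproduces factors like P(We|working) ... P(NLP|working) above:

    def skipgram_pairs(tokens, window=2):
        """Yield (center_word, context_word) pairs for Skip-Gram."""
        for i, center in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    yield center, tokens[j]

    tokens = "We are working on NLP project , it is interesting".split()
    for w, c in skipgram_pairs(tokens):
        print(f"P({c}|{w})", end=" ")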

Example

4.1 Model Construction
Text = (今天 天气 很好 今天 上 NLP 课程 NLP 是 目前 最 火 的 方向)
window_size = 1, w = center word, c = context word

arg\max_\theta P(天气|今天)P(今天|天气)P(很好|天气)P(天气|很好)\\P(今天|很好)P(很好|今天)P(上|今天)P(今天|上)P(NLP|上)\\···P(最|火)P(的|火)P(火|的)P(方向|的)P(的|方向) \\ = arg\max_\theta \prod_{w\in Text}\prod_{c\in context(w)}P(c|w;\theta) \\ \iff arg\max_\theta log\prod_{w\in Text}\prod_{c\in context(w)}P(c|w;\theta) \\ \iff arg\max_\theta \sum_{w\in Text}\sum_{c\in context(w)}logP(c|w;\theta) \tag{4.11}
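
Reusing the hypothetical skipgram_pairs sketch from Section 4 with window=1 reproduces exactly the factors listed in Eq. (4.11):

    tokens = "今天 天气 很好 今天 上 NLP 课程 NLP 是 目前 最 火 的 方向".split()
    for w, c in skipgram_pairs(tokens, window=1):
        print(f"P({c}|{w})", end=" ")
    # P(天气|今天) P(今天|天气) P(很好|天气) P(天气|很好) P(今天|很好) ...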

P(c|w;\theta) can be written as a softmax, as follows:
P(c|w;\theta) = \frac{e^{u_c \cdot v_w}}{\sum_{c'\in \text{Vocab}}e^{u_{c'} \cdot v_w}}, \quad 0 < P(c|w;\theta) < 1, \quad \sum_{c'\in \text{Vocab}}P(c'|w;\theta) = 1 \tag{4.12}
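
A minimal numpy sketch of Eq. (4.12), assuming (hypothetically) two embedding matrices: U holds the context vectors u_c and V holds the center vectors v_w, one row per vocabulary word:

    import numpy as np

    def softmax_prob(U, V, w_idx, c_idx):
        """P(c|w) = exp(u_c . v_w) / sum_{c'} exp(u_{c'} . v_w)   (Eq. 4.12)."""
        scores = U @ V[w_idx]                  # u_{c'} . v_w for every c' in the vocabulary
        scores -= scores.max()                 # subtract max for numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs[c_idx]

    rng = np.random.default_rng(0)
    U = rng.normal(size=(8, 5))   # toy sizes: |V| = 8 words, d = 5 dimensions
    V = rng.normal(size=(8, 5))
    print(softmax_prob(U, V, w_idx=2, c_idx=3))   # a valid probability in (0, 1)

Note that the denominator touches every vocabulary word; this O(|V|) cost is exactly what Section 4.2 addresses.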

4.2 Training Optimization: Negative Sampling (commonly used) and Hierarchical Softmax (less commonly used)
arg\max_\theta \sum_{w\in Text}\sum_{c\in context(w)}logP(c|w;\theta) \\ = arg\max_\theta \sum_{w\in Text}\sum_{c\in context(w)}log\frac{e^{u_c \cdot v_w}}{\sum_{c'\in \text{Vocab}}e^{u_{c'} \cdot v_w}} \\ = arg\max_\theta \sum_{w\in Text}\sum_{c\in context(w)}[u_c \cdot v_w-log\sum_{c'\in \text{Vocab}}e^{u_{c'} \cdot v_w}] \tag{4.2.1}

We could estimate the parameters with SGD at this point, but the current objective arg\max_\theta \sum_{w\in Text}\sum_{c\in context(w)}[u_c \cdot v_w-log\sum_{c'\in \text{Vocab}}e^{u_{c'} \cdot v_w}] is expensive to evaluate: the cost is roughly len(Text) · window_size · O(|V|), because the normalization term sums over the entire vocabulary for every (w, c) pair; with a 10^5-word vocabulary, each single pair already needs 10^5 dot products just for the denominator. We therefore need a cheaper way to train.

Transforming the objective function: Negative Sampling

Negative Sampling recasts training as binary classification: given a word pair (w_i, w_j), predict whether it is a true (center, context) pair observed in the corpus (D = 1) or a randomly generated one (D = 0). How do we express this probability?
Answer: with the conditional-probability form used in logistic regression, i.e. a sigmoid of the dot product; for true pairs we want it to be as large as possible:

(w_i,w_j)\rightarrow P(D=1|w_i,w_j;\theta) = \frac{1}{1+exp(-u_{w_i}·v_{w_j})}\tag{4.2.2}
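
A one-line sketch of Eq. (4.2.2), where u_i and v_j stand for the (hypothetical) embedding rows of w_i and w_j:

    import numpy as np

    def pair_prob(u_i, v_j):
        """P(D = 1 | w_i, w_j; theta) = sigmoid(u_i . v_j)   (Eq. 4.2.2)."""
        return 1.0 / (1.0 + np.exp(-np.dot(u_i, v_j)))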

4.3 Negative Sampling (sample only a subset of the negative pairs)
e.g.: I am a student
vocab = [I, am, a, student]

D = \{(I, am), (am, I), (am, a), (a, am), (a, student), (student, a) \}
\tilde{D} = \{(I, a), (I, student), (am, student), (a, I), (student, I), (student, am)\}

arg\max_\theta \prod_{(w,c)\in D}P(D=1|w,c;\theta)\prod_{(w,c)\in \tilde{D}}P(D=0|w,c;\theta) \\ = arg\max_\theta \prod_{(w,c)\in D}\frac{1}{1+exp(-u_c·v_w)}\prod_{(w,c)\in \tilde{D}}[1-\frac{1}{1+exp(-u_c·v_w)}] \\ \Rightarrow arg\max_\theta \sum_{(w,c)\in D}log\sigma(u_c·v_w) + \sum_{(w,c)\in \tilde{D}}log\sigma(-u_c·v_w) \quad \text{(over all negative pairs)} \\ \Rightarrow arg\max_\theta \sum_{(w,c)\in D}[log\sigma(u_c·v_w) + \sum_{c'\in N(w)}log\sigma(-u_{c'}·v_w)] \quad \text{(for each positive pair, only a handful of sampled negatives, e.g. around 10)} \tag{4.3.1}

Example:

S=“I like NLP, it is interesting, but it is hard” vocab={I,like,NLP,it,is,interesting,but,hard}
Positive sample     Negative samples
(NLP, like)         (NLP, I), (NLP, but)
(NLP, it)           (NLP, hard), (NLP, I)
(it, is)            (it, interesting), (it, hard)
(it, NLP)           (it, hard), (it, I)
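
A minimal sketch of generating such negative pairs, assuming (hypothetically) that negatives are drawn uniformly from the vocabulary; the original word2vec actually draws them from a unigram distribution raised to the 3/4 power:

    import random

    def negative_samples(vocab, w, c, k=2):
        """Draw k negative context words for center word w, avoiding w itself and its true context c."""
        candidates = [x for x in vocab if x not in (w, c)]
        return [(w, random.choice(candidates)) for _ in range(k)]

    vocab = ["I", "like", "NLP", "it", "is", "interesting", "but", "hard"]
    print(negative_samples(vocab, "NLP", "like", k=2))   # e.g. [('NLP', 'I'), ('NLP', 'but')]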

Apply (stochastic) gradient descent to the following objective:
arg\max_\theta \sum_{(w,c)\in D}[log\sigma(u_c·v_w) + \sum_{c'\in N(w)}log\sigma(-u_{c'}·v_w)]\tag{4.3.2}

\frac{\partial \ell(\theta)}{\partial u_c}=\frac{\sigma(u_c·v_w)·[1-\sigma(u_c·v_w)]·v_w}{\sigma(u_c·v_w)} \\=[1-\sigma(u_c·v_w)]·v_w\tag{4.3.3}

\frac{\partial \ell(\theta)}{\partial u_{c'}}=\frac{\sigma(-u_{c'}·v_w)·[1-\sigma(-u_{c'}·v_w)]·(-v_w)}{\sigma(-u_{c'}·v_w)} \\=[\sigma(-u_{c'}·v_w)-1]·v_w\tag{4.3.4}

\frac{\partial \ell(\theta)}{\partial v_w}=\frac{\sigma(u_c·v_w)·[1-\sigma(u_c·v_w)]·u_c}{\sigma(u_c·v_w)} + \sum_{{c'}\in N(w)}\frac{\sigma(-u_{c'}·v_w)·[1-\sigma(-u_{c'}·v_w)]·(-u_{c'})}{\sigma(-u_{c'}·v_w)} \\ = [1-\sigma(u_c·v_w)]·u_c + \sum_{{c'}\in N(w)}[\sigma(-u_{c'}·v_w)-1]·u_{c'} \tag{4.3.5}

SGD(Skip-Gram)=\begin{cases} u_c \leftarrow u_c+\eta·\frac{\partial \ell(\theta)}{\partial u_c}\\ u_{c'} \leftarrow u_{c'}+\eta·\frac{\partial \ell(\theta)}{\partial u_{c'}} \\ v_w \leftarrow v_w+\eta·\frac{\partial \ell(\theta)}{\partial v_w} \end{cases} \tag{4.3.6}
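
A minimal numpy sketch of Eqs. (4.3.3)-(4.3.6): one stochastic gradient step for a single positive pair (w, c) and its sampled negatives N(w). U and V are the hypothetical context/center embedding matrices used above, indexed by word id:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_step(U, V, w, c, negatives, lr=0.025):
        """One gradient-ascent step on Eq. (4.3.2) for one (w, c) pair and its negatives."""
        v_w = V[w]
        grad_v = np.zeros_like(v_w)

        g = 1.0 - sigmoid(U[c] @ v_w)            # Eq. (4.3.3): coefficient (1 - sigma(u_c . v_w))
        grad_v += g * U[c]                       # first term of Eq. (4.3.5)
        U[c] += lr * g * v_w                     # update u_c   (Eq. 4.3.6)

        for cn in negatives:
            gn = sigmoid(-U[cn] @ v_w) - 1.0     # Eq. (4.3.4): coefficient (sigma(-u_c' . v_w) - 1)
            grad_v += gn * U[cn]                 # remaining terms of Eq. (4.3.5)
            U[cn] += lr * gn * v_w               # update u_c'  (Eq. 4.3.6)

        V[w] += lr * grad_v                      # update v_w   (Eq. 4.3.6)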

The C++ implementation at https://github.com/dav/word2vec is well worth reading.

SG with Negative Sampling
for each (w, c)\in D\leftarrow the set of positive pairs
    sample a set N(w) of negative context words for the center word w
           SGD=\begin{cases} u_c \leftarrow u_c+\eta·\frac{\partial \ell(\theta)}{\partial u_c} \\ u_{c'} \leftarrow u_{c'}+\eta·\frac{\partial \ell(\theta)}{\partial u_{c'}}, & {c'}\in N(w) \\ v_w \leftarrow v_w+\eta·\frac{\partial \ell(\theta)}{\partial v_w} \end{cases}
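
Putting the pieces together, a toy end-to-end loop matching the pseudocode above, reusing the hypothetical skipgram_pairs, negative_samples, and sgns_step sketches from earlier sections:

    import numpy as np

    tokens = "I like NLP , it is interesting , but it is hard".split()
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}            # word -> row index

    rng = np.random.default_rng(0)
    dim = 16
    U = rng.normal(scale=0.1, size=(len(vocab), dim))    # context (output) vectors u
    V = rng.normal(scale=0.1, size=(len(vocab), dim))    # center (input) vectors v

    for epoch in range(200):
        for w, c in skipgram_pairs(tokens, window=2):
            negs = [idx[n] for _, n in negative_samples(vocab, w, c, k=3)]
            sgns_step(U, V, idx[w], idx[c], negs, lr=0.05)

    # After training, the rows of V (or U, or their average) are the learned word vectors.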

5. Evaluating Word Vectors

  • Visualization (e.g., project the vectors to 2D with t-SNE)
  • Similarity (cosine similarity between word vectors)
  • Analogy (e.g., woman - man ≈ girl - boy; a sketch of the latter two checks follows below)
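
A minimal sketch of the similarity and analogy checks, assuming a trained center-vector matrix V and a hypothetical `idx` word-to-row mapping like the one in the toy loop above:

    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    def most_similar(V, vocab, query_vec, topn=5):
        """Rank vocabulary words by cosine similarity to a query vector."""
        sims = np.array([cosine(query_vec, V[i]) for i in range(len(vocab))])
        return [vocab[i] for i in np.argsort(-sims)[:topn]]

    # Similarity: nearest neighbours of a word
    # most_similar(V, vocab, V[idx["NLP"]])

    # Analogy (woman - man ≈ girl - boy): query with V[woman] - V[man] + V[boy], expect "girl" near the top
    # most_similar(V, vocab, V[idx["woman"]] - V[idx["man"]] + V[idx["boy"]])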

6. What are the drawbacks of skip-gram in word2vec?

  • Limited window size (remedy: language models such as RNN/LSTM)
  • No global co-occurrence information (remedy: global models such as matrix factorization (MF))
  • Low-frequency words are learned poorly (remedy: subword embeddings)
  • Out-of-vocabulary (OOV) words cannot be represented (remedy: subword embeddings)
  • Context is ignored, one vector per word (remedy: context-aware embeddings such as ELMo/BERT)
  • No real notion of word order (remedy: language models such as RNN/LSTM/Transformer)
  • Limited interpretability (alternatives: non-Euclidean embedding spaces)

7. A Taxonomy of Word Representations

(Figure: summary diagram of word-representation methods)

