Paper 48

Author: Milkmilkmilk | Published: 2018-11-01 14:49


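Preface:
Ubuntu Dialogue Corpus: a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words.

Dialog State Tracking Challenge: a task that tracks user behavior over the course of a dialogue.
benchmark: a standard reference point for comparing methods.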

Introduction:
unstructured dialogues: there is no a priori logical representation for the information exchanged during the conversation. (The dialogue has no predefined structure, which distinguishes it from slot-based approaches.)

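Neural networks have achieved good results in some fields, for three reasons:
1) large amounts of publicly distributed data are available;
2) there is sufficient computing power;
3) many different variants of neural network architectures exist.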

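Dialogue systems have not progressed as well. The hypothesis is that this may be due to the lack of sufficiently large datasets.
This dataset was extracted from Ubuntu chat logs, hence the name Ubuntu Dialogue Corpus.
Dialogues in the dataset average 8 turns, with a minimum of 3 turns.
The paper presents TF-IDF (term frequency-inverse document frequency) and two neural models, an RNN and an LSTM.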

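Comparison with other datasets:
The Switchboard dataset and the Dialogue State Tracking Challenge datasets usually treat the problem as a (structured) slot filling task, where agents attempt to predict the goal of a user during the conversation.
(Although these datasets are small for training neural networks, they are still useful for structured dialogue work.) (Presumably they consist mostly of highly structured question-and-answer exchanges.)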

The development of learning architectures.

How the dataset was built and what its characteristics are.

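Three learning methods:
TF-IDF: term frequency and inverse document frequency.
It measures how important a word is to a document. (In this setting, the document is the context.)
It is commonly used for document classification and information retrieval.
Term frequency is the number of times the word appears in the document.
Inverse document frequency is a penalty that measures how many documents the word appears in.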

$\mathrm{tfidf}(w,d,D) = f(w,d) \times \log \frac{N}{|\{ d \in D : w \in d \}|}$
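Here $f(w,d)$ is the number of times the word w appears in the context d.
N is the total number of dialogues.
The penalty is the number of dialogues in which w appears.
The larger that denominator, the smaller the log, and so the smaller the final value.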
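
To make the formula concrete, here is a minimal Python sketch of the TF-IDF score as defined above. The function name `tfidf`, the toy `dialogues` list, and the choice to return 0 for a word that appears in no dialogue are all assumptions made for illustration; they are not taken from the paper, and tokenization is assumed to have been done already.

```python
import math
from collections import Counter


def tfidf(word, context, dialogues):
    """tfidf(w, d, D) = f(w, d) * log(N / |{d in D : w in d}|).

    word      -- the word w
    context   -- the context d, as a list of tokens
    dialogues -- the collection D, as a list of token lists
    """
    # f(w, d): number of times the word occurs in the context.
    term_freq = Counter(context)[word]
    # |{d in D : w in d}|: number of dialogues that contain the word.
    n_containing = sum(1 for d in dialogues if word in d)
    if n_containing == 0:
        # Assumption: a word seen in no dialogue scores 0,
        # which also avoids a division by zero.
        return 0.0
    # N is the total number of dialogues.
    return term_freq * math.log(len(dialogues) / n_containing)


# Hypothetical toy data: four tiny tokenized "dialogues".
dialogues = [
    ["how", "do", "i", "install", "ubuntu"],
    ["install", "the", "driver", "then", "reboot"],
    ["i", "cannot", "mount", "the", "usb", "drive"],
    ["reboot", "and", "try", "again"],
]
context = dialogues[0]

score_ubuntu = tfidf("ubuntu", context, dialogues)    # appears in 1 of 4 dialogues
score_install = tfidf("install", context, dialogues)  # appears in 2 of 4 dialogues
print(score_ubuntu, score_install)
```

Both words occur once in the context, so the difference in score comes only from the penalty: "install" appears in more dialogues, so its log term is smaller and it scores lower than "ubuntu".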