LDA Modeling

Author: lwyaoshen | Published: 2018-05-20 20:15

Data:

First, a look at the data.
The corpus contains 9 documents, one per line. It is stored in a text file named 16.LDA_test.txt.

Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey

Program:

(1) First, read the file:

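# Imports used by the whole script
from pprint import pprint
from gensim import corpora, models, similarities

f = open('16.LDA_test.txt')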

(2) Next, tokenize each line (one document per line) and remove the stop words:

stop_list = set('for a of the and to in'.split())
texts = [[word for word in line.strip().lower().split() if word not in stop_list] for line in f]
print('Text = ')
pprint(texts)

Output:

Text = 
[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]
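
The tokenization step can also be tried without the data file. Below is a minimal sketch, assuming the documents are held in an in-memory list. The `lines` list is an illustration and not part of the original script; its two sentences are copied from the corpus above.

```python
# Stop words taken from the original script.
stop_list = set('for a of the and to in'.split())

# Two lines copied from the corpus; the original script reads them from a file.
lines = [
    "Human machine interface for lab abc computer applications\n",
    "System and human system engineering testing of EPS\n",
]

# Lowercase each line, split on whitespace, and drop the stop words.
texts = [
    [word for word in line.strip().lower().split() if word not in stop_list]
    for line in lines
]

for tokens in texts:
    print(tokens)
```

Repeated words such as "system" are kept at this stage; they are only collapsed into counts in the bag-of-words step.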

(3) Build the dictionary:

dictionary = corpora.Dictionary(texts)
print(dictionary)

V = len(dictionary)  # size of the dictionary (number of unique tokens)

Printing the dictionary shows 35 unique tokens in total:

Dictionary(35 unique tokens: [u'minors', u'generation', u'testing', u'iv', u'engineering']...)

(4) Compute the TF-IDF values for each document:

# Using the dictionary, convert each document to bag-of-words form: (token id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]
# Print one document per line
for line in corpus:
    print(line)

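After the conversion there is still one document per line, but each word has become a pair (token id, count), where the id comes from the dictionary's (id, word) mapping. Most counts are 1; "system" occurs twice in the fourth document, hence the pair (7, 2). The output is: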

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]
[(4, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
[(6, 1), (7, 1), (9, 1), (13, 1), (14, 1)]
[(5, 1), (7, 2), (14, 1), (15, 1), (16, 1)]
[(9, 1), (10, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
[(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)]
[(25, 1), (26, 1), (27, 1), (28, 1)]
[(25, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
[(8, 1), (26, 1), (29, 1)]
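
To make the (token id, count) representation concrete, here is a minimal pure-Python sketch of the same idea. The helper `doc2bow_sketch` and the first-appearance id assignment are assumptions for illustration only; gensim assigns ids in its own way, so the ids below will not match the output above.

```python
from collections import Counter

# The third and fourth documents from the tokenized corpus.
texts = [
    ['eps', 'user', 'interface', 'management', 'system'],
    ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
]

# Assign ids in order of first appearance (an assumption for illustration;
# gensim's own id assignment may differ).
token2id = {}
for text in texts:
    for token in text:
        if token not in token2id:
            token2id[token] = len(token2id)


def doc2bow_sketch(text, token2id):
    """Hypothetical helper: return sorted (token id, count) pairs."""
    counts = Counter(token2id[token] for token in text)
    return sorted(counts.items())


corpus = [doc2bow_sketch(text, token2id) for text in texts]
for bow in corpus:
    print(bow)
```

In this sketch "system" receives id 4, so the second document contains the pair (4, 2), the counterpart of (7, 2) in the gensim output.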

Now compute the tf-idf value of every word in every document:

corpus_tfidf = models.TfidfModel(corpus)[corpus]

# Print one document per line
print('TF-IDF:')
for c in corpus_tfidf:
    print(c)

There is still one document per line, but the counts from the previous step have been replaced by the tf-idf weight of each token id.

TF-IDF:
[(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
[(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
[(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
[(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
[(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
[(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
[(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]
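
The numbers above can be reproduced by hand. The sketch below assumes gensim's default TfidfModel weighting as I understand it: raw term count times log2(N / df), followed by scaling each document vector to unit length. The helper `tfidf_sketch` is hypothetical and is not part of gensim.

```python
import math

# Tokenized corpus, copied from the output of step (2).
texts = [
    ['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
    ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'management', 'system'],
    ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
    ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
    ['generation', 'random', 'binary', 'unordered', 'trees'],
    ['intersection', 'graph', 'paths', 'trees'],
    ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
    ['graph', 'minors', 'survey'],
]
n_docs = len(texts)

# Document frequency: in how many documents does each token occur?
df = {}
for text in texts:
    for token in set(text):
        df[token] = df.get(token, 0) + 1


def tfidf_sketch(text):
    """Hypothetical helper: count * log2(N / df), then L2-normalize."""
    weights = {t: text.count(t) * math.log(n_docs / df[t], 2) for t in set(text)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


# Seventh document: "The intersection graph of paths in trees"
weights = tfidf_sketch(texts[6])
for token in sorted(weights):
    print(token, round(weights[token], 4))
```

Under these assumptions "graph" and "trees" (each in 3 of the 9 documents) come out near 0.3162, and "intersection" and "paths" (each in 1 document) near 0.6325, in line with the seventh row of the output.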

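(5) Apply the LDA model
The first four steps can be regarded as feature preparation, because here the tf-idf values of each document are the features fed into the LDA model.

Train the model: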

print('\nLDA Model:')
# Number of topics
num_topics = 2
# Train the model
lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                      alpha='auto', eta='auto', minimum_probability=0.001)

Print the probability of each document belonging to each topic:

doc_topic = [a for a in lda[corpus_tfidf]]
print('Document-Topic:\n')
pprint(doc_topic)

LDA Model:
Document-Topic:

[[(0, 0.25865201763870671), (1, 0.7413479823612934)],
 [(0, 0.6704214035190138), (1, 0.32957859648098625)],
 [(0, 0.34722886288787302), (1, 0.65277113711212698)],
 [(0, 0.64268836524831052), (1, 0.35731163475168948)],
 [(0, 0.67316053818546506), (1, 0.32683946181453505)],
 [(0, 0.37897103968594514), (1, 0.62102896031405486)],
 [(0, 0.6244681672561716), (1, 0.37553183274382845)],
 [(0, 0.74840501728867792), (1, 0.25159498271132213)],
 [(0, 0.65364678163446832), (1, 0.34635321836553179)]]
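
A common next step is to assign each document to its most probable topic. The sketch below is an illustration only: it hard-codes the probabilities printed above, rounded to four decimals, instead of calling the model, and the variable name `dominant` is my own.

```python
# Document-topic probabilities copied (rounded) from the output above.
doc_topic = [
    [(0, 0.2587), (1, 0.7413)],
    [(0, 0.6704), (1, 0.3296)],
    [(0, 0.3472), (1, 0.6528)],
    [(0, 0.6427), (1, 0.3573)],
    [(0, 0.6732), (1, 0.3268)],
    [(0, 0.3790), (1, 0.6210)],
    [(0, 0.6245), (1, 0.3755)],
    [(0, 0.7484), (1, 0.2516)],
    [(0, 0.6536), (1, 0.3464)],
]

# For each document, keep the id of the topic with the highest probability.
dominant = [max(topics, key=lambda pair: pair[1])[0] for topics in doc_topic]
print(dominant)
```

LDA training is randomized, so a fresh run will generally produce different probabilities and possibly a different assignment.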

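Print the probability of each word within each topic:
show_topic returns only the topn most probable words, 10 by default, which is why each topic lists 10 words. The minimum_probability=0.001 passed at training time does not filter words here; it drops topics whose probability falls below that threshold from the document-topic output.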

for topic_id in range(num_topics):
    print('Topic %d' % topic_id)
    pprint(lda.show_topic(topic_id))

Topic 0
[(u'system', 0.041635423550867606),
 (u'survey', 0.040429107770606001),
 (u'graph', 0.038913672197129358),
 (u'minors', 0.038613604352799001),
 (u'trees', 0.035093470419085344),
 (u'time', 0.034314182442026844),
 (u'user', 0.032712431543062859),
 (u'response', 0.032562733895067024),
 (u'eps', 0.032317332054789358),
 (u'intersection', 0.031074066863528784)]
Topic 1
[(u'interface', 0.038423961073724748),
 (u'system', 0.036616390857180062),
 (u'management', 0.03585869312482335),
 (u'graph', 0.034776623890248701),
 (u'user', 0.03448476247382859),
 (u'survey', 0.033892977987880241),
 (u'eps', 0.033683486487186061),
 (u'computer', 0.032741732328417393),
 (u'minors', 0.031949259380969104),
 (u'human', 0.03156868862825063)]
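
One way to read this output is to check how much the two top-10 lists overlap. The sketch below uses only the word lists printed above; the comparison itself is an addition for illustration and is not part of the original script.

```python
# Top-10 words per topic, copied from the output above.
topic0 = ['system', 'survey', 'graph', 'minors', 'trees',
          'time', 'user', 'response', 'eps', 'intersection']
topic1 = ['interface', 'system', 'management', 'graph', 'user',
          'survey', 'eps', 'computer', 'minors', 'human']

# Words that appear in both lists.
shared = sorted(set(topic0) & set(topic1))
print(shared)
print('%d of %d top words are shared' % (len(shared), len(topic0)))
```

A large overlap may suggest that, with only nine short documents, the two topics are not sharply separated.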

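Compute the similarity between documents:
The similarity is the cosine similarity between the documents' LDA topic vectors, which were themselves inferred from the tf-idf input.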

similarity = similarities.MatrixSimilarity(lda[corpus_tfidf])
print('Similarity:')
pprint(list(similarity))

Similarity:
[array([ 0.99999994,  0.71217406,  0.98829806,  0.74671113,  0.70895636,
        0.97756702,  0.76893044,  0.61318189,  0.73319417], dtype=float32),
 array([ 0.71217406,  1.        ,  0.81092042,  0.99872446,  0.99998957,
        0.8440569 ,  0.99642557,  0.99123365,  0.99953747], dtype=float32),
 array([ 0.98829806,  0.81092042,  1.        ,  0.83943164,  0.808236  ,
        0.99825525,  0.85745317,  0.72650033,  0.82834125], dtype=float32),
 array([ 0.74671113,  0.99872446,  0.83943164,  0.99999994,  0.99848306,
        0.87005669,  0.99941987,  0.98329824,  0.99979806], dtype=float32),
 array([ 0.70895636,  0.99998957,  0.808236  ,  0.99848306,  1.        ,
        0.84159577,  0.99602884,  0.99182749,  0.99938792], dtype=float32),
 array([ 0.97756702,  0.8440569 ,  0.99825525,  0.87005669,  0.84159577,
        0.99999994,  0.88634008,  0.76580745,  0.85997516], dtype=float32),
 array([ 0.76893044,  0.99642557,  0.85745317,  0.99941987,  0.99602884,
        0.88634008,  1.        ,  0.9765296 ,  0.99853373], dtype=float32),
 array([ 0.61318189,  0.99123365,  0.72650033,  0.98329824,  0.99182749,
        0.76580745,  0.9765296 ,  0.99999994,  0.9867571 ], dtype=float32),
 array([ 0.73319417,  0.99953747,  0.82834125,  0.99979806,  0.99938792,
        0.85997516,  0.99853373,  0.9867571 ,  1.        ], dtype=float32)]
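
Individual entries of this matrix can be checked by hand. The sketch below computes the cosine similarity between the topic vectors of the first two documents, using the probabilities printed in the Document-Topic output; the helper `cosine` is my own and not a gensim function.

```python
import math

# Topic probabilities of the first two documents, from the Document-Topic output.
doc0 = [0.25865201763870671, 0.7413479823612934]
doc1 = [0.6704214035190138, 0.32957859648098625]


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(round(cosine(doc0, doc1), 4))
```

The result is close to 0.7122, in line with the entry in row 1, column 2 of the matrix.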
