Data:
Let's first take a look at the data.
The corpus contains 9 documents, one per line, stored in a text file named 16.LDA_test.txt.
Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey
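Program:
(1) First, import the required modules and read the file in:
from pprint import pprint
from gensim import corpora, models, similarities
f = open('16.LDA_test.txt')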
(2) Then tokenize each document (one per line) and remove the stop words:
stop_list = set('for a of the and to in'.split())
texts = [[word for word in line.strip().lower().split() if word not in stop_list] for line in f]
print('Text = ')
pprint(texts)
Output:
Text =
[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'management', 'system'],
['system', 'human', 'system', 'engineering', 'testing', 'eps'],
['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
['generation', 'random', 'binary', 'unordered', 'trees'],
['intersection', 'graph', 'paths', 'trees'],
['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
['graph', 'minors', 'survey']]
(3) Build the dictionary:
dictionary = corpora.Dictionary(texts)
print(dictionary)
V = len(dictionary)  # number of unique tokens in the dictionary
Printing the dictionary shows 35 unique tokens in total:
Dictionary(35 unique tokens: [u'minors', u'generation', u'testing', u'iv', u'engineering']...)
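(4) Compute the TF-IDF value of each word in each document:
# Using the dictionary, convert each document into bag-of-words (id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]
# Print line by line
for line in corpus:
    print(line)
After the conversion there is still one document per line, but each word is now an (id, count) pair: the id is the word's id in the dictionary, and the count is the number of times the word occurs in that document. In the fourth document, for example, id 7 ('system') has a count of 2. The output is: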
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]
[(4, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
[(6, 1), (7, 1), (9, 1), (13, 1), (14, 1)]
[(5, 1), (7, 2), (14, 1), (15, 1), (16, 1)]
[(9, 1), (10, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
[(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)]
[(25, 1), (26, 1), (27, 1), (28, 1)]
[(25, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
[(8, 1), (26, 1), (29, 1)]
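To make the (id, count) representation concrete, here is a minimal pure-Python sketch of what a dictionary plus bag-of-words conversion does. It is not gensim's implementation: the helper names build_vocab and to_bow are made up for illustration, and ids are assigned in order of first appearance, so they need not match the ids gensim printed above.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(text, vocab):
    """Turn one tokenized document into a sorted list of (id, count) pairs."""
    counts = Counter(vocab[token] for token in text)
    return sorted(counts.items())

# Two of the tokenized documents from step (2)
sample_texts = [
    ['eps', 'user', 'interface', 'management', 'system'],
    ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
]
vocab = build_vocab(sample_texts)
bows = [to_bow(text, vocab) for text in sample_texts]
for bow in bows:
    print(bow)
```

In the second sample document the pair for 'system' carries a count of 2, just like (7, 2) in the gensim output.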
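Now compute the TF-IDF value of every word in every document:
corpus_tfidf = models.TfidfModel(corpus)[corpus]
# Print line by line
print('TF-IDF:')
for c in corpus_tfidf:
    print(c)
There is still one document per line, but the count in each pair from the previous step has been replaced by the TF-IDF weight of the corresponding word id.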
TF-IDF:
[(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
[(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
[(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
[(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
[(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
[(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
[(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]
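The weights above can be checked by hand. The sketch below assumes the weighting is term frequency times log2(N / document frequency), followed by L2 normalization of each document vector; under that assumption it reproduces the printed values for this corpus. The function name tfidf_vectors is hypothetical, and the code is keyed by word rather than by id.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one {word: weight} dict per document.

    Assumed weighting: tf * log2(N / df), then L2-normalize each document.
    """
    n_docs = len(texts)
    # Document frequency: number of documents that contain each word
    df = Counter(word for text in texts for word in set(text))
    vectors = []
    for text in texts:
        tf = Counter(text)
        raw = {w: tf[w] * math.log(float(n_docs) / df[w], 2) for w in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({w: v / norm for w, v in raw.items()} if norm else {})
    return vectors

# The nine tokenized documents from step (2)
texts = [
    ['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
    ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'management', 'system'],
    ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
    ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
    ['generation', 'random', 'binary', 'unordered', 'trees'],
    ['intersection', 'graph', 'paths', 'trees'],
    ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
    ['graph', 'minors', 'survey'],
]
weights = tfidf_vectors(texts)
print(weights[0]['machine'])  # word that occurs in one document only
print(weights[0]['human'])    # word that occurs in two documents
```

Words that appear in fewer documents receive a larger weight, which is why 'machine' outweighs 'human' in the first document.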
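(5) Apply the LDA model
The first four steps prepare the feature data: here the TF-IDF vector of each document is used as the input feature for the LDA model.
Train the model:
print('\nLDA Model:')
# Set the number of topics
num_topics = 2
# Train the model
lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                      alpha='auto', eta='auto', minimum_probability=0.001)
Print the probability of each document belonging to each topic:
doc_topic = [a for a in lda[corpus_tfidf]]
print('Document-Topic:\n')
pprint(doc_topic)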
LDA Model:
Document-Topic:
[[(0, 0.25865201763870671), (1, 0.7413479823612934)],
[(0, 0.6704214035190138), (1, 0.32957859648098625)],
[(0, 0.34722886288787302), (1, 0.65277113711212698)],
[(0, 0.64268836524831052), (1, 0.35731163475168948)],
[(0, 0.67316053818546506), (1, 0.32683946181453505)],
[(0, 0.37897103968594514), (1, 0.62102896031405486)],
[(0, 0.6244681672561716), (1, 0.37553183274382845)],
[(0, 0.74840501728867792), (1, 0.25159498271132213)],
[(0, 0.65364678163446832), (1, 0.34635321836553179)]]
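A common next step is to assign each document to its most probable topic. The sketch below does this in plain Python on the first three rows of the output above (values rounded); the name dominant_topic is hypothetical. Because LDA training is randomized, a rerun may give different probabilities or swap the topic ids.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

# First three rows of the Document-Topic output above, rounded to 4 decimals
doc_topic_sample = [
    [(0, 0.2587), (1, 0.7413)],
    [(0, 0.6704), (1, 0.3296)],
    [(0, 0.3472), (1, 0.6528)],
]
labels = [dominant_topic(row)[0] for row in doc_topic_sample]
print(labels)  # [1, 0, 1]
```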
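Print the most probable words of each topic:
By default show_topic returns the 10 highest-probability words of a topic (its topn parameter). The minimum_probability=0.001 argument passed at training time filters low-probability topics out of the per-document topic distributions; it does not filter this word list.
for topic_id in range(num_topics):
    print('Topic %d' % topic_id)
    pprint(lda.show_topic(topic_id))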
Topic 0
[(u'system', 0.041635423550867606),
(u'survey', 0.040429107770606001),
(u'graph', 0.038913672197129358),
(u'minors', 0.038613604352799001),
(u'trees', 0.035093470419085344),
(u'time', 0.034314182442026844),
(u'user', 0.032712431543062859),
(u'response', 0.032562733895067024),
(u'eps', 0.032317332054789358),
(u'intersection', 0.031074066863528784)]
Topic 1
[(u'interface', 0.038423961073724748),
(u'system', 0.036616390857180062),
(u'management', 0.03585869312482335),
(u'graph', 0.034776623890248701),
(u'user', 0.03448476247382859),
(u'survey', 0.033892977987880241),
(u'eps', 0.033683486487186061),
(u'computer', 0.032741732328417393),
(u'minors', 0.031949259380969104),
(u'human', 0.03156868862825063)]
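Compute the similarity between documents:
The similarity is the cosine similarity between the documents' LDA topic distributions, which were themselves inferred from the TF-IDF vectors.
similarity = similarities.MatrixSimilarity(lda[corpus_tfidf])
print('Similarity:')
pprint(list(similarity))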
Similarity:
[array([ 0.99999994, 0.71217406, 0.98829806, 0.74671113, 0.70895636,
0.97756702, 0.76893044, 0.61318189, 0.73319417], dtype=float32),
array([ 0.71217406, 1. , 0.81092042, 0.99872446, 0.99998957,
0.8440569 , 0.99642557, 0.99123365, 0.99953747], dtype=float32),
array([ 0.98829806, 0.81092042, 1. , 0.83943164, 0.808236 ,
0.99825525, 0.85745317, 0.72650033, 0.82834125], dtype=float32),
array([ 0.74671113, 0.99872446, 0.83943164, 0.99999994, 0.99848306,
0.87005669, 0.99941987, 0.98329824, 0.99979806], dtype=float32),
array([ 0.70895636, 0.99998957, 0.808236 , 0.99848306, 1. ,
0.84159577, 0.99602884, 0.99182749, 0.99938792], dtype=float32),
array([ 0.97756702, 0.8440569 , 0.99825525, 0.87005669, 0.84159577,
0.99999994, 0.88634008, 0.76580745, 0.85997516], dtype=float32),
array([ 0.76893044, 0.99642557, 0.85745317, 0.99941987, 0.99602884,
0.88634008, 1. , 0.9765296 , 0.99853373], dtype=float32),
array([ 0.61318189, 0.99123365, 0.72650033, 0.98329824, 0.99182749,
0.76580745, 0.9765296 , 0.99999994, 0.9867571 ], dtype=float32),
array([ 0.73319417, 0.99953747, 0.82834125, 0.99979806, 0.99938792,
0.85997516, 0.99853373, 0.9867571 , 1. ], dtype=float32)]
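Individual entries of this matrix can be reproduced by hand, assuming that MatrixSimilarity computes the cosine similarity between the topic vectors. The sketch below (the function name cosine is hypothetical) does so for documents 0 and 1, using their topic distributions from the Document-Topic output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Topic distributions of documents 0 and 1 from the Document-Topic output
doc0 = [0.25865201763870671, 0.7413479823612934]
doc1 = [0.6704214035190138, 0.32957859648098625]
sim_01 = cosine(doc0, doc1)
print(sim_01)  # compare with row 0, column 1 of the matrix (0.71217406)
```

With only two topics every document is a two-dimensional vector with non-negative entries, so all similarities fall between 0 and 1 and many pairs end up very close to 1.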