Applying the LDA Model: Seeing Through Hillary's Emails at a Glance
We take the leaked Hillary emails, run LDA over them, and see what she talks about day to day.
First, import the libraries we need:
import numpy as np
import pandas as pd
import re
Next, read in the emails.
We use pandas here; if you are not familiar with pandas, Python's standard csv library works too.
df = pd.read_csv("../input/HillaryEmails.csv")
a = df[['Id', 'ExtractedBodyText']]
print(a)
The output shows each email's Id alongside its ExtractedBodyText.
# The original email data contains many NaN values; just drop them.
df = df[['Id','ExtractedBodyText']].dropna()
Text preprocessing:
Anyone who has taken my other NLP courses knows that preprocessing matters a great deal in NLP.
Here we write a set of regular expressions tailored to the email content.
(If you are not familiar with regular expressions, searching the term on Baidu will bring up a large table of regex rules.)
def clean_email_text(text):
    text = text.replace('\n', " ")  # newlines are not needed
    text = re.sub(r"-", " ", text)  # split words joined by "-" (e.g. july-edu ==> july edu)
    text = re.sub(r"\d+/\d+/\d+", "", text)  # dates mean nothing to a topic model
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)  # times mean nothing either
    text = re.sub(r"[\w]+@[\.\w]+", "", text)  # email addresses, drop them
    text = re.sub(r"http\S+|www\.\S+", "", text)  # URLs, drop them
    pure_text = ''
    # In case other special characters (digits, etc.) remain, loop over the text and filter them out
    for letter in text:
        # keep only letters and spaces
        if letter.isalpha() or letter == ' ':
            pure_text += letter
    # Finally drop the one-letter fragments left stranded after stripping special characters,
    # so that only meaningful words remain.
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)
    return text
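As a quick sanity check, the function can be tried on a made-up line (the sample string below is illustrative, not taken from the data set):
sample = "Meeting w/ H on 01/12/2010 at 9:30 - see http://state.gov/agenda, cc: aide@state.gov"
print(clean_email_text(sample))
# expected output: "Meeting on at see cc"
# (the date, time, URL and address are stripped, and one-letter leftovers like "w" and "H" are dropped)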
OK, now we take that column and run our cleaning function over it.
(Why work on the column directly? Because a pandas Series lets us call apply on it.)
Filtering:
docs = df['ExtractedBodyText']
# print(type(docs))
docs = docs.apply(lambda s: clean_email_text(s))
docs.head(3).values
# collect the cleaned texts into doclist
doclist = docs.values
Building the LDA model:
OK, let's build the model with Gensim.
First, we have to turn the big pile of text we just produced,
[[one email string], [another email string], ...]
into the corpus format Gensim expects:
[[one, email, goes, here], [a, second, email, goes, here], [how, is, the, weather, today], ...]
Import the libraries:
from gensim import corpora, models, similarities
import gensim
Manually hard-code a stopword list, quick and dirty (a more standard alternative is sketched right after the list):
stoplist = ['very', 'ourselves', 'am', 'doesn', 'through', 'me', 'against', 'up', 'just', 'her', 'ours',
'couldn', 'because', 'is', 'isn', 'it', 'only', 'in', 'such', 'too', 'mustn', 'under', 'their',
'if', 'to', 'my', 'himself', 'after', 'why', 'while', 'can', 'each', 'itself', 'his', 'all', 'once',
'herself', 'more', 'our', 'they', 'hasn', 'on', 'ma', 'them', 'its', 'where', 'did', 'll', 'you',
'didn', 'nor', 'as', 'now', 'before', 'those', 'yours', 'from', 'who', 'was', 'm', 'been', 'will',
'into', 'same', 'how', 'some', 'of', 'out', 'with', 's', 'being', 't', 'mightn', 'she', 'again', 'be',
'by', 'shan', 'have', 'yourselves', 'needn', 'and', 'are', 'o', 'these', 'further', 'most', 'yourself',
'having', 'aren', 'here', 'he', 'were', 'but', 'this', 'myself', 'own', 'we', 'so', 'i', 'does', 'both',
'when', 'between', 'd', 'had', 'the', 'y', 'has', 'down', 'off', 'than', 'haven', 'whom', 'wouldn',
'should', 've', 'over', 'themselves', 'few', 'then', 'hadn', 'what', 'until', 'won', 'no', 'about',
'any', 'that', 'for', 'shouldn', 'don', 'do', 'there', 'doing', 'an', 'or', 'ain', 'hers', 'wasn',
'weren', 'above', 'a', 'at', 'your', 'theirs', 'below', 'other', 'not', 're', 'him', 'during', 'which','am','pm']
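If NLTK is installed, a less hand-rolled alternative is its built-in English stopword list; this is a substitute for the manual list above, not what the original walkthrough does:
import nltk
nltk.download('stopwords')                       # one-time download of the stopword corpus
from nltk.corpus import stopwords
stoplist = stopwords.words('english') + ['am', 'pm']   # keep the extra time tokens used above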
Tokenization:
For English, tokenization is simply splitting on whitespace.
Chinese tokenization is a bit more involved; look up CoreNLP, HanLP, jieba and the like.
The point of tokenization is to turn our long raw strings into small, meaningful units:
texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in doclist]
At this point, texts is a list of token lists, which is exactly the shape we need.
Building the corpus:
Using the bag-of-words approach, we give every word a numeric index and turn each original text into a long array of counts; in other words, build the dictionary and tally the word frequencies:
dictionary = corpora.Dictionary(texts)
print(str(dictionary))
corpus = [dictionary.doc2bow(text) for text in texts]
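To get a feel for the bag-of-words representation, it helps to peek at the first document's vector; each entry is a (token_id, count) pair (the slice length here is arbitrary):
print(len(dictionary))        # vocabulary size
print(corpus[0][:10])         # first ten (token_id, count) pairs of email 0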
# create a transformation, from initial model to tf-idf model
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model
# "tfidf" is treated as a read-only object that can be used to convert any vector from the old representation
# (bag-of-words integer counts) to the new representation (TfIdf real-valued weights):
# apply a transformation to a whole corpus
corpus_tfidf = tfidf[corpus]
# for doc in corpus_tfidf:
# print(doc)
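Before we can query per-document topics below, the lda model has to be trained. A minimal sketch of that step with Gensim, where the topic count of 20 is an assumption rather than a value given in the text:
# Train the LDA model on the bag-of-words corpus.
# num_topics=20 is an assumed value; adjust it to taste.
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
# lda.print_topics(num_topics=20, num_words=5)   # optional: dump the top words of every topic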
# lda.get_document_topics(bow)
num_show_topic = 5  # how many topics to show for each document
print('Topic distributions of the first 5 documents:')
doc_topics = lda.get_document_topics(corpus)  # topic distribution of every document
for i in range(5):
    topic = np.array(doc_topics[i])            # rows of (topic_id, probability)
    topic_distribute = np.array(topic[:, 1])
    # indices of the num_show_topic most probable topics for this document
    topic_idx = topic_distribute.argsort()[:-num_show_topic - 1:-1]
    print('Document %d, its top %d topics and their probabilities:' % (i, num_show_topic))
    print([(int(topic[j, 0]), topic_distribute[j]) for j in topic_idx])
num_topics = 9       # how many topics to display
num_show_term = 5    # how many words to show under each topic
for topic_id in range(num_topics):
    # the num_show_term highest-weight words of this topic, as (word, weight) pairs
    print(topic_id, lda.show_topic(topic_id, topn=num_show_term))
Homework:
Here are a few of Hillary's tweets (each block separated by a blank line is a single tweet):
To all the little girls watching...never doubt that you are valuable and powerful & deserving of every chance & opportunity in the world.
I was greeted by this heartwarming display on the corner of my street today. Thank you to all of you who did this. Happy Thanksgiving. -H
Hoping everyone has a safe & Happy Thanksgiving today, & quality time with family & friends. -H
Scripture tells us: Let us not grow weary in doing good, for in due season, we shall reap, if we do not lose heart.
Let us have faith in each other. Let us not grow weary. Let us not lose heart. For there are more seasons to come and...more work to do
We have still not shattered that highest and hardest glass ceiling. But some day, someone will
To Barack and Michelle Obama, our country owes you an enormous debt of gratitude. We thank you for your graceful, determined leadership
Our constitutional democracy demands our participation, not just every four years, but all the time
You represent the best of America, and being your candidate has been one of the greatest honors of my life
Last night I congratulated Donald Trump and offered to work with him on behalf of our country
Already voted? That's great! Now help Hillary win by signing up to make calls now
It's Election Day! Millions of Americans have cast their votes for Hillary—join them and confirm where you vote
We don’t want to shrink the vision of this country. We want to keep expanding it
We have a chance to elect a 45th president who will build on our progress, who will finish the job
I love our country, and I believe in our people, and I will never, ever quit on you. No matter what
Using the LDA model trained above, determine which topic each of these tweets belongs to.
Load the file:
with open('../input/NewFile.txt', 'r') as s:
    q = s.read()
# q = clean_email_text(q)
f = q.split('\n\n')
The rest just repeats the steps above (clean, tokenize, doc2bow, then query the model); a rough sketch follows.
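The sketch below reuses the cleaning function, stopword list and dictionary from above and asks the trained model for each tweet's topic distribution (variable names here are mine):
for tweet in f:
    words = [w for w in clean_email_text(tweet).lower().split() if w not in stoplist]
    bow = dictionary.doc2bow(words)
    # topic distribution of this tweet: a list of (topic_id, probability) pairs
    print(lda.get_document_topics(bow))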