Applying the LDA Model: Seeing Through Hillary's Emails at a Glance
We take the leaked Hillary emails, run LDA over them, and see what she talks about day to day.
First, import the libraries we need:
import numpy as np
import pandas as pd
import re
Next, read in the emails.
We use pandas here; if you are not familiar with pandas, Python's standard csv library works too.
df = pd.read_csv("../input/HillaryEmails.csv")
a = df[['Id', 'ExtractedBodyText']]
print(a)
The output shows each email's Id alongside its ExtractedBodyText.
# The original email data contains many NaN values; just drop them.
df = df[['Id','ExtractedBodyText']].dropna()
Text preprocessing:
Anyone who has taken my other NLP courses knows that preprocessing matters a great deal in NLP.
Here we write a set of regular expressions tailored to the email content.
(If you are not familiar with regular expressions, searching the term on Baidu will bring up a large table of regex rules.)
def clean_email_text(text):
    text = text.replace('\n', " ")  # newlines are not needed
    text = re.sub(r"-", " ", text)  # split words joined by "-" (e.g. july-edu ==> july edu)
    text = re.sub(r"\d+/\d+/\d+", "", text)  # dates mean nothing to a topic model
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)  # times mean nothing either
    text = re.sub(r"[\w]+@[\.\w]+", "", text)  # email addresses, drop them
    text = re.sub(r"http\S+|www\.\S+", "", text)  # URLs, drop them
    pure_text = ''
    # In case other special characters (digits, etc.) remain, loop over the text and filter them out
    for letter in text:
        # keep only letters and spaces
        if letter.isalpha() or letter == ' ':
            pure_text += letter
    # Finally drop the one-letter fragments left stranded after stripping special characters,
    # so that only meaningful words remain.
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)
    return text
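As a quick sanity check, the function can be tried on a made-up line (the sample string below is illustrative, not taken from the data set):
sample = "Meeting w/ H on 01/12/2010 at 9:30 - see http://state.gov/agenda, cc: aide@state.gov"
print(clean_email_text(sample))
# expected output: "Meeting on at see cc"
# (the date, time, URL and address are stripped, and one-letter leftovers like "w" and "H" are dropped)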
OK, now we take that column and run our cleaning function over it.
(Why work on the column directly? Because a pandas Series lets us call apply on it.)
Filtering:
docs = df['ExtractedBodyText']
# print(type(docs))
docs = docs.apply(lambda s: clean_email_text(s))
docs.head(3).values
# collect the cleaned texts into doclist
doclist = docs.values
Building the LDA model:
OK, let's build the model with Gensim.
First, we have to turn the big pile of text we just produced,
[[one email string], [another email string], ...]
into the corpus format Gensim expects:
[[one, email, goes, here], [a, second, email, goes, here], [how, is, the, weather, today], ...]
Import the libraries:
from gensim import corpora, models, similarities
import gensim
Manually hard-code a stopword list, quick and dirty (a more standard alternative is sketched right after the list):
stoplist = ['very', 'ourselves', 'am', 'doesn', 'through', 'me', 'against', 'up', 'just', 'her', 'ours',
'couldn', 'because', 'is', 'isn', 'it', 'only', 'in', 'such', 'too', 'mustn', 'under', 'their',
'if', 'to', 'my', 'himself', 'after', 'why', 'while', 'can', 'each', 'itself', 'his', 'all', 'once',
'herself', 'more', 'our', 'they', 'hasn', 'on', 'ma', 'them', 'its', 'where', 'did', 'll', 'you',
'didn', 'nor', 'as', 'now', 'before', 'those', 'yours', 'from', 'who', 'was', 'm', 'been', 'will',
'into', 'same', 'how', 'some', 'of', 'out', 'with', 's', 'being', 't', 'mightn', 'she', 'again', 'be',
'by', 'shan', 'have', 'yourselves', 'needn', 'and', 'are', 'o', 'these', 'further', 'most', 'yourself',
'having', 'aren', 'here', 'he', 'were', 'but', 'this', 'myself', 'own', 'we', 'so', 'i', 'does', 'both',
'when', 'between', 'd', 'had', 'the', 'y', 'has', 'down', 'off', 'than', 'haven', 'whom', 'wouldn',
'should', 've', 'over', 'themselves', 'few', 'then', 'hadn', 'what', 'until', 'won', 'no', 'about',
'any', 'that', 'for', 'shouldn', 'don', 'do', 'there', 'doing', 'an', 'or', 'ain', 'hers', 'wasn',
'weren', 'above', 'a', 'at', 'your', 'theirs', 'below', 'other', 'not', 're', 'him', 'during', 'which','am','pm']
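If NLTK is installed, a less hand-rolled alternative is its built-in English stopword list; this is a substitute for the manual list above, not what the original walkthrough does:
import nltk
nltk.download('stopwords')                       # one-time download of the stopword corpus
from nltk.corpus import stopwords
stoplist = stopwords.words('english') + ['am', 'pm']   # keep the extra time tokens used above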
Tokenization:
For English, tokenization is simply splitting on whitespace.
Chinese tokenization is a bit more involved; look up CoreNLP, HanLP, jieba and the like.
The point of tokenization is to turn our long raw strings into small, meaningful units:
texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in doclist]
At this point, texts is a list of token lists, which is exactly the shape we need.
Building the corpus:
Using the bag-of-words approach, we give every word a numeric index and turn each original text into a long array of counts; in other words, build the dictionary and tally the word frequencies:
dictionary = corpora.Dictionary(texts)
print(str(dictionary))
corpus = [dictionary.doc2bow(text) for text in texts]
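To get a feel for the bag-of-words representation, it helps to peek at the first document's vector; each entry is a (token_id, count) pair (the slice length here is arbitrary):
print(len(dictionary))        # vocabulary size
print(corpus[0][:10])         # first ten (token_id, count) pairs of email 0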
# create a transformation, from initial model to tf-idf model
tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model
# "tfidf" is treated as a read-only object that can be used to convert any vector from the old representation
# (bag-of-words integer counts) to the new representation (TfIdf real-valued weights):
# apply a transformation to a whole corpus
corpus_tfidf = tfidf[corpus]
# for doc in corpus_tfidf:
# print(doc)
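Before we can query per-document topics below, the lda model has to be trained. A minimal sketch of that step with Gensim, where the topic count of 20 is an assumption rather than a value given in the text:
# Train the LDA model on the bag-of-words corpus.
# num_topics=20 is an assumed value; adjust it to taste.
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
# lda.print_topics(num_topics=20, num_words=5)   # optional: dump the top words of every topic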
# lda.get_document_topics(bow)
num_show_topic = 5  # how many topics to show for each document
print('Topic distributions of the first 5 documents:')
doc_topics = lda.get_document_topics(corpus)  # topic distribution of every document
for i in range(5):
    topic = np.array(doc_topics[i])            # rows of (topic_id, probability)
    topic_distribute = np.array(topic[:, 1])
    # indices of the num_show_topic most probable topics for this document
    topic_idx = topic_distribute.argsort()[:-num_show_topic - 1:-1]
    print('Document %d, its top %d topics and their probabilities:' % (i, num_show_topic))
    print([(int(topic[j, 0]), topic_distribute[j]) for j in topic_idx])
num_topics = 9       # how many topics to display
num_show_term = 5    # how many words to show under each topic
for topic_id in range(num_topics):
    # the num_show_term highest-weight words of this topic, as (word, weight) pairs
    print(topic_id, lda.show_topic(topic_id, topn=num_show_term))
Homework:
Here are a few of Hillary's tweets (each block separated by a blank line is a single tweet):
To all the little girls watching...never doubt that you are valuable and powerful & deserving of every chance & opportunity in the world.
I was greeted by this heartwarming display on the corner of my street today. Thank you to all of you who did this. Happy Thanksgiving. -H
Hoping everyone has a safe & Happy Thanksgiving today, & quality time with family & friends. -H
Scripture tells us: Let us not grow weary in doing good, for in due season, we shall reap, if we do not lose heart.
Let us have faith in each other. Let us not grow weary. Let us not lose heart. For there are more seasons to come and...more work to do
We have still not shattered that highest and hardest glass ceiling. But some day, someone will
To Barack and Michelle Obama, our country owes you an enormous debt of gratitude. We thank you for your graceful, determined leadership
Our constitutional democracy demands our participation, not just every four years, but all the time
You represent the best of America, and being your candidate has been one of the greatest honors of my life
Last night I congratulated Donald Trump and offered to work with him on behalf of our country
Already voted? That's great! Now help Hillary win by signing up to make calls now
It's Election Day! Millions of Americans have cast their votes for Hillary—join them and confirm where you vote
We don’t want to shrink the vision of this country. We want to keep expanding it
We have a chance to elect a 45th president who will build on our progress, who will finish the job
I love our country, and I believe in our people, and I will never, ever quit on you. No matter what
Using the LDA model trained above, determine which topic each of these tweets belongs to.
Load the file:
with open('../input/NewFile.txt', 'r') as s:
    q = s.read()
# q = clean_email_text(q)
f = q.split('\n\n')
The rest just repeats the steps above (clean, tokenize, doc2bow, then query the model); a rough sketch follows.
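The sketch below reuses the cleaning function, stopword list and dictionary from above and asks the trained model for each tweet's topic distribution (variable names here are mine):
for tweet in f:
    words = [w for w in clean_email_text(tweet).lower().split() if w not in stoplist]
    bow = dictionary.doc2bow(words)
    # topic distribution of this tweet: a list of (topic_id, probability) pairs
    print(lda.get_document_topics(bow))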