NLTK Text Preprocessing and Text Analysis

Author: 谦行看商业 | Published: 2019-03-22 19:47

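This article walks through text analysis with NLTK in Python. First, the overall pipeline:

raw text - tokenization - POS tagging - word-form normalization - stopword removal - special-character removal - case conversion - text analysis

I. Tokenization

We use an English description of the DBSCAN clustering algorithm as the example: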

from nltk import word_tokenize
sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density "
token_words = word_tokenize(sentence)
print(token_words)

Tokenization output:

['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'samples', 'of', 'high', 'density', 'and', 'expands', 'clusters', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contains', 'clusters', 'of', 'similar', 'density']
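
To make the splitting behavior concrete, here is a minimal sketch of a regex-based tokenizer that uses only Python's standard library. It is a simplified stand-in and not how word_tokenize is implemented; the pattern and the name simple_tokenize are assumptions made for illustration. Like the output above, it keeps hyphenated words together and emits each punctuation mark as its own token.

```python
import re

# Hypothetical pattern: a run of letters/digits, optionally joined by hyphens,
# or any single character that is neither whitespace nor alphanumeric.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize(
    "DBSCAN - Density-Based Spatial Clustering of Applications with Noise."
)
print(tokens)
```

A real tokenizer handles many more cases (contractions, abbreviations, quotes), so this sketch is only meant to show the idea.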

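II. POS Tagging

Why tag parts of speech at all? First, look at what happens if we skip tagging and normalize word forms directly on the tokens from step one.

There are two common ways to normalize word forms: stemming and lemmatization.

1. Stemming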

from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
words_stemmer = [lancaster_stemmer.stem(token_word) for token_word in token_words]
print(words_stemmer)

Output:

['dbscan', '-', 'density-based', 'spat', 'clust', 'of', 'apply', 'with', 'nois', '.', 'find', 'cor', 'sampl', 'of', 'high', 'dens', 'and', 'expand', 'clust', 'from', 'them', '.', 'good', 'for', 'dat', 'which', 'contain', 'clust', 'of', 'simil', 'dens']

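Note: stemming strips affixes to reduce a word to its stem, which often yields strings that are not real words; above, "Spatial" becomes "spat" and "Noise" becomes "nois". Such stems are of little use in ordinary text analysis, but the method suits information retrieval, where it only matters that variants of a word map to the same stem.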
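
The retrieval use case can be sketched with a toy example. The suffix stripper below is an assumption for illustration and is far cruder than the rules LancasterStemmer applies; the names toy_stem, docs, and search are hypothetical. The point is that a query matches documents containing other variants of the same word, even when the stem itself is not a word.

```python
def toy_stem(word):
    """Lowercase the word and strip one common suffix (illustrative only)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

docs = {
    1: "Finds core samples and expands clusters",
    2: "Density-Based Spatial Clustering",
}

# Inverted index: stem -> ids of the documents that contain it.
index = {}
for doc_id, text in docs.items():
    for token in text.split():
        index.setdefault(toy_stem(token), set()).add(doc_id)

def search(query):
    """Return the sorted ids of documents sharing a stem with the query."""
    return sorted(index.get(toy_stem(query), set()))

print(toy_stem("Density-Based"))  # density-bas (not a real word)
print(search("clustered"))        # [1, 2]
```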

2. Lemmatization (restoring inflected forms)

from nltk.stem import WordNetLemmatizer
wordnet_lematizer = WordNetLemmatizer()
words_lematizer = [wordnet_lematizer.lemmatize(token_word) for token_word in token_words]
print(words_lematizer)

Output:

['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expands', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contains', 'cluster', 'of', 'similar', 'density']

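Note: lemmatization restores inflected forms (past tense, third-person singular, plurals, and so on) to their base form, so unlike stemming it does not produce meaningless stems. Some words are still not restored, though: "Finds", "expands", and "contains" keep their third-person form. The reason is that wordnet_lematizer.lemmatize treats every word as a noun by default and so takes the given form to be the base form already. If we specify the verb part of speech when calling it, "contains" becomes "contain". That is why we first need POS tagging to obtain each word's part of speech (details below).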
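
The effect of the default part of speech can be mimicked with a tiny lookup table. This is only a sketch under stated assumptions: a real lemmatizer consults WordNet, while the LEMMAS table and the toy_lemmatize function here are hypothetical and cover four words.

```python
# Hypothetical miniature lemma table keyed by (surface form, part of speech).
LEMMAS = {
    ("samples", "n"): "sample",
    ("clusters", "n"): "cluster",
    ("contains", "v"): "contain",
    ("expands", "v"): "expand",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("samples"))        # sample
print(toy_lemmatize("contains"))       # contains (no noun entry, so unchanged)
print(toy_lemmatize("contains", "v"))  # contain
```

Looking up "contains" as a noun finds nothing, so the word comes back unchanged; only the verb lookup restores the base form.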

3. POS tagging

Tokenize first, then tag:

from nltk import word_tokenize,pos_tag
sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density"
token_word = word_tokenize(sentence)  # tokenize
token_words = pos_tag(token_word)     # tag each token with its part of speech
print(token_words)

Output:

[('DBSCAN', 'NNP'), ('-', ':'), ('Density-Based', 'JJ'), ('Spatial', 'NNP'), ('Clustering', 'NNP'), ('of', 'IN'), ('Applications', 'NNP'), ('with', 'IN'), ('Noise', 'NNP'), ('.', '.'), ('Finds', 'NNP'), ('core', 'NN'), ('samples', 'NNS'), ('of', 'IN'), ('high', 'JJ'), ('density', 'NN'), ('and', 'CC'), ('expands', 'VBZ'), ('clusters', 'NNS'), ('from', 'IN'), ('them', 'PRP'), ('.', '.'), ('Good', 'JJ'), ('for', 'IN'), ('data', 'NNS'), ('which', 'WDT'), ('contains', 'VBZ'), ('clusters', 'NNS'), ('of', 'IN'), ('similar', 'JJ'), ('density', 'NN')]

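Note: the second element of each tuple is the word's part-of-speech tag. For a description of every tag, run "nltk.help.upenn_tagset()" or see the documentation on POS tag descriptions.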
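
Because the tagger returns plain (word, tag) tuples, ordinary Python tools are enough to inspect them. The sketch below copies a few pairs from the output above, counts the tags, and pulls out the verbs; the variable names are chosen for illustration.

```python
from collections import Counter

# A few (word, tag) pairs copied from the tagger output above.
tagged = [
    ("core", "NN"), ("samples", "NNS"), ("of", "IN"), ("high", "JJ"),
    ("density", "NN"), ("and", "CC"), ("expands", "VBZ"), ("clusters", "NNS"),
]

# How often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# All Penn Treebank verb tags start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
verbs = [word for word, tag in tagged if tag.startswith("VB")]

print(tag_counts)
print(verbs)  # ['expands']
```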

III. Word-Form Normalization (with POS Specified)

from nltk.stem import WordNetLemmatizer
words_lematizer = []
wordnet_lematizer = WordNetLemmatizer()
for word, tag in token_words:
    if tag.startswith('NN'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='n')  # n: noun
    elif tag.startswith('VB'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='v')  # v: verb
    elif tag.startswith('JJ'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='a')  # a: adjective
    elif tag.startswith('RB'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='r')  # r: adverb (tags RB, RBR, RBS)
    else:
        word_lematizer = wordnet_lematizer.lemmatize(word)  # falls back to the default (noun)
    words_lematizer.append(word_lematizer)
print(words_lematizer)

Output:

['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expand', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contain', 'cluster', 'of', 'similar', 'density']

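Note: the inflected verbs have now been restored to their base forms: "expands" became "expand" and "contains" became "contain". "Finds" is unchanged because the tagger labeled it NNP (proper noun), so it was lemmatized as a noun.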
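
The if/elif chain above can also be factored into a small helper so that the lemmatize call appears only once. This is a sketch; the name penn_to_wordnet is an assumption, and it matches adverbs on the "RB" prefix (RB, RBR, RBS).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also what lemmatize assumes by default

for tag in ("NNS", "VBZ", "JJ", "RB", "IN"):
    print(tag, "->", penn_to_wordnet(tag))
```

With this helper the loop body reduces to wordnet_lematizer.lemmatize(word, pos=penn_to_wordnet(tag)).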

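IV. Removing Stopwords

After tokenization and word-form normalization we have the base form of each word, but the list still contains words that carry little meaning for text analysis, such as prepositions and conjunctions. These are called stopwords and need to be removed.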

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # load the list once instead of once per word
cleaned_words = [word for word in words_lematizer if word not in stop_words]
print('Original words:', words_lematizer)
print('After removing stopwords:', cleaned_words)

Output:

Original words: ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expand', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contain', 'cluster', 'of', 'similar', 'density']
After removing stopwords: ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'Applications', 'Noise', '.', 'Finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', '.', 'Good', 'data', 'contain', 'cluster', 'similar', 'density']

Note: stopwords such as "of", "for", and "and" have been removed.
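
One thing to watch: NLTK's English stopword list is lowercase, so a capitalized stopword such as "The" would survive a direct membership test. A minimal sketch of a case-insensitive filter follows; the small STOPWORDS set and the name remove_stopwords are assumptions for illustration, and the real list is much longer.

```python
# Hypothetical miniature stopword set standing in for the full list.
STOPWORDS = {"of", "for", "and", "with", "from", "them", "which", "the"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is a stopword; keep original casing."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

tokens = ["The", "clusters", "of", "similar", "density", "And", "noise"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['clusters', 'similar', 'density', 'noise']
```

Comparing on the lowercase form while returning the original token keeps the later case-conversion step independent of this one.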

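V. Removing Special Characters

Punctuation is not needed in text analysis either, so we remove it as well. Here we filter the word list against a custom list of characters to drop; specific words to exclude can go in the same list. For example, we remove "DBSCAN" along with the punctuation.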

characters = [',', '.','DBSCAN', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%','-','...','^','{','}']
words_list = [word for word in cleaned_words if word not in characters]
print(words_list)

Output:

['Density-Based', 'Spatial', 'Clustering', 'Applications', 'Noise', 'Finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', 'Good', 'data', 'contain', 'cluster', 'similar', 'density']

Note: the processed word list no longer contains special characters such as "-" and ".".
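
As an alternative to listing every character by hand, a token can be tested against string.punctuation from the standard library. This sketch (the name is_punctuation is an assumption) drops any token made up entirely of ASCII punctuation, so multi-character tokens such as "..." and "--" are covered without listing them, while hyphenated words are kept.

```python
import string

def is_punctuation(token):
    """True if the token is non-empty and consists only of ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

tokens = ["Density-Based", "-", "Noise", ".", "...", "core", "--"]
kept = [tok for tok in tokens if not is_punctuation(tok)]
print(kept)  # ['Density-Based', 'Noise', 'core']
```

Custom words to exclude, like "DBSCAN" above, would still need their own list.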

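VI. Case Conversion

To keep the same word from being counted as two different words because of differing case, we also normalize case (to lowercase here).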

words_lists = [x.lower() for x in words_list ]
print(words_lists)

Output:

['density-based', 'spatial', 'clustering', 'applications', 'noise', 'finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', 'good', 'data', 'contain', 'cluster', 'similar', 'density']
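
The reason for this step shows up as soon as the words are counted. In the short sketch below (the sample tokens are made up for illustration), the three spellings are counted separately until they are lowercased.

```python
from collections import Counter

tokens = ["Density", "density", "DENSITY", "cluster"]

raw_counts = Counter(tokens)                      # three separate entries for one word
merged_counts = Counter(t.lower() for t in tokens)  # a single "density" entry

print(raw_counts["density"])     # 1
print(merged_counts["density"])  # 3
```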

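VII. Text Analysis

After the six preprocessing steps above, we have a clean word list ready for text analysis or text mining (it can also be converted to a DataFrame first).

Word frequency (we use frequency counting as the example):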

from nltk import FreqDist
freq = FreqDist(words_lists)
for key, val in freq.items():
    print(str(key) + ':' + str(val))

Output:

density-based:1
spatial:1
clustering:1
applications:1
noise:1
finds:1
core:1
sample:1
high:1
density:2
expand:1
cluster:2
good:1
data:1
contain:1
similar:1
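
FreqDist is a subclass of collections.Counter, so the most frequent words can be read off with most_common instead of scanning the full listing. The sketch below uses Counter directly so that it runs without NLTK; the word list is copied from the case-conversion output above.

```python
from collections import Counter

words_lists = ['density-based', 'spatial', 'clustering', 'applications', 'noise',
               'finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster',
               'good', 'data', 'contain', 'cluster', 'similar', 'density']

counts = Counter(words_lists)
top_two = counts.most_common(2)  # ties keep their first-seen order
print(top_two)  # [('density', 2), ('cluster', 2)]
```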

Visualization (line plot):

freq.plot(20,cumulative=False)

[Figure: word frequency line plot]

Visualization (word cloud):

A word cloud needs the word list joined into a single string:

words = ' '.join(words_lists)
words

Output:

'density-based spatial clustering applications noise finds core sample high density expand cluster good data contain cluster similar density'

Drawing the word cloud:

from wordcloud import WordCloud
from imageio import imread
import matplotlib.pyplot as plt
pic = imread('./picture/china.jpg')  # mask image that gives the cloud its shape
wc = WordCloud(mask=pic, background_color='white', width=800, height=600)
wwc = wc.generate(words)  # build the cloud from the space-joined string
plt.figure(figsize=(10, 10))
plt.imshow(wwc)
plt.axis("off")
plt.show()

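[Figure: word frequency word cloud]

Conclusion: both the line plot and the word cloud show at a glance that "density" and "cluster" occur most often; the more frequent a word, the larger it appears in the word cloud.

Thanks!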
