First, be clear about what nltk is: a toolkit for processing natural language, not one that analyses it for you. Its job is to process raw text into data that machine-learning frameworks can consume.
example_text = "Hello Mr. Smith, how are you doing today? The weather is great and Python is awesome. The sky is pinkish-blue. You should not eat cardboard."
First we need a rule for where sentences end. If the rule were simply to break wherever a period (.) is followed by a word starting with a capital letter, then "Hello Mr. Smith" would also match that rule, and the text would be split incorrectly after "Mr.".
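To see the problem concretely, here is a quick sketch (plain Python, not nltk) that splits example_text wherever a period is followed by whitespace and a capital letter; it wrongly breaks the text right after "Mr." and misses the question mark entirely:
import re

# naive rule: break wherever a period is followed by whitespace and a capital letter
naive_sentences = re.split(r'(?<=\.)\s+(?=[A-Z])', example_text)
print(naive_sentences)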
Fortunately nltk does a good job of splitting a paragraph into sentences or into words for us; to use these tools we first import the corresponding package.
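One caveat before running the examples: the tokenizer models and corpora that nltk relies on are downloaded separately from the library itself. A one-time download along the following lines should cover everything used below (the exact resource names can differ slightly between NLTK versions):
import nltk

# punkt: sentence/word tokenizer models; stopwords: stop-word lists;
# state_union: State of the Union corpus; averaged_perceptron_tagger: POS-tagger model
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('state_union')
nltk.download('averaged_perceptron_tagger')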
from nltk.tokenize import sent_tokenize, word_tokenize
print(sent_tokenize(example_text))
The output unit is the sentence: the paragraph is split into sentences according to a set of rules.
['Hello Mr. Smith, how are you doing today?', 'The weather is great and Python is awesome.', 'The sky is pinkish-blue.', 'You should not eat cardboard.']
In a similar way the paragraph can be split into words:
print(word_tokenize(example_text))
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', 'not', 'eat', 'cardboard', '.']
for i in word_tokenize(example_text):
    print(i)
Stop words
First, let's look at what a stop word is. In English, words such as a, the and or are used with very high frequency. Chinese web pages are full of stop words too, for example 在, 里面, 也, 的, 它 and 为. Because these words are used so often that they appear on almost every page, search-engine developers simply ignore them altogether.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_text = "This is an example showing off stop word filtration."
stop_words = set(stopwords.words("english"))
print(stop_words)
{"it's", 'being', 'him', 'own', 'above', "you'll", 'yourself', 'again', 'because', 'a', 'i', 'yours', "didn't", 've', 'his', 'only', 'hasn', 'all', 'out', 'this', 'just', 'below', 'of', 'will', 'who', 'shan', 'or', 'should', 'here', 'be', 'against', 't', 'than', 'have', 'is', 'does', "wouldn't", 'hers', 'while', 'ours', 'there', 'when', 'himself', 'hadn', 'theirs', 'your', 'doing', 'before', "shouldn't", 'more', 'over', 'both', 'if', 'so', 'themselves', 'll', 'their', 'ma', 'now', 're', 'we', "won't", 'these', 'why', "she's", 'can', 'its', 'up', 'me', 'the', 'most', 'doesn', 'd', 'herself', "needn't", 'an', 'about', 'as', 'further', 'few', "haven't", 'other', 'aren', 'between', "couldn't", 'are', 'where', 'o', "doesn't", 'at', "you've", "wasn't", 'isn', 'each', "you'd", 'yourselves', 'has', 'did', 'off', 'couldn', 'y', "hasn't", 'very', 'not', "mustn't", 'my', 'then', 'myself', "don't", 'those', 'from','any', 'too', 'to', 'weren', 'am', "you're", 'them', 'down', "shan't", 'into', 'nor', 'ain', 'but', 'didn', 'mightn', 'on', 'and',"aren't", 'it', 'how', "that'll", 'wouldn', 'by', 'was', 'during', 'our', 'same', 'until', 'had', 'some', 'been', 'such', 'shouldn', 'do', 'having', "hadn't", 'that', 'mustn', 'don', 'were', 'what', 'ourselves', "mightn't", 'through', 'no', 'wasn', 'needn', 'he', "weren't", 'once', 'they', 'in', "isn't", 'won', 'after', 'you', 'itself', 'which', 'she', 'm', 'her', "should've", 'with', 'haven', 'under', 'for', 's', 'whom'}
words = word_tokenize(example_text)
filtered_sentence = []
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
print(filtered_sentence)
['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']
The same filtering can be written more compactly as a list comprehension:
stop_words = set(stopwords.words("english"))
# print(stop_words)
words = word_tokenize(example_text)
filtered_sentence = []
# for w in words:
#     if w not in stop_words:
#         filtered_sentence.append(w)
filtered_sentence = [w for w in words if w not in stop_words]
print(filtered_sentence)
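Notice that 'This' survives the filter: the stop-word list is all lowercase, so the capitalised token does not match. If that matters, compare lowercased tokens instead (a small sketch reusing the words and stop_words defined above):
# compare against the stop-word list case-insensitively
filtered_ci = [w for w in words if w.lower() not in stop_words]
print(filtered_ci)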
Stemming
When learning English we all studied verb tenses; sometimes we need to strip away the inflection and look at the underlying form, and that is exactly what a stemmer is for.
# I was taking a ride in the car.
# I was riding in the car.
In these two sentences ride appears in two different forms, but in both cases the meaning is ride.
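As a quick check on the two sentences above, here is a minimal sketch using nltk's Porter stemmer; the inflected forms are reduced to common stems:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
# inflected forms such as "riding" are reduced to their stems (e.g. "ride")
for sentence in ["I was taking a ride in the car.", "I was riding in the car."]:
    print([ps.stem(w) for w in word_tokenize(sentence)])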
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
for w in example_words:
    print(ps.stem(w))
python
python
python
python
pythonli
From the output we can see that the inflected forms have been stripped back to the stem python (note that pythonly becomes pythonli rather than python).
new_text = "It is very important to be pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w))
Run it yourself and look at the output; you will notice some oddities, for example stems such as pythonli that are not real English words.
Part-of-speech tagging
PunktSentenceTokenizer is an unsupervised, trainable sentence tokenizer. In the example below it is trained on the text of the 2006 State of the Union address, that same text is then split into sentences, and each sentence is word-tokenized and tagged with its part of speech.
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(sample_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))
process_content()
[(u'And', 'CC'), (u'so', 'RB'), (u'we', 'PRP'), (u'move', 'VBP'), (u'forward', 'RB'), (u'--', ':'), (u'optimistic', 'JJ'), (u'about', 'IN'), (u'our', 'PRP$'), (u'country', 'NN'), (u',', ','), (u'faithful', 'JJ'), (u'to', 'TO'), (u'its', 'PRP$'), (u'cause', 'NN'), (u',', ','), (u'and', 'CC'), (u'confident', 'NN'), (u'of', 'IN'), (u'the', 'DT'), (u'victories', 'NNS'), (u'to', 'TO'), (u'come', 'VB'), (u'.','.')]
The tag abbreviations have the following meanings:
CC   coordinating conjunction                    NNS  noun, plural              UH   interjection
CD   cardinal number                             NNP  proper noun, singular     VB   verb, base form
DT   determiner                                  NNPS proper noun, plural       VBD  verb, past tense
EX   existential "there"                         PDT  predeterminer             VBG  gerund or present participle
FW   foreign word                                POS  possessive ending         VBN  verb, past participle
IN   preposition or subordinating conjunction    PRP  personal pronoun          VBP  verb, non-3rd person singular present
JJ   adjective                                   PRP$ possessive pronoun        VBZ  verb, 3rd person singular present
JJR  adjective, comparative                      RB   adverb                    WDT  wh-determiner
JJS  adjective, superlative                      RBR  adverb, comparative       WP   wh-pronoun
LS   list item marker                            RBS  adverb, superlative       WP$  possessive wh-pronoun
MD   modal                                       RP   particle                  WRB  wh-adverb
NN   noun, singular                              SYM  symbol                    TO   "to"
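The custom Punkt tokenizer is only needed when you want to train sentence splitting on your own text; for a single sentence, nltk.pos_tag can be applied directly to the word-tokenized input. As a small sketch of putting the table to use, here is how to keep only the nouns (tags beginning with NN) from a tagged sentence:
import nltk
from nltk.tokenize import word_tokenize

# tag one sentence with the default English POS tagger
tagged = nltk.pos_tag(word_tokenize("The sky is pinkish-blue and the weather is great."))
# keep tokens whose tag starts with 'NN' (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)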