NLTK Study Notes 3: Processing Raw Text

Author: hitsunbo | Published 2016-11-28 15:07

Reading raw text from the web

from urllib import request
url = "http://www.gutenberg.org/files/2554/2554.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)  #<class 'str'>
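The snippet above hard-codes 'utf8'. As a hedged sketch, the helper below prefers the charset declared in the response headers and falls back to a default. The name fetch_text and its parameters are hypothetical, not part of urllib or NLTK.

```python
from urllib import request


def fetch_text(url, default_encoding="utf-8", timeout=10):
    """Download url and decode it with the declared charset, if there is one.

    Hypothetical helper for illustration; undecodable bytes are replaced
    rather than raising an error.
    """
    with request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or default_encoding
        return response.read().decode(charset, errors="replace")


# Usage (needs network access, so it is left commented out):
# raw = fetch_text("http://www.gutenberg.org/files/2554/2554.txt")
```

The timeout keeps the call from hanging indefinitely if the server does not answer.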

Reading raw text from a local file

f = open('document.txt')
raw = f.read()

import nltk
path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
raw = open(path, 'r').read()  # plain 'r': the 'U' mode flag was removed in Python 3.11
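A slightly more defensive way to read a local file is to use a with block and name the encoding explicitly. The sketch below writes a throwaway file first so that it runs anywhere; the file name and its contents are assumptions for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "document.txt")

    # Create a small sample file so the example is self-contained.
    with open(path, "w", encoding="utf-8") as f:
        f.write("Call me Ishmael.\nSome years ago...\n")

    # Read the whole file into one string; the file is closed automatically.
    with open(path, encoding="utf-8") as f:
        raw = f.read()

    # Or iterate line by line, stripping the trailing newline from each line.
    with open(path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]

print(len(raw), lines)
```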

Reading user input

from nltk import word_tokenize
s = input("Enter some text: ")
print("You typed", len(word_tokenize(s)), "words.")

The raw text is an ordinary string, so string methods work on it directly

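raw.find("PART I")  # index of the first occurrence, or -1 if absent
raw = raw[5338:1157743]  # keep only the body; these offsets are specific to this copy of the text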
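Rather than typing the offsets by hand, the slice bounds can be computed with find and rfind. This is a minimal sketch on a made-up miniature text; the sample string and the end marker are assumptions, not the actual Gutenberg file.

```python
# Hypothetical miniature e-book: a header, the body, and a footer.
sample = (
    "Project header\n"
    "PART I\n"
    "On an exceptionally hot evening...\n"
    "End of Project footer\n"
)

start = sample.find("PART I")          # first occurrence, -1 if absent
end = sample.rfind("End of Project")   # last occurrence, -1 if absent

# A missing marker would silently produce a wrong slice, so check first.
if start == -1 or end == -1:
    raise ValueError("start or end marker not found")

body = sample[start:end]
print(body)
```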

Extracting tokens from raw text and wrapping them in a Text object

tokens = word_tokenize(raw)
type(tokens)  #<class 'list'>
text = nltk.Text(tokens)
type(text)  #<class 'nltk.text.Text'>
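To see what a token list looks like without downloading any NLTK data, here is a minimal sketch that uses a simple regular expression as a stand-in for word_tokenize. The pattern and the sample sentence are assumptions for illustration; the real tokenizer also handles contractions, abbreviations and other cases this pattern ignores.

```python
import re
from collections import Counter

raw = "The quick brown fox jumps. The lazy dog sleeps!"

# Simplified stand-in for word_tokenize: runs of word characters, or any
# single character that is neither a word character nor whitespace.
tokens = re.findall(r"\w+|[^\w\s]", raw)

counts = Counter(tokens)  # frequency of each token
vocab = sorted({t.lower() for t in tokens if t.isalpha()})  # distinct lowercased words

print(tokens)
print(counts.most_common(1))
print(vocab)
```

Once the text is a list of tokens, ordinary list and dictionary tools such as slicing, counting and set operations apply to it.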

Matching text patterns with regular expressions

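import re
import nltk
# assumption: wordlist is the lowercase part of the NLTK Words corpus (needs the 'words' data package)
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
[w for w in wordlist if re.search('ed$', w)]  # words ending in "ed"
[w for w in wordlist if re.search('^..j..t..$', w)]  # eight letters, j third, t sixth
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]  # words typed as 4653 on a phone keypad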
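The three patterns can be tried without any corpus. The sketch below runs them over a small hand-picked word list, which is an assumption standing in for the real wordlist.

```python
import re

# Hypothetical stand-in for the corpus word list.
wordlist = ["abandoned", "walked", "red", "golf", "hold",
            "abjectly", "adjuster", "email", "bed"]

# $ anchors the match at the end of the word.
ends_in_ed = [w for w in wordlist if re.search(r"ed$", w)]

# ^ and $ together force the pattern to cover the whole word;
# each . matches exactly one character.
j_then_t = [w for w in wordlist if re.search(r"^..j..t..$", w)]

# Each bracket lists the letters on one phone key: 4, 6, 5, 3.
keypad_4653 = [w for w in wordlist if re.search(r"^[ghi][mno][jlk][def]$", w)]

print(ends_in_ed)   # ['abandoned', 'walked', 'red', 'bed']
print(j_then_t)     # ['abjectly', 'adjuster']
print(keypad_4653)  # ['golf', 'hold']
```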

Operator    Behavior
.           Wildcard, matches any character
^abc        Matches some pattern abc at the start of a string
abc$        Matches some pattern abc at the end of a string
[abc]       Matches one of a set of characters
[A-Z0-9]    Matches one of a range of characters
ed|ing|s    Matches one of the specified strings (disjunction)
*           Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)
+           One or more of previous item, e.g. a+, [a-z]+
?           Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n}         Exactly n repeats where n is a non-negative integer
{n,}        At least n repeats
{,n}        No more than n repeats
{m,n}       At least m and no more than n repeats
a(b|c)+     Parentheses that indicate the scope of the operators
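A few of the operators in the table, tried on a made-up word list. This is only a sketch; the words and patterns are chosen for illustration.

```python
import re

words = ["walks", "walking", "walked", "miiiine", "mine",
         "abcbc", "a", "2016", "116"]

# Disjunction inside parentheses, anchored at the end of the word.
suffixed = [w for w in words if re.search(r"(ed|ing|s)$", w)]

# Parentheses set the scope of +: an a followed by one or more b or c.
a_then_bc = [w for w in words if re.search(r"^a(b|c)+$", w)]

# {2,} means at least two repeats of the previous item.
long_i = [w for w in words if re.search(r"^mi{2,}ne$", w)]

# A character range combined with an exact repeat count.
four_digits = [w for w in words if re.search(r"^[0-9]{4}$", w)]

print(suffixed)     # ['walks', 'walking', 'walked']
print(a_then_bc)    # ['abcbc']
print(long_i)       # ['miiiine']
print(four_digits)  # ['2016']
```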

Normalizing text

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
[porter.stem(t) for t in tokens]  # usually the better choice; Lancaster is more aggressive
[lancaster.stem(t) for t in tokens]
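To make the idea of stemming concrete, here is a toy suffix stripper. It is explicitly not the Porter or Lancaster algorithm; the function name naive_stem, the suffix list and the length check are all assumptions for illustration.

```python
def naive_stem(word):
    """Strip the first matching suffix from word.

    A toy illustration of stemming, far cruder than Porter or Lancaster.
    """
    suffixes = ("ing", "ly", "ed", "ious", "ies", "ive", "es", "s", "ment")
    for suffix in suffixes:
        # Require at least three remaining characters so that short words
        # such as "is" are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


tokens = ["walking", "walked", "dogs", "is", "lying"]
print([naive_stem(t) for t in tokens])
# ['walk', 'walk', 'dog', 'is', 'lying']
```

A real stemmer applies ordered rewrite rules with extra conditions, which is why it handles cases that this sketch leaves untouched.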

Splitting text into sentences

import pprint
import nltk
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = nltk.sent_tokenize(text)
pprint.pprint(sents[79:89])
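For comparison, a naive splitter that breaks after every ., ! or ? shows why a trained sentence tokenizer is useful. This regex approach is a simplified stand-in, not what sent_tokenize does internally, and the sample sentences are made up.

```python
import re


def naive_sent_split(text):
    """Split after ., ! or ? when followed by whitespace (toy approach)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


text = "It was a quiet evening. Syme looked up! Was anyone there? Nobody answered."
print(naive_sent_split(text))
# ['It was a quiet evening.', 'Syme looked up!', 'Was anyone there?', 'Nobody answered.']

# The rule breaks on abbreviations, because it treats every period as a boundary.
print(naive_sent_split("Dr. Smith arrived."))
# ['Dr.', 'Smith arrived.']
```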
