Several Simple Text Data Preprocessing Methods

Author: 不会停的蜗牛 | Published 2017-10-20 11:18

    Download the data:
    http://www.gutenberg.org/cache/epub/5200/pg5200.txt

    Remove the extra information at the beginning and end of the file, so that the text begins as follows:

    One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.

    and ends as follows:

    And, as if in confirmation of their new dreams and good intentions, as soon as they reached their destination Grete was the first to get up and stretch out her young body.

    Save the result as: metamorphosis_clean.txt

    Load the data:

    filename = 'metamorphosis_clean.txt'
    file = open(filename, 'rt')
    text = file.read()
    file.close()
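
    A slightly more defensive variant (a sketch of my own, not from the original post) uses a with-block and an explicit encoding, so the file is closed automatically and non-ASCII characters do not cause decode errors:

    # assumes the cleaned file sits next to the script; adjust the path if needed
    with open('metamorphosis_clean.txt', 'rt', encoding='utf-8') as f:
        text = f.read()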
    

    1. Split on whitespace:

    words = text.split()
    print(words[:100])
    
    # ['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', ...]
    

    2. Split into words with re:
    The difference from the previous method is that 'armour-like' is split into two words, 'armour' and 'like', and '"What's' becomes 'What' and 's'.

    import re
    words = re.split(r'\W+', text)
    print(words[:100])
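
    As a quick check of the claim above (a small sketch of my own, not from the original post):

    import re
    sample = 'armour-like "What\'s'
    print(re.split(r'\W+', sample))
    # every run of non-word characters acts as a delimiter, so the hyphen and apostrophe split the words:
    # ['armour', 'like', 'What', 's']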
    

    3. Split on whitespace and strip punctuation:
    string.punctuation in the string module shows which characters count as punctuation,
    str.maketrans('', '', string.punctuation) builds a translation table whose third argument lists the characters to delete,
    and translate() maps one character set onto another (here, simply deleting the punctuation),
    so 'armour-like' becomes 'armourlike' and '"What's' becomes 'Whats'.

    words = text.split()
    import string
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in words]
    print(stripped[:100])
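
    The same behaviour can be verified on the two example words (again a small sketch of my own):

    import string
    table = str.maketrans('', '', string.punctuation)
    print('armour-like'.translate(table))   # armourlike
    print('"What\'s'.translate(table))      # Whats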
    

    4. Convert everything to lowercase:
    (Uppercase is of course available via word.upper().)

    words = [word.lower() for word in words]
    print(words[:100])
    

    Install NLTK:
    calling nltk.download() opens a dialog box; select all and click download.

    import nltk
    nltk.download()
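
    Downloading all of NLTK's data takes a while; if you only need what this tutorial uses, downloading just the punkt tokenizer models and the stop word lists should be enough (a shortcut of my own, using NLTK's standard resource names):

    import nltk
    nltk.download('punkt')       # models used by sent_tokenize / word_tokenize
    nltk.download('stopwords')   # stop word lists used in step 8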
    

    5. Split into sentences:
    using sent_tokenize().

    from nltk import sent_tokenize
    sentences = sent_tokenize(text)
    print(sentences[0])
    

    6. Split into words:
    using word_tokenize;
    this time 'armour-like' stays 'armour-like', and '"What's' becomes 'What' and "'s".

    from nltk.tokenize import word_tokenize
    tokens = word_tokenize(text)
    print(tokens[:100])
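
    A quick check of the tokenizer on the two examples mentioned above (my own sketch; the exact handling of the quote mark may vary):

    from nltk.tokenize import word_tokenize
    print(word_tokenize('armour-like "What\'s'))
    # the hyphenated word survives and the contraction becomes its own token,
    # e.g. ['armour-like', '``', 'What', "'s"]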
    

    7. Filter out punctuation:
    keep only purely alphabetic tokens and drop everything else;
    as a result "armour-like" and "'s" are filtered out as well.

    from nltk.tokenize import word_tokenize
    tokens = word_tokenize(text)
    words = [word for word in tokens if word.isalpha()]
    print(words[:100])
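
    isalpha() is False for any token containing a non-letter character, which is why the hyphenated and contracted tokens above disappear (a small sketch of my own):

    print([t for t in ['armour-like', "'s", 'What'] if t.isalpha()])
    # ['What']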
    

    8. Filter out stop words, which carry little meaning on their own:
    stopwords.words('english') shows the list of such words.

    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if not w in stop_words]
    print(words[:100])
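
    One thing to note: the NLTK English stop word list is all lowercase, so it is common to lowercase the tokens before filtering, otherwise capitalized words such as "The" slip through. A combined sketch (my own, under that assumption):

    from nltk.corpus import stopwords
    stop_words = set(stopwords.words('english'))
    words = [w.lower() for w in words if w.lower() not in stop_words]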
    

    9. Reduce words to their stems:
    after running porter.stem(word), each word is replaced by its stem; for example "fishing," "fished," "fisher" become "fish".

    from nltk.tokenize import word_tokenize
    tokens = word_tokenize(text)
    
    from nltk.stem.porter import PorterStemmer
    porter = PorterStemmer()
    stemmed = [porter.stem(word) for word in tokens]
    print(stemmed[:100])
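
    A quick check on two of the example words from the description (my own sketch; note that a stem is not always a dictionary word):

    from nltk.stem.porter import PorterStemmer
    porter = PorterStemmer()
    print(porter.stem('fishing'), porter.stem('fished'))
    # fish fish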
    

    Learning resources:
    http://blog.csdn.net/lanxu_yy/article/details/29002543
    https://machinelearningmastery.com/clean-text-machine-learning-python/


    Recommended reading: a consolidated list of my past technical posts
    http://www.jianshu.com/p/28f02bb59fe5
    You may find what you are looking for there:
    [Getting started][TensorFlow][Deep learning][Reinforcement learning][Neural networks][Machine learning][Natural language processing][Chatbots]
