Python Text Processing Notes

Author: CrossCode | Published: 2018-01-09 17:23

Reading data

import pandas as pd
df = pd.read_csv('data.csv')
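
The later steps call string methods on the description column, so missing values are worth handling right after loading. A minimal sketch, assuming the file has a `description` column; the inline CSV text is made up for illustration and stands in for data.csv.

```python
import io

import pandas as pd

# Hypothetical sample standing in for data.csv; row 2 has an empty description.
sample_csv = io.StringIO("id,description\n1,Hello world\n2,\n3,Price is 42 USD\n")

df = pd.read_csv(sample_csv)

# Empty cells are read as NaN, which is not a string and would break
# x.split() in the stop-word step, so replace them with empty strings.
df["description"] = df["description"].fillna("")

print(df["description"].tolist())
```

Filling with an empty string keeps the row count unchanged; dropping the rows with `dropna` is the other obvious choice if empty descriptions are of no use.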

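Filtering non-ASCII characters

# Assign the result back (str.replace returns a new Series); regex=True is required in pandas 2.0+
df['description'] = df['description'].str.replace(r'[^\x00-\x7F]+', '', regex=True)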
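
To check what the pattern does outside pandas, the same character class can be applied to a single string with the standard `re` module. This is only a sketch; the helper name `strip_non_ascii` is made up for illustration.

```python
import re

# Matches one or more consecutive characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def strip_non_ascii(text):
    """Return text with every run of non-ASCII characters removed."""
    return NON_ASCII.sub("", text)


# "caf\u00e9" is "cafe" with an accented e; "\u6587\u672c" is two CJK characters.
print(repr(strip_non_ascii("caf\u00e9 \u6587\u672c text")))
```

Note that the surrounding spaces are kept, so removing a whole non-ASCII word leaves two spaces in a row. Accented letters are deleted rather than converted to their base letter.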

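Filtering digits

# Raw string avoids the invalid-escape warning for \d; the result is assigned back to the column
df['description'] = df['description'].str.replace(r'\d+', '', regex=True)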
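
One detail worth knowing: in Python 3, `\d` in a str pattern matches any Unicode decimal digit, not only 0-9. The sketch below shows the difference on a single string using the standard `re` module; the helper `strip_digits` and its `ascii_only` parameter are hypothetical names for illustration.

```python
import re


def strip_digits(text, ascii_only=False):
    """Remove runs of digits; with ascii_only=True only 0-9 are removed."""
    flags = re.ASCII if ascii_only else 0
    return re.sub(r"\d+", "", text, flags=flags)


# "\uff13" is the full-width digit three, a Unicode decimal digit.
sample = "room 101, floor \uff13"
print(repr(strip_digits(sample)))
print(repr(strip_digits(sample, ascii_only=True)))
```

If the non-ASCII filter has already run on the column, the two modes behave the same, because no non-ASCII digits remain.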

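Removing stop words

from nltk.corpus import stopwords

# The corpus must be available locally, e.g. after nltk.download('stopwords')
stop = set(stopwords.words('english'))
# Keep only the words that are not in the stop-word set (set lookup is faster than a list)
df['description'] = df['description'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))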
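
The entries in NLTK's English list are lowercase, so a capitalized "The" passes through the comparison above unchanged. A minimal sketch of a case-insensitive variant follows; it uses a tiny hand-written stop-word set so that it runs without the NLTK corpus, and the names `STOP_WORDS` and `remove_stop_words` are made up for illustration.

```python
# Tiny hand-written stop-word set for illustration only;
# the notes above use NLTK's much longer English list.
STOP_WORDS = frozenset({"the", "is", "a", "of", "and"})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)


print(remove_stop_words("The cat is a friend of the dog"))
```

Tokens are split on whitespace only, so a word with attached punctuation such as "the," is not recognized as a stop word unless punctuation is stripped first.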

Extracting plain text from HTML

import re
import string


def clean_html(html):
    """
    Copied from NLTK package.
    Remove HTML markup from the given string.

    :param html: the HTML string to be cleaned
    :type html: str
    :rtype: str
    """

    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Replace non-breaking space entities and drop URLs
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"http\S+", "", cleaned)
    # Remove punctuation (Python 3 form; translate(None, ...) only works in Python 2)
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    # Finally, collapse the runs of spaces left behind by the steps above
    cleaned = re.sub(r" {2,}", " ", cleaned)
    return cleaned.strip()
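
Regular expressions can misfire on unusual markup, so a parser-based variant is sometimes easier to trust. The sketch below is an alternative, not the NLTK code: it uses `html.parser.HTMLParser` from the standard library, and the names `TextExtractor` and `html_to_text` are made up for illustration. Unlike `clean_html`, it keeps punctuation and URLs.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a script or style element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse all whitespace runs
        # (including non-breaking spaces) into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(html_to_text("<p>Hello&nbsp;<b>world</b></p><script>var x = 1;</script>"))
```

HTML comments are dropped because the base class ignores them unless `handle_comment` is overridden. As in `clean_html`, a tag in the middle of a word becomes a space.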

Article link: https://www.haomeiwen.com/subject/jidgnxtx.html
