[Python3] Data Cleaning for Beginners, Part I: Processing Text with Built-in Functions

Author: Maxmoe | Published: 2018-02-21 15:11


This tutorial shows how to extract simple data from text documents and store it in Python's built-in data structures.

Prerequisites

String, List, Dictionary, and Tuple functions
File I/O functions
For details, see my other Jianshu post:
Python for Informatics(File&String&List&Dictionary&Tuple)
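
As a quick refresher, here is a minimal sketch of the kind of built-in string, list, tuple, and dictionary operations the examples below rely on. The sample line and all variable names are made up for illustration and do not come from the original post.

```python
# Hypothetical sample line, used only for illustration.
line = "From: alice@example.com Sat Jan  5 09:14:16 2008\n"

# String methods: drop the trailing newline, test a prefix, split on whitespace.
clean = line.rstrip()
is_from = clean.startswith("From:")
parts = clean.split()  # consecutive spaces count as one separator

# List: index a single item and slice the rest.
email = parts[1]
time_fields = parts[2:]

# Tuple: unpack the two halves of the address.
user, host = tuple(email.split("@"))

# Dictionary: count occurrences, using get() to supply a default of 0.
host_counts = {}
host_counts[host] = host_counts.get(host, 0) + 1

print(is_from, user, host, host_counts)
```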

Project examples

Read an external document, extract the confidence values, and compute their average (exercise from "Python for Informatics")

from urllib.request import urlopen

file_url = 'http://www.py4inf.com/code/mbox-short.txt'
file_list = urlopen(file_url)
conf_list = []
sign = "X-DSPAM-Confidence: "

for line in file_list:
    line = str(line, 'utf-8')  # urlopen() yields bytes, so decode each line to str
    if line.startswith(sign):  # skip every line that is not a confidence header
        # The value runs from the end of the prefix to the end of the line
        confidence = line[len(sign):].strip()
        print(confidence)
        conf_list.append(float(confidence))

# Avoid naming a variable sum, which would shadow the built-in sum()
total = 0
count = 0
for conf in conf_list:
    total += conf
    count += 1

if count:  # guard against dividing by zero when no header was found
    print("Average spam confidence: " + str(total / count))
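
The same extraction can be tried without a network connection. The sketch below wraps the logic in a function and uses the built-in sum() and len() for the average. The function name average_confidence, its prefix parameter, and the sample lines are assumptions made for illustration, not part of the original exercise.

```python
def average_confidence(lines, prefix="X-DSPAM-Confidence:"):
    """Return the mean of the values on lines starting with prefix, or None."""
    values = []
    for line in lines:
        # urlopen() yields bytes, while a file opened in text mode yields str.
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        if line.startswith(prefix):
            # Everything after the prefix is the number; strip() drops
            # the surrounding whitespace and the trailing newline.
            values.append(float(line[len(prefix):].strip()))
    if not values:
        return None  # no matching line, so there is nothing to average
    return sum(values) / len(values)


# Hypothetical sample input mixing bytes and str lines.
sample = [
    b"X-DSPAM-Confidence: 0.5\n",
    "X-DSPAM-Result: Innocent\n",
    "X-DSPAM-Confidence: 1.0\n",
]
print(average_confidence(sample))  # 0.75
```

Returning None for an empty result is one possible design choice; raising an exception would work equally well if a missing header should count as an error.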

Read an external document, collect every distinct word into a list, and sort the list alphabetically (exercise from "Python for Informatics")

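from urllib.request import urlopen

url = "http://www.py4inf.com/code/romeo.txt"
url_file = urlopen(url)
words = []

for line in url_file:
    line = str(line, 'utf-8')  # decode the bytes returned by urlopen()
    temp_words = line.split()  # split the line on whitespace
    for word in temp_words:
        if word not in words:  # keep only the first occurrence of each word
            words.append(word)

words.sort()  # sorts by code point, so capitalized words come before lowercase ones
print(words)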
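
The list membership test above rescans the whole list for every word. As an alternative sketch, a set can handle the de-duplication and sorted() can produce the final list. This swaps in a different technique from the one the exercise uses; the function name unique_sorted_words and the sample lines are hypothetical.

```python
def unique_sorted_words(lines):
    """Collect each distinct word once and return them as a sorted list."""
    seen = set()
    for line in lines:
        seen.update(line.split())  # a set silently ignores duplicates
    return sorted(seen)            # sorted() returns a new list


# Hypothetical sample text, used only for illustration.
sample = ["the sun and the moon", "the Sun rises"]
print(unique_sorted_words(sample))
# ['Sun', 'and', 'moon', 'rises', 'sun', 'the']
```

Note that "Sun" and "sun" stay separate and that uppercase sorts first. Lowercasing each line before splitting would merge them, if that is the behavior you want.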

Find the ten most frequent words in a text (exercise from "Python for Informatics")

import string

words = dict()
# Translation table that deletes every punctuation character (requires import string)
table = str.maketrans('', '', string.punctuation)

with open('text.txt') as fhand:
    for line in fhand:
        # str methods return new strings, so the results must be assigned back
        line = line.translate(table)  # strip punctuation; in Python 3, translate() takes only the table
        line = line.lower()
        word_list = line.split()
        for word in word_list:
            if word not in words:
                words[word] = 1
            else:
                words[word] += 1

words_cooked = list()

# Store (count, word) tuples so that sorting orders by count first
for key, value in words.items():
    words_cooked.append((value, key))

words_cooked.sort(reverse=True)

for count, word in words_cooked[:10]:
    print(count, word)
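
For comparison, the standard library's collections.Counter can replace both the counting dictionary and the manual sort. This is a sketch of an alternative, not the approach the exercise asks for; the function name top_words and the sample sentence are made up for illustration.

```python
import string
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Delete punctuation, lowercase, then split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    words = text.translate(table).lower().split()
    return Counter(words).most_common(n)


# Hypothetical sample text, used only for illustration.
sample = "The cat saw the dog. The dog ran!"
print(top_words(sample, 2))  # [('the', 3), ('dog', 2)]
```

One difference to be aware of: most_common() lists words with equal counts in the order they were first seen, whereas sorting (count, word) tuples in reverse breaks ties in reverse alphabetical order.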

Article link: https://www.haomeiwen.com/subject/yakwtftx.html
