This tutorial shows how to do simple data collection and storage from text documents.
Prerequisites
- String, List, Dictionary, and Tuple functions
- File I/O functions
See my other Jianshu post for details:
Python for Informatics (File & String & List & Dictionary & Tuple)
Project examples
- Read an external document, extract the confidence values, and compute their average (exercise from "Python for Informatics")
from urllib.request import urlopen

file_url = 'http://www.py4inf.com/code/mbox-short.txt'
sign = "X-DSPAM-Confidence: "
conf_list = []
for line in urlopen(file_url):
    line = str(line, 'utf-8')  # urlopen() yields bytes, so decode each line to str
    if line.startswith(sign):  # skip lines that are not confidence headers
        # take everything after the prefix and drop the trailing newline
        confidence = line[len(sign):].strip()
        print(confidence)
        conf_list.append(float(confidence))

total = 0  # named total so the built-in sum() is not shadowed
num = 0
for conf in conf_list:
    total += conf
    num += 1
print("Average spam confidence: " + str(total / num))
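The extract-and-average step above can also be written as a small function that uses the built-in sum() and len(). This is only a sketch: the function name average_confidence and the sample lines are made up for illustration, and the sample stands in for the downloaded file so the snippet runs without network access.

```python
PREFIX = "X-DSPAM-Confidence: "

def average_confidence(lines):
    """Return the mean of the values on lines starting with PREFIX, or None if there are none."""
    values = [
        float(line[len(PREFIX):].strip())
        for line in lines
        if line.startswith(PREFIX)
    ]
    if not values:
        return None  # avoid dividing by zero when no line matches
    return sum(values) / len(values)

# Made-up sample lines (hypothetical data, not taken from mbox-short.txt)
sample = [
    "From: someone@example.com\n",
    "X-DSPAM-Confidence: 0.5\n",
    "X-DSPAM-Probability: 0.0000\n",
    "X-DSPAM-Confidence: 1.0\n",
]
print(average_confidence(sample))  # 0.75
```

Returning None for an empty input is one possible choice; raising an error would work just as well, depending on how the caller wants to handle a file with no matching lines.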
- Read an external document, collect every distinct word into a list, and sort the list alphabetically (exercise from "Python for Informatics")
from urllib.request import urlopen

url = "http://www.py4inf.com/code/romeo.txt"
url_file = urlopen(url)
words = []
for line in url_file:
    line = str(line, 'utf-8')
    temp_words = line.split()
    for word in temp_words:
        if word not in words:
            words.append(word)
words.sort()
print(words)
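Checking "word not in words" against a list scans the whole list each time, which gets slow on larger texts. A possible alternative, sketched here, collects the words in a set and sorts once at the end. The function name unique_sorted_words and the sample lines are hypothetical and only illustrate the idea.

```python
def unique_sorted_words(lines):
    """Collect the distinct whitespace-separated words from lines, sorted alphabetically."""
    seen = set()
    for line in lines:
        seen.update(line.split())  # a set silently ignores duplicates
    return sorted(seen)

# Made-up sample lines standing in for the downloaded text
sample = ["the sun is the east\n", "and the moon is pale\n"]
print(unique_sorted_words(sample))
# ['and', 'east', 'is', 'moon', 'pale', 'sun', 'the']
```

Like the list version, this sort is case-sensitive, so capitalized words come before lowercase ones.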
- Find the ten most frequent words in a text (exercise from "Python for Informatics")
import string

# Table that deletes every punctuation character; remember to import string
# (in Python 3, translate() takes a single argument)
table = str.maketrans('', '', string.punctuation)
words = dict()
with open('text.txt') as fhand:
    for line in fhand:
        # str methods return new strings, so the result must be assigned back
        line = line.translate(table).lower()
        for word in line.split():
            if word not in words:
                words[word] = 1
            else:
                words[word] += 1

words_cooked = list()
for key, value in words.items():
    words_cooked.append((value, key))  # count first, so the sort orders by frequency
words_cooked.sort(reverse=True)
for count, word in words_cooked[:10]:
    print(count, word)
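The counting and ranking above can also be done with collections.Counter from the standard library, whose most_common() method returns (word, count) pairs ordered by count. The sketch below assumes the same cleanup (strip punctuation, lowercase); the function name top_words and the sample lines are made up for illustration.

```python
import string
from collections import Counter

def top_words(lines, n=10):
    """Return the n most frequent words in lines as (word, count) pairs."""
    table = str.maketrans('', '', string.punctuation)  # deletes punctuation
    counts = Counter()
    for line in lines:
        counts.update(line.translate(table).lower().split())
    return counts.most_common(n)

# Made-up sample lines standing in for text.txt
sample = ["The cat saw the dog.\n", "The dog ran!\n"]
print(top_words(sample, 2))
# [('the', 3), ('dog', 2)]
```

One difference to keep in mind: words with equal counts come back in the order they were first seen, whereas sorting (count, word) tuples in reverse breaks ties by reverse alphabetical order.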