[Python3] Data Cleaning Tutorial for Beginners, Part Ⅰ: Processing Text with Built-in Functions

Author: Maxmoe | Published 2018-02-21 15:11

    This tutorial shows how to do simple data extraction and storage on text documents.

    Prerequisites

    1. Functions for strings, lists, dictionaries, and tuples
    2. File I/O functions
      For details, see my other Jianshu post:
      Python for Informatics(File&String&List&Dictionary&Tuple)

    Example projects

    • Read an external document, extract the confidence values, and compute their average (exercise from Python for Informatics)
    from urllib.request import urlopen
    
    file_url = 'http://www.py4inf.com/code/mbox-short.txt'
    sign = "X-DSPAM-Confidence: "
    conf_list = []
    
    for line in urlopen(file_url):
        line = line.decode('utf-8')  # urlopen() yields bytes, so decode each line to str
        if line.startswith(sign):  # skip lines that are not confidence headers
            # The value is everything after the prefix; strip() drops the newline
            confidence = line[len(sign):].strip()
            print(confidence)
            conf_list.append(float(confidence))
    
    # Use the built-in sum() and len() instead of shadowing sum with a variable
    if conf_list:
        print("Average spam confidence: " + str(sum(conf_list) / len(conf_list)))
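
    The parsing step can also be tried without network access. The following is a minimal sketch, assuming a hypothetical helper named average_confidence and a few made-up sample lines in the same header format; neither is part of the original exercise.

```python
def average_confidence(lines, prefix="X-DSPAM-Confidence: "):
    """Return the mean of the numbers that follow `prefix`, or None if none are found.

    `average_confidence` is a hypothetical helper written for illustration.
    """
    values = []
    for line in lines:
        if line.startswith(prefix):
            # Everything after the prefix is the number; strip() removes the newline
            values.append(float(line[len(prefix):].strip()))
    if not values:
        return None
    return sum(values) / len(values)


# Made-up sample lines in the same format as the mbox headers
sample = [
    "From: someone@example.com\n",
    "X-DSPAM-Confidence: 0.5\n",
    "X-DSPAM-Probability: 0.0000\n",
    "X-DSPAM-Confidence: 0.75\n",
]
print(average_confidence(sample))  # prints 0.625
```

    Returning None for an empty input avoids a ZeroDivisionError when no line matches the prefix.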
    
    • Read an external document, collect every unique word in a list, and sort the list alphabetically (exercise from Python for Informatics)
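    from urllib.request import urlopen
    
    url = "http://www.py4inf.com/code/romeo.txt"
    words = []
    
    for line in urlopen(url):
        line = line.decode('utf-8')  # urlopen() yields bytes, so decode to str
        for word in line.split():  # split() breaks the line on whitespace
            if word not in words:  # keep only the first occurrence of each word
                words.append(word)
    
    words.sort()  # note: uppercase letters sort before lowercase ones
    print(words)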
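
    When the intermediate list is not required, a set can do the de-duplication in one step. The following is a minimal sketch, assuming a hypothetical function named unique_sorted_words and two in-memory sample lines in place of the downloaded file.

```python
def unique_sorted_words(lines):
    """Return the distinct whitespace-separated words in `lines`, sorted.

    `unique_sorted_words` is a hypothetical helper written for illustration.
    """
    unique = set()
    for line in lines:
        # update() adds every word of the line; duplicates are ignored by the set
        unique.update(line.split())
    return sorted(unique)


# Sample lines used only for illustration
sample = [
    "But soft what light through yonder window breaks",
    "It is the east and Juliet is the sun",
]
print(unique_sorted_words(sample))
```

    The default sort places capitalized words first; passing key=str.lower to sorted() would give a case-insensitive order instead.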
    
    
    • Count the ten most frequent words in a text (exercise from Python for Informatics)
    import string
    
    words = dict()
    # Translation table that deletes every punctuation character
    # (in Python 3, translate() takes a single argument: the table)
    table = str.maketrans('', '', string.punctuation)
    
    with open('text.txt') as fhand:
        for line in fhand:
            # str methods return new strings, so the results must be reassigned
            line = line.translate(table)
            line = line.lower()
            for word in line.split():
                words[word] = words.get(word, 0) + 1
    
    # Build (count, word) tuples so that sorting orders by count first
    words_cooked = list()
    for word, count in words.items():
        words_cooked.append((count, word))
    
    words_cooked.sort(reverse=True)
    
    for count, word in words_cooked[:10]:
        print(count, word)
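
    The standard library's collections.Counter can handle both the counting and the ranking. The following is a minimal sketch, assuming a hypothetical function named top_words and an in-memory sample string in place of text.txt.

```python
import string
from collections import Counter


def top_words(text, n=10):
    """Return the n most common words in `text` as (word, count) pairs.

    `top_words` is a hypothetical helper written for illustration.
    """
    # Table that deletes every ASCII punctuation character
    table = str.maketrans('', '', string.punctuation)
    # translate() and lower() return new strings, so chain them before splitting
    cleaned = text.translate(table).lower()
    return Counter(cleaned.split()).most_common(n)


# Sample text used only for illustration
sample = "The cat saw the dog. The dog barked!"
print(top_words(sample, 2))  # prints [('the', 3), ('dog', 2)]
```

    Note that most_common() yields (word, count) pairs, the reverse of the (count, word) tuples built by hand in the exercise.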
    
    

        Article link: https://www.haomeiwen.com/subject/yakwtftx.html