Simple Statistics and Visualization, Python Data Analysis - ch2.1

Author: LeeMin_Z | Published: 2018-05-15 18:59

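    1. Extracting the time zones from the file and counting them

    There are three ways to write this. pandas is the usual choice, but collections gets the job done quickly too.
    1.1 Pure Python: extract and count the time zone information
    1.2 Pure Python: a shorter version using collections.Counter
    1.3 pandas for the processing, matplotlib.pyplot for the plot

    1.1 Pure Python: extract and count the time zone information

    1. Extract the time zone field from each record in the file and collect the values in a list.
    2. Count how many times each time zone appears.
    3. Sort the counts and print the n most frequent time zones.
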
    # Uses Python3.6
    
    import json
    
    # Extract the time zones from the file (one JSON record per line)
    
    path = 'usagov_bitly_data2012-03-16-1331923249.txt'
    with open(path) as f:  # the with block closes the file when done
        records = [json.loads(line) for line in f]
    time_zones = [rec['tz'] for rec in records if 'tz' in rec]
    
    # Count how often each time zone appears
    
    def get_counts(sequence):
        counts = dict()
        for x in sequence:
            counts[x] = counts.get(x,0) + 1
        return counts
    
    counts = get_counts(time_zones)
    
    # Return the n most frequent time zones with their counts,
    # as (count, tz) pairs in ascending order of count
    
    def top_counts(count_dict, n):
        n = int(n)
        value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
        value_key_pairs.sort()
        return value_key_pairs[-n:]
    
    print(top_counts(counts,3))
    
    #output 
    [(400, 'America/Chicago'), (521, ''), (1251, 'America/New_York')]
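
The output above is in ascending order because the list is sorted and then sliced from the end. If you would rather see the most frequent entry first, here is a minimal sketch of a variant. The name top_counts_desc and the small sample list are made up for illustration and are not part of the book's code.

```python
def get_counts(sequence):
    """Count how many times each item appears in the sequence."""
    counts = {}
    for x in sequence:
        counts[x] = counts.get(x, 0) + 1
    return counts


def top_counts_desc(count_dict, n):
    """Return the n most frequent (count, key) pairs, largest count first."""
    pairs = [(count, key) for key, count in count_dict.items()]
    return sorted(pairs, reverse=True)[:n]


# Hypothetical sample data standing in for the real file
sample = ['America/New_York', '', 'America/Chicago',
          'America/New_York', '', 'America/New_York']

print(top_counts_desc(get_counts(sample), 2))
# [(3, 'America/New_York'), (2, '')]
```

Sorting the (count, key) tuples compares the counts first, so reverse=True puts the largest count at the front.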
    

    1.2 Pure Python: a shorter version using collections.Counter

    With collections.Counter the counting takes a single call, which is very convenient.

    import json
    from collections import Counter
    
    # Extract the time zones from the file (one JSON record per line)
    
    path = 'usagov_bitly_data2012-03-16-1331923249.txt'
    with open(path) as f:  # the with block closes the file when done
        records = [json.loads(line) for line in f]
    time_zones = [rec['tz'] for rec in records if 'tz' in rec]
    
    # Count how often each time zone appears
    
    counts = Counter(time_zones)
    
    # Print the 3 most frequent time zones with their counts
    
    print(counts.most_common(3))
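
To try Counter without downloading the data file, here is a small sketch on a made-up list (sample_tz is hypothetical). Note that most_common returns (key, count) pairs with the largest count first, which is the reverse of the (count, tz) pairs produced in 1.1.

```python
from collections import Counter

# Hypothetical sample data standing in for the real time_zones list
sample_tz = ['America/New_York', '', 'America/Chicago',
             'America/New_York', '', 'America/New_York']

counts = Counter(sample_tz)

# The two most frequent values, largest count first
print(counts.most_common(2))
# [('America/New_York', 3), ('', 2)]

# A key that never appeared counts as 0 instead of raising KeyError
print(counts['Asia/Tokyo'])
# 0
```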
    

    1.3 pandas for the processing, matplotlib.pyplot for the plot

    # Input, uses python 3.6
    
    import json
    import pandas as pd
    import matplotlib.pyplot as plt
    
    path = 'usagov_bitly_data2012-03-16-1331923249.txt'
    with open(path) as f:  # the with block closes the file when done
        records = [json.loads(line) for line in f]
    
    # Count how often each time zone appears
    frame = pd.DataFrame(records)
    clean_tz = frame['tz'].fillna('Missing')   # records with no tz field
    clean_tz[clean_tz == ''] = 'Unknown'       # records with an empty tz string
    tz_counts = clean_tz.value_counts()
    print(tz_counts[:10])
    
    # Plot the 10 most frequent time zones as a horizontal bar chart
    tz_counts[:10].plot(kind='barh', rot=0)
    plt.show()
    
    # Output 
    America/New_York       1251
    Unknown                 521
    America/Chicago         400
    America/Los_Angeles     382
    America/Denver          191
    Missing                 120
    Europe/London            74
    Asia/Tokyo               37
    Pacific/Honolulu         36
    Europe/Madrid            35
    Name: tz, dtype: int64
    
    [Figure: pandas-timezone.png, horizontal bar chart of the 10 most frequent time zones]
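
The difference between 'Missing' and 'Unknown' is easier to see on a tiny example. The sketch below uses made-up records (not the real data) and swaps the boolean-mask assignment for Series.replace, which I believe gives the same result here; the plotting step is left out.

```python
import pandas as pd

# Hypothetical records: one has an empty tz, one has no tz field at all
records = [
    {'tz': 'America/New_York'},
    {'tz': ''},
    {'a': 'no tz field here'},
    {'tz': 'America/New_York'},
    {'tz': 'America/Chicago'},
]

frame = pd.DataFrame(records)

# A record without a tz field becomes NaN in the column -> 'Missing'
# A record with an empty tz string stays ''             -> 'Unknown'
clean_tz = frame['tz'].fillna('Missing').replace('', 'Unknown')

tz_counts = clean_tz.value_counts()
print(tz_counts)
```

value_counts sorts by count in descending order, so America/New_York comes first with 2 and the other three labels follow with 1 each.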

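    What I learned:

    1. To pull values out and build a list, use [ ] with a simple loop and condition inside it (a list comprehension).
    2. Put code that gets reused into a function so it is easy to call.
    3. If you have not used collections before, see my summary "How to use the Python 3 collections module/library, Container datatypes".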
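
As a quick illustration of point 1, here is the list comprehension from the code above next to the equivalent explicit loop, run on a few made-up records:

```python
# Hypothetical records; the second one has no 'tz' key
records = [{'tz': 'Asia/Tokyo'}, {'a': 1}, {'tz': ''}, {'tz': 'Europe/London'}]

# List comprehension: loop and condition inside the brackets
time_zones = [rec['tz'] for rec in records if 'tz' in rec]

# The same thing written as an explicit loop
time_zones_loop = []
for rec in records:
    if 'tz' in rec:
        time_zones_loop.append(rec['tz'])

print(time_zones)
# ['Asia/Tokyo', '', 'Europe/London']
```

The condition only skips records that have no tz key. An empty string is still a value, so it stays in the list.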

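    References:

    1. Python for Data Analysis (《利用python进行数据分析》), Wes McKinney

    2. The example code is on GitHub.
      https://github.com/wesm/pydata-book
      You can download it as a zip archive and read it locally, or fetch it with git clone.
      pydata-book-2nd-edition.zip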
