data analysis task1:Paper statis

Author: cornbig | Published 2021-01-13 13:06

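    Task description

    Task topic: paper-count statistics, i.e. counting the number of papers in each computer science area over the whole of 2019;
    Dataset: https://www.kaggle.com/Cornell-University/arxiv

    1. Environment setup: Google Colab + Kaggle dataset

    Run the following in Colab to download the arXiv dataset: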
    !pip install kaggle
    !mkdir -p ~/.kaggle
    !cp /content/kaggle.json ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json
    !kaggle config set -n path -v /content
    !kaggle datasets download -d Cornell-University/arxiv

    2. Paper statistics

    (1) Unzip the file

    import zipfile
    datapath = '/content/datasets/Cornell-University/arxiv/arxiv.zip'
    datazip = zipfile.ZipFile(datapath)
    print(datazip.namelist())
    print(datazip.filename)
    datazip.extractall()

    (2) Import packages
    import seaborn as sns 
    from bs4 import BeautifulSoup
    import re
    import requests
    import json
    import pandas as pd
    import matplotlib.pyplot as plt
    
    (3) Read the JSON data
    # read data
    data = []
    with open('/content/arxiv-metadata-oai-snapshot.json', 'r') as f:
      for line in f:
        data.append(json.loads(line))
    data = pd.DataFrame(data)
    data.shape
    
    (1796911, 14)
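Loading all 14 columns of roughly 1.8 million records can use a lot of memory in Colab. The following is a minimal sketch, not the author's method, of keeping only the fields the later steps use. The helper name `read_selected_fields` and its `max_rows` parameter are hypothetical, introduced here for illustration.

```python
import json


def read_selected_fields(path, fields, max_rows=None):
    """Read a JSON Lines file, keeping only the named fields of each record.

    Hypothetical helper: `fields` is a list of keys to keep, and `max_rows`
    optionally stops after that many lines (handy for a quick test run).
    """
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if max_rows is not None and i >= max_rows:
                break
            record = json.loads(line)
            # Missing keys become None instead of raising KeyError.
            rows.append({name: record.get(name) for name in fields})
    return rows
```

Assuming the same file path as above, the result can be passed straight to pandas, for example `pd.DataFrame(read_selected_fields('/content/arxiv-metadata-oai-snapshot.json', ['id', 'categories', 'update_date']))`.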
    
    # Inspect the data
    data.head(1)
    
    id submitter authors title comments journal-ref doi report-no categories license abstract versions update_date authors_parsed
    0704.0001 Pavel Nadolsky C. Bal'azs, E. L. Berger, P. M. Nadolsky, C.-... Calculation of prompt diphoton production cros... 37 pages, 15 figures; published version Phys.Rev.D76:013009,2007 10.1103/PhysRevD.76.013009 ANL-HEP-PR-07-12 hep-ph None A fully differential calculation in perturba... [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... 2008-11-26 [[Balázs, C., ], [Berger, E. L., ], [Nadolsky,...
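    Column descriptions
    No. Column Description
    0 id arXiv ID, which can be used to access the paper;
    1 submitter the person who submitted the paper;
    2 authors the paper's authors;
    3 title the paper's title;
    4 comments page count, figures and other notes;
    5 journal-ref information about the journal in which the paper was published;
    6 doi digital object identifier, https://www.doi.org
    7 report-no report number;
    8 categories the paper's categories or tags in the arXiv system;
    9 license the paper's license;
    10 abstract the paper's abstract;
    11 versions the paper's version history;
    12 update_date date of the most recent update;
    13 authors_parsed parsed author information;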
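    (4) Data preprocessing
    '''
    count: number of elements in the column
    unique: number of distinct values in the column
    top: the most frequent value in the column
    freq: how many times the most frequent value occurs
    '''
    # Inspect categories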
    data['categories'].describe()
    '''
    output:
    count      1796911
    unique       62055
    top       astro-ph
    freq         86914
    Name: categories, dtype: object
    '''
    

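    There are 1,796,911 records and 62,055 distinct values of the categories string (each value is a combination of one or more category codes). The most frequent value is astro-ph, which appears 86,914 times.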
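To make the four numbers concrete, here is a small sketch that computes the same statistics on a toy list with the standard library. The function name `describe_labels` and the sample values are made up for illustration. Unlike pandas, it does not skip missing values.

```python
from collections import Counter


def describe_labels(values):
    """Return count / unique / top / freq for a list of strings."""
    counts = Counter(values)
    top, freq = counts.most_common(1)[0]
    return {
        "count": len(values),   # number of elements
        "unique": len(counts),  # number of distinct values
        "top": top,             # most frequent value
        "freq": freq,           # how often the most frequent value occurs
    }


# Toy data: "cs.AI cs.LG" counts as one distinct value, just as in the real column.
sample = ["astro-ph", "hep-ph", "astro-ph", "cs.AI cs.LG"]
print(describe_labels(sample))
# {'count': 4, 'unique': 3, 'top': 'astro-ph', 'freq': 2}
```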

    # How many distinct category codes appear in this dataset
    unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])
    len(unique_categories)
    unique_categories
    
     'ao-sci',
     'astro-ph', 'astro-ph.CO', 'astro-ph.EP', 'astro-ph.GA', 'astro-ph.HE', 'astro-ph.IM', 'astro-ph.SR',
     'atom-ph',
     'bayes-an',
     'chao-dyn',
     'chem-ph',
     'cmp-lg',
     'comp-gas',
     'cond-mat', 'cond-mat.dis-nn', 'cond-mat.mes-hall', 'cond-mat.mtrl-sci', 'cond-mat.other', 'cond-mat.quant-gas',
    'cond-mat.soft', 'cond-mat.stat-mech', 'cond-mat.str-el', 'cond-mat.supr-con',
     'cs.AI', 'cs.AR','cs.CC', 'cs.CE', 'cs.CG', 'cs.CL','cs.CR', 'cs.CV', 'cs.CY', 'cs.DB','cs.DC', 'cs.DL','cs.DM','cs.DS','cs.ET', 'cs.FL', 'cs.GL','cs.GR','cs.GT','cs.HC', 'cs.IR', 'cs.IT', 'cs.LG', 'cs.LO', 'cs.MA', 'cs.MM', 'cs.MS', 'cs.NA', 'cs.NE', 'cs.NI', 'cs.OH', 'cs.OS','cs.PF', 'cs.PL','cs.RO','cs.SC','cs.SD','cs.SE','cs.SI', 'cs.SY',
     'dg-ga',
     'econ.EM','econ.GN', 'econ.TH',
     'eess.AS', 'eess.IV', 'eess.SP', 'eess.SY',
     'funct-an',
     'gr-qc',
     'hep-ex',
     'hep-lat',
     'hep-ph',
     'hep-th',
     'math-ph',
     'math.AC', 'math.AG', 'math.AP', 'math.AT', 'math.CA', 'math.CO', 'math.CT', 'math.CV', 'math.DG','math.DS', 'math.FA', 'math.GM', 'math.GN', 'math.GR', 'math.GT', 'math.HO', 'math.IT', 'math.KT', 'math.LO', 'math.MG', 'math.MP', 'math.NA', 'math.NT', 'math.OA', 'math.OC', 'math.PR', 'math.QA', 'math.RA', 'math.RT', 'math.SG', 'math.SP', 'math.ST',
     'mtrl-th',
     'nlin.AO','nlin.CD', 'nlin.CG', 'nlin.PS', 'nlin.SI',
     'nucl-ex','nucl-th',
     'patt-sol',
     'physics.acc-ph', 'physics.ao-ph', 'physics.app-ph', 'physics.atm-clus', 'physics.atom-ph', 'physics.bio-ph', 'physics.chem-ph', 'physics.class-ph', 'physics.comp-ph', 'physics.data-an', 'physics.ed-ph', 'physics.flu-dyn', 'physics.gen-ph', 'physics.geo-ph', 'physics.hist-ph', 'physics.ins-det', 'physics.med-ph', 'physics.optics', 'physics.plasm-ph', 'physics.pop-ph', 'physics.soc-ph','physics.space-ph',
     'plasm-ph',
     'q-alg',
     'q-bio','q-bio.BM','q-bio.CB', 'q-bio.GN','q-bio.MN', 'q-bio.NC','q-bio.OT','q-bio.PE', 'q-bio.QM','q-bio.SC', 'q-bio.TO',
    'q-fin.CP','q-fin.EC','q-fin.GN','q-fin.MF','q-fin.PM', 'q-fin.PR', 'q-fin.RM', 'q-fin.ST', 'q-fin.TR',
     'quant-ph',
     'solv-int',
    'stat.AP', 'stat.CO','stat.ME','stat.ML','stat.OT', 'stat.TH',
     'supr-con'}
    
    print(len(unique_categories))
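The nested list comprehension above can be hard to read. An equivalent set comprehension, shown here as a sketch on made-up sample strings, splits each categories string and collects the individual codes in one pass. The function name `unique_category_codes` is hypothetical.

```python
def unique_category_codes(category_strings):
    """Collect the distinct codes from space-separated category strings."""
    return {code for s in category_strings for code in s.split()}


# Toy input in the same format as data["categories"].
sample = ["hep-ph", "math.CO cs.CG", "cs.CG cs.AI"]
codes = unique_category_codes(sample)
print(sorted(codes))
# ['cs.AI', 'cs.CG', 'hep-ph', 'math.CO']
```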
    
    # Restrict the analysis to papers from 2019 onward
    data['year'] = pd.to_datetime(data["update_date"]).dt.year # parse update_date from str to datetime and extract the year
    del data["update_date"]
    data = data[data["year"] >= 2019]
    data.reset_index(drop = True, inplace = True) # renumber the index
    data
    

    395123 rows × 14 columns

    # Computer science papers from 2019 onward: fetch the arXiv category taxonomy
    website_url = requests.get('https://arxiv.org/category_taxonomy').text # fetch the page's HTML text
    soup = BeautifulSoup(website_url, 'lxml') # parse the fetched page with the lxml parser, which is fast
    print(website_url)
    root = soup.find('div',{'id':'category_taxonomy_list'})
    tags = root.find_all(["h2","h3","h4","p"],recursive = True) # collect the heading and paragraph tags
    print(tags)
    
    # Initialize the str and list variables
    level_1_name = ""
    level_2_name = ""
    level_2_code = ""
    level_1_names = []
    level_2_codes = []
    level_2_names = []
    level_3_codes = []
    level_3_names = []
    level_3_notes = []
    
    # Walk through the tags and fill the lists
    for t in tags:
      if t.name == "h2":
        level_1_name = t.text
        level_2_code = t.text
        level_2_name = t.text
      elif t.name == "h3":
        raw = t.text
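        # Regex notes: '.' matches any single character; '*' matches the preceding element zero or more times.
        # "\(" matches a literal "(".
        # The first (.*) captures the text before the parentheses; \((.*)\) captures the text inside them.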
        level_2_code = re.sub(r"(.*)\((.*)\)",r"\2",raw) # 括号里的文本
        level_2_name = re.sub(r"(.*)\((.*)\)",r"\1",raw) # 括号前的文本
      elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)",r"\1", raw)
        level_3_name = re.sub(r"(.*) \((.*)\)",r"\2", raw)
      elif t.name == "p":
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)
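The regular expressions in the loop can be tried on their own. The sketch below assumes heading texts of the form "name (code)" and "code (name)"; the sample strings are illustrative rather than copied from the page, and the helper `split_heading` is hypothetical. It also shows that the h3 pattern, which has no space before `\(`, leaves a trailing space in the captured name.

```python
import re

# Same idea as the h4 pattern in the loop: text, a space, then "(...)".
HEADING = re.compile(r"(.*) \((.*)\)")


def split_heading(raw):
    """Return (text before the parentheses, text inside them)."""
    m = HEADING.fullmatch(raw.strip())
    if m is None:
        # No parenthesized part, e.g. a plain group heading.
        return raw.strip(), ""
    return m.group(1), m.group(2)


print(split_heading("Astrophysics (astro-ph)"))
print(split_heading("cs.AI (Artificial Intelligence)"))

# The h3 pattern from the loop keeps the space before "(" inside group 1.
name_with_space = re.sub(r"(.*)\((.*)\)", r"\1", "Astrophysics (astro-ph)")
print(repr(name_with_space))
```

If the trailing space matters, for example when filtering on `archive_name`, calling `.strip()` on the captured name is one way to remove it.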
    
    Build a DataFrame from the lists above
    df_taxonomy = pd.DataFrame({
      'group_name':level_1_names,
      'archive_name':level_2_names,
      'archive_id':level_2_codes,
      'category_name':level_3_names,
      'categories':level_3_codes,
      'category_description':level_3_notes
    })
    df_taxonomy.groupby(["group_name", "archive_name"])
    df_taxonomy
    
    No. group_name archive_name archive_id category_name categories category_description
    0 Computer Science Computer Science Computer Science Artificial Intelligence cs.AI Covers all areas of AI except Vision, Robotics...
    1 Computer Science Computer Science Computer Science Hardware Architecture cs.AR Covers systems organization and hardware archi...
    2 Computer Science Computer Science Computer Science Computational Complexity cs.CC Covers models of computation, complexity class...
    3 Computer Science Computer Science Computer Science Computational Engineering, Finance, and Science cs.CE Covers applications of computer science to the...
    4 Computer Science Computer Science Computer Science Computational Geometry cs.CG Roughly includes material in ACM Subject Class...
    ... ... ... ... ... ... ...
    153 Statistics Statistics Statistics Other Statistics stat.OT Work in statistics that does not fit into the ...
    154 Statistics Statistics Statistics Statistics Theory stat.TH stat.TH is an alias for math.ST. Asymptotics, ...

    155 rows × 6 columns

    Data visualization
    _df = (
        data.merge(df_taxonomy, on="categories", how="left")
        .drop_duplicates(["id","group_name"])
        .groupby("group_name")
        .agg({"id":"count"})
        .sort_values(by="id",ascending=False)
        .reset_index()
    )
    _df
    # Visualize the result as a pie chart
    fig = plt.figure(figsize = (15,12))
    # explode: how far each wedge is offset from the center
    explode = (0,0,0,0.2,0.3,0.3,0.2,0.1)
    plt.pie(_df["id"], labels = _df["group_name"], autopct="%1.2f%%", startangle = 160, explode=explode)
    plt.tight_layout()
    plt.show()
    
    Figure: share of papers in each discipline

    Paper counts for 2019 and 2020

    group_name="Computer Science"
    cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
    cats.groupby(["year","category_name"]).count().reset_index().pivot(index="category_name", columns="year",values="id") 
    
    category_name 2019 2020
    Artificial Intelligence 558 757
    Computation and Language 2153 2906
    Computational Complexity 131 188
    Computational Engineering, Finance, and Science 108 205
    Computational Geometry 199 216
    Computer Science and Game Theory 281 323
    Computer Vision and Pattern Recognition 5559 6517
    Computers and Society 346 564
    Cryptography and Security 1067 1238
    Data Structures and Algorithms 711 902
    Databases 282 342
    Digital Libraries 125 157
    Discrete Mathematics 84 81
    Distributed, Parallel, and Cluster Computing 715 774
    Emerging Technologies 101 84
    Formal Languages and Automata Theory 152 137
    General Literature 5 5
    Graphics 116 151
    Hardware Architecture 95 159
    Human-Computer Interaction 420 580
    Information Retrieval 245 331
    Logic in Computer Science 470 504
    Machine Learning 177 538
    Mathematical Software 27 45
    Multiagent Systems 85 90
    Multimedia 76 66
    Networking and Internet Architecture 864 783
    Neural and Evolutionary Computing 235 279
    Numerical Analysis 40 11
    Operating Systems 36 33
    Other Computer Science 67 69
    Performance 45 51
    Programming Languages 268 294
    Robotics 917 1298
    Social and Information Networks 202 325
    Software Engineering 659 804
    Sound 7 4
    Symbolic Computation 44 36
    Systems and Control 415 133
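Note that the merge on `categories` compares the whole categories string with a single taxonomy code, so a paper tagged `cs.AI cs.LG` matches nothing and only single-category papers are counted in the table above. The following is a sketch of one alternative, not the author's method: split the string and use `explode` so that each paper is counted once in every category it lists. The toy frames `papers` and `taxonomy` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for `data` and `df_taxonomy`.
papers = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "year": [2019, 2019, 2020],
    "categories": ["cs.AI", "cs.AI cs.LG", "cs.LG stat.ML"],
})
taxonomy = pd.DataFrame({
    "categories": ["cs.AI", "cs.LG", "stat.ML"],
    "group_name": ["Computer Science", "Computer Science", "Statistics"],
    "category_name": ["Artificial Intelligence", "Machine Learning", "Machine Learning"],
})

# Exact-match merge, as in the walkthrough: only p1 survives.
exact = papers.merge(taxonomy, on="categories")

# One row per (paper, category code) instead of one row per paper.
exploded = papers.assign(categories=papers["categories"].str.split()).explode("categories")
merged = exploded.merge(taxonomy, on="categories", how="inner")

cs_counts = (
    merged[merged["group_name"] == "Computer Science"]
    .groupby(["category_name", "year"])["id"]
    .nunique()                 # count each paper once per category and year
    .unstack(fill_value=0)     # years become columns
)
print(cs_counts)
```

With this approach the per-category counts no longer sum to the number of papers, because a paper with several categories contributes to each of them.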
