Author: Viterbi | Published 2022-11-20 21:51

    32 Pandas: Reading HTML Web-Page Tables with a Python Crawler and Saving Them to an Excel File

    Goal:

    • NetEase Youdao Dictionary supports English word lookup, and looked-up words can be added to its wordbook;
    • There is currently no way to export the complete word list. To make review easier, we can crawl every page of the list and save it to an Excel file.

    Technologies involved:

    • Pandas: Python's powerful data processing and analysis library
    • Python web scraping: downloading and parsing pages with the requests library; the login wall is handled by reusing browser cookies
    import requests
    import requests.cookies
    import json
    import time
    import pandas as pd
    

    0. Workflow

    1. Log in to the PC version of NetEase Youdao Dictionary (WeChat QR-code login), then copy the cookies to a file

    • PC version: http://dict.youdao.com/
    • The EditThisCookie Chrome extension can export cookies as JSON: http://www.editthiscookie.com/
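
    For reference, the loader below expects cookie.txt to hold the JSON array that EditThisCookie exports; only the name, value, domain, and path fields are used. A made-up example of the shape (the values are placeholders, not real cookies):

    [
      {"name": "DICT_SESS", "value": "<session-token>", "domain": ".youdao.com", "path": "/"},
      {"name": "DICT_LOGIN", "value": "<login-token>", "domain": ".youdao.com", "path": "/"}
    ]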
    # Rebuild a requests cookie jar from the JSON exported by the browser
    cookie_jar = requests.cookies.RequestsCookieJar()
    
    with open("./course_datas/c32_read_html/cookie.txt") as fin:
        cookiejson = json.loads(fin.read())
        for cookie in cookiejson:
            cookie_jar.set(
                name=cookie["name"],
                value=cookie["value"],
                domain=cookie["domain"],
                path=cookie["path"]
            )
    
    cookie_jar
    
    
    
    
        <RequestsCookieJar[Cookie(name='DICT_LOGIN', value='3||1578922508302', domain='.youdao.com', path='/', ...), Cookie(name='DICT_PERS', ...), Cookie(name='DICT_SESS', ...), Cookie(name='DICT_UGC', ...), Cookie(name='JSESSIONID', ...), Cookie(name='OUTFOX_SEARCH_USER_ID', ...), Cookie(name='OUTFOX_SEARCH_USER_ID_NCOO', ...), Cookie(name='ACCSESSIONID', ...), Cookie(name='___rl__test__cookies', ...)]>
    
    

    2. Download each HTML page into a list

    # Fetch all 6 pages of the wordbook (p=0 through p=5)
    htmls = []
    url = "http://dict.youdao.com/wordbook/wordlist?p={idx}&tags="
    for idx in range(6):
        time.sleep(1)  # throttle to one request per second
        print("** crawling page %d" % idx)
        r = requests.get(url.format(idx=idx), cookies=cookie_jar)
        htmls.append(r.text)
    
    
        ** crawling page 0
        ** crawling page 1
        ** crawling page 2
        ** crawling page 3
        ** crawling page 4
        ** crawling page 5
        
    
    
    htmls[0]
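
    Before parsing, it is worth checking that the cookies actually granted access. A minimal sketch, assuming a logged-out response simply contains no <table> markup (adjust the check to whatever the real logged-out page looks like):

    # Hypothetical login check: step 3 relies on each page containing
    # <table> elements, which disappear once the cookies expire.
    for i, html in enumerate(htmls):
        if "<table" not in html.lower():
            print("Warning: page %d has no table; cookies may have expired" % i)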
    

    3. Parse the tables in the pages with Pandas
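
    pd.read_html parses every <table> element it finds and returns a list of DataFrames; it needs an HTML parser such as lxml (or beautifulsoup4 plus html5lib) installed. A self-contained sketch of that behavior, using made-up HTML that mimics the wordbook layout (a header-only table followed by a body table):

    import io
    import pandas as pd

    sample = """
    <table><thead><tr><th>word</th><th>meaning</th></tr></thead></table>
    <table><tr><td>backbone</td><td>n. spine</td></tr></table>
    """
    tables = pd.read_html(io.StringIO(sample))
    print(len(tables))  # 2 -- one DataFrame per <table> element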

    # read_html returns a list of DataFrames, one per <table> on the page
    df = pd.read_html(htmls[0])
    
    print(len(df))
    print(type(df))
    
        2
        <class 'list'>
        
    
    df[0].head(3)  # the first table carries only the header row
    
    序号 单词 音标 解释 时间 分类 操作
    df[1].head(3)  # the second table holds the data, under numeric headers
    
       0  1              2           3                               4          5    6
    0  1  agglomerative  NaN         adj. 会凝聚的;[冶] 烧结的,凝结的  2020-1-13  NaN  NaN
    1  2  anatomy        [ə'nætəmɪ]  n. 解剖;解剖学;剖析;骨骼        2017-7-17  NaN  NaN
    2  3  backbone       ['bækbəʊn]  n. 支柱;主干网;决心,毅力;脊椎    2017-7-13  NaN  NaN
    # df[1] has the rows but numeric headers; borrow the real headers from df[0]
    df_cont = df[1]
    
    df_cont.columns = df[0].columns
    
    df_cont.head(3)
    
      序号  单词           音标          解释                           时间        分类   操作
    0  1   agglomerative  NaN          adj. 会凝聚的;[冶] 烧结的,凝结的  2020-1-13  NaN  NaN
    1  2   anatomy        [ə'nætəmɪ]   n. 解剖;解剖学;剖析;骨骼        2017-7-17  NaN  NaN
    2  3   backbone       ['bækbəʊn]   n. 支柱;主干网;决心,毅力;脊椎    2017-7-13  NaN  NaN
    # Collect the tables from all 6 pages
    df_list = []
    for html in htmls:
        df = pd.read_html(html)
        df_cont = df[1]
        df_cont.columns = df[0].columns
        df_list.append(df_cont)
    
    # Concatenate the per-page tables into one DataFrame
    df_all = pd.concat(df_list)
    
    df_all.head(3)
    
      序号  单词           音标          解释                           时间        分类   操作
    0  1   agglomerative  NaN          adj. 会凝聚的;[冶] 烧结的,凝结的  2020-1-13  NaN  NaN
    1  2   anatomy        [ə'nætəmɪ]   n. 解剖;解剖学;剖析;骨骼        2017-7-17  NaN  NaN
    2  3   backbone       ['bækbəʊn]   n. 支柱;主干网;决心,毅力;脊椎    2017-7-13  NaN  NaN
    df_all.shape
    
    
        (86, 7)
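
    One caveat: pd.concat keeps each page's own 0-based row index, so index values repeat across the 6 pages. If a continuous index matters, a small tweak (not in the original) renumbers the rows:

    # Renumber rows 0..85 across all pages instead of restarting per page
    df_all = pd.concat(df_list, ignore_index=True)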
    
    

    4. Write the result to an Excel file

    df_all[["单词", "音标", "解释"]].to_excel("./course_datas/c32_read_html/网易有道单词本列表.xlsx", index=False)
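
    Note that to_excel needs an Excel writer engine installed; for .xlsx files pandas uses openpyxl (pip install openpyxl). To avoid the extra dependency, CSV works as a drop-in alternative (hypothetical path, mirroring the original):

    # Dependency-free alternative to the Excel export above
    df_all[["单词", "音标", "解释"]].to_csv(
        "./course_datas/c32_read_html/网易有道单词本列表.csv", index=False)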
    
    
    
