My first Python program: scraping an HTML table and saving it as a local xls file

Author: 迟客 | Posted 2017-08-06 17:29

    Requirement: fetch each day's hydrological bulletin from the target site and save it into Excel for further processing.
    Looking at a bulletin URL such as http://www.sxmwr.gov.cn/gb-zxfw-news-3-list-79501, the number after "list" is not sequential and follows no regular pattern. So the bulletin links for every day first have to be scraped from each page of the bulletin column, and then each link is visited to grab the bulletin content and save it as a local file.
    This is done in the following 4 steps:

    1. Grab the link for each day's bulletin (index.py)

    import requests
    from bs4 import BeautifulSoup
    
    for i in range(20,30):
        url = 'http://www.sxmwr.gov.cn/List.action?classfyid=61&currentPage=' + str(i) + '&turnHref=a0909d3e2591f4bad185dcc317ed1305ecd1ce6e561ebba08b16ce55831d5ab8a287dfa4a19beda0c6bc579659449f995714cf0f1d6601099d111aa8b2a942c122565fccc10321a12fa3875b48a46949d5c36fb26f106d16e54a688e17199bd5c4e6b68a622d3b2792ba2c781a2d4e17fffe1f9e8c4d6cdf6348d9a80dbcf0bdaea67d6bcc745b348c230d59c63a6576131bcee30514c0527ad244d7662c1922'
        res = requests.get(url)
        res.encoding = 'utf-8'
        html_sample = res.text
        soup = BeautifulSoup(html_sample, 'html.parser')
        # every <a> inside the .articlelist block points to one day's bulletin
        for link in soup.select('.articlelist a'):
            print(link['href'])
    

    For a beginner, BeautifulSoup feels like the most convenient way to scrape a page. (Search online for the details of how to use it.)
    The print output looks like this:

    gb-zxfw-news-3-list-62331
    gb-zxfw-news-3-list-62314
    ……
    

    2. Process the previous step's output in Excel

    The next step needs a Python list of the form ['a','b'], so copy the output of the previous step into Excel, transpose it into a single row, and in the row below insert a formula that wraps each value in the row above in quotes (assuming the first cell of the first row is A1):

    ="'"&A1&"'"
    

    Then save it as a comma-separated CSV file, which can be opened in Notepad and copied into the next Python program.
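
    This manual Excel step can also be skipped with a few lines of Python. The sketch below is my own addition, not part of the original workflow; it assumes the hrefs from step 1 have already been collected into a Python list.

    # minimal sketch: build the quoted, comma-separated line that the Excel
    # formula above produces, straight from the scraped hrefs
    links = ['gb-zxfw-news-3-list-62331', 'gb-zxfw-news-3-list-62314']  # sample values from step 1
    quoted = ','.join("'" + href + "'" for href in links)
    print(quoted)  # 'gb-zxfw-news-3-list-62331','gb-zxfw-news-3-list-62314'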

    3. Save each day's bulletin as an xls file (content.py)

    import requests
    from bs4 import BeautifulSoup
    
    
    def getNewsDetail(newsurl):
        # fetch one bulletin page and pull out its title and content block
        result = {}
        res = requests.get(newsurl)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        result['title'] = soup.select('#mainNewsTitles')[0].text.strip()
        result['artical'] = soup.select('#mainNewsContent')[0]
        return result
    A=['gb-zxfw-news-3-list-62331','gb-zxfw-news-3-list-62314','gb-zxfw-news-3-list-62276','gb-zxfw-news-3-list-62247','gb-zxfw-news-3-list-62218','gb-zxfw-news-3-list-62187','gb-zxfw-news-3-list-62156','gb-zxfw-news-3-list-62155','gb-zxfw-news-3-list-62133','gb-zxfw-news-3-list-62100','gb-zxfw-news-3-list-62076','gb-zxfw-news-3-list-62042','gb-zxfw-news-3-list-61998','gb-zxfw-news-3-list-61995','gb-zxfw-news-3-list-61994','gb-zxfw-news-3-list-55477','gb-zxfw-news-3-list-55385','gb-zxfw-news-3-list-55344','gb-zxfw-news-3-list-55306','gb-zxfw-news-3-list-55289','gb-zxfw-news-3-list-55165','gb-zxfw-news-3-list-55098','gb-zxfw-news-3-list-55047','gb-zxfw-news-3-list-55002','gb-zxfw-news-3-list-55000','gb-zxfw-news-3-list-54997','gb-zxfw-news-3-list-54995','gb-zxfw-news-3-list-54994','gb-zxfw-news-3-list-54991','gb-zxfw-news-3-list-54989','gb-zxfw-news-3-list-54940','gb-zxfw-news-3-list-54900','gb-zxfw-news-3-list-54850','gb-zxfw-news-3-list-54810','gb-zxfw-news-3-list-54772','gb-zxfw-news-3-list-54730','gb-zxfw-news-3-list-54695','gb-zxfw-news-3-list-54643','gb-zxfw-news-3-list-54584','gb-zxfw-news-3-list-54558','gb-zxfw-news-3-list-54556','gb-zxfw-news-3-list-54515','gb-zxfw-news-3-list-54468','gb-zxfw-news-3-list-54387','gb-zxfw-news-3-list-54325','gb-zxfw-news-3-list-54237','gb-zxfw-news-3-list-54188','gb-zxfw-news-3-list-54178','gb-zxfw-news-3-list-54132','gb-zxfw-news-3-list-54052','gb-zxfw-news-3-list-53983','gb-zxfw-news-3-list-53935','gb-zxfw-news-3-list-53892','gb-zxfw-news-3-list-53889','gb-zxfw-news-3-list-53887','gb-zxfw-news-3-list-53858','gb-zxfw-news-3-list-53825','gb-zxfw-news-3-list-53779','gb-zxfw-news-3-list-53729','gb-zxfw-news-3-list-53640','gb-zxfw-news-3-list-53589','gb-zxfw-news-3-list-53587','gb-zxfw-news-3-list-53561','gb-zxfw-news-3-list-53513','gb-zxfw-news-3-list-53465','gb-zxfw-news-3-list-53405','gb-zxfw-news-3-list-53338','gb-zxfw-news-3-list-53286','gb-zxfw-news-3-list-53284','gb-zxfw-news-3-list-53238','gb-zxfw-news-3-list-53187','gb-zxfw-news-3-list-53128','gb-zxfw-news-3-list-53072','gb-zxfw-news-3-list-53012','gb-zxfw-news-3-list-52957','gb-zxfw-news-3-list-52951','gb-zxfw-news-3-list-52897','gb-zxfw-news-3-list-52845','gb-zxfw-news-3-list-52801','gb-zxfw-news-3-list-52740','gb-zxfw-news-3-list-52632','gb-zxfw-news-3-list-52571','gb-zxfw-news-3-list-52564','gb-zxfw-news-3-list-52499','gb-zxfw-news-3-list-52407','gb-zxfw-news-3-list-52295','gb-zxfw-news-3-list-52196','gb-zxfw-news-3-list-52057','gb-zxfw-news-3-list-52003','gb-zxfw-news-3-list-51992','gb-zxfw-news-3-list-51935','gb-zxfw-news-3-list-51848','gb-zxfw-news-3-list-51751','gb-zxfw-news-3-list-51667','gb-zxfw-news-3-list-51612','gb-zxfw-news-3-list-51560','gb-zxfw-news-3-list-51550','gb-zxfw-news-3-list-51504','gb-zxfw-news-3-list-51346','gb-zxfw-news-3-list-51431','gb-zxfw-news-3-list-51246','gb-zxfw-news-3-list-51168','gb-zxfw-news-3-list-51134','gb-zxfw-news-3-list-51131','gb-zxfw-news-3-list-51100','gb-zxfw-news-3-list-51036','gb-zxfw-news-3-list-50933','gb-zxfw-news-3-list-50825','gb-zxfw-news-3-list-50760','gb-zxfw-news-3-list-50717','gb-zxfw-news-3-list-50710','gb-zxfw-news-3-list-50652','gb-zxfw-news-3-list-50578','gb-zxfw-news-3-list-50458','gb-zxfw-news-3-list-50401','gb-zxfw-news-3-list-50336','gb-zxfw-news-3-list-50285','gb-zxfw-news-3-list-50284','gb-zxfw-news-3-list-50243','gb-zxfw-news-3-list-50166','gb-zxfw-news-3-list-50068','gb-zxfw-news-3-list-49957','gb-zxfw-news-3-list-49893','gb-zxfw-news-3-list-49841','gb-zxfw-news-3-list-49840','gb-zxfw-news-3-list-49839','gb-zxfw-news-3-list
-49793','gb-zxfw-news-3-list-49727','gb-zxfw-news-3-list-49653','gb-zxfw-news-3-list-49573','gb-zxfw-news-3-list-49475','gb-zxfw-news-3-list-49424','gb-zxfw-news-3-list-49421','gb-zxfw-news-3-list-49380','gb-zxfw-news-3-list-49273','gb-zxfw-news-3-list-49191','gb-zxfw-news-3-list-49125','gb-zxfw-news-3-list-49059','gb-zxfw-news-3-list-49017','gb-zxfw-news-3-list-49016','gb-zxfw-news-3-list-48960','gb-zxfw-news-3-list-48888','gb-zxfw-news-3-list-48820','gb-zxfw-news-3-list-48710','gb-zxfw-news-3-list-48673','gb-zxfw-news-3-list-48748','gb-zxfw-news-3-list-48644','gb-zxfw-news-3-list-48542','gb-zxfw-news-3-list-48502','gb-zxfw-news-3-list-48604','gb-zxfw-news-3-list-48427','gb-zxfw-news-3-list-48362','gb-zxfw-news-3-list-48360','gb-zxfw-news-3-list-48358','gb-zxfw-news-3-list-48249','gb-zxfw-news-3-list-48173','gb-zxfw-news-3-list-48108','gb-zxfw-news-3-list-48045','gb-zxfw-news-3-list-47986','gb-zxfw-news-3-list-47984','gb-zxfw-news-3-list-47923','gb-zxfw-news-3-list-47844','gb-zxfw-news-3-list-47771','gb-zxfw-news-3-list-47692','gb-zxfw-news-3-list-47585','gb-zxfw-news-3-list-47535','gb-zxfw-news-3-list-47533','gb-zxfw-news-3-list-47491','gb-zxfw-news-3-list-47410','gb-zxfw-news-3-list-47343','gb-zxfw-news-3-list-47221','gb-zxfw-news-3-list-47151','gb-zxfw-news-3-list-47123','gb-zxfw-news-3-list-47120','gb-zxfw-news-3-list-47053','gb-zxfw-news-3-list-47010','gb-zxfw-news-3-list-47057','gb-zxfw-news-3-list-46719','gb-zxfw-news-3-list-46641','gb-zxfw-news-3-list-46590','gb-zxfw-news-3-list-46550','gb-zxfw-news-3-list-47056','gb-zxfw-news-3-list-46546','gb-zxfw-news-3-list-38623','gb-zxfw-news-3-list-38512','gb-zxfw-news-3-list-38482','gb-zxfw-news-3-list-38412','gb-zxfw-news-3-list-38357','gb-zxfw-news-3-list-38169','gb-zxfw-news-3-list-38105','gb-zxfw-news-3-list-38103','gb-zxfw-news-3-list-38221','gb-zxfw-news-3-list-38101','gb-zxfw-news-3-list-38098','gb-zxfw-news-3-list-38096','gb-zxfw-news-3-list-38088','gb-zxfw-news-3-list-38050','gb-zxfw-news-3-list-37982','gb-zxfw-news-3-list-37914','gb-zxfw-news-3-list-37861']
    for a in A:
        zd = getNewsDetail('http://www.sxmwr.gov.cn/'+str(a))
        s = str(zd['artical'])
        names = zd['title']
        # the content is HTML, but saving it with an .xls extension lets Excel open it
        with open(str(names)+'.xls', 'w+', encoding="utf-8") as f:
            f.write(s)
    

    This again uses BeautifulSoup to scrape the pages, only wrapped in a function this time. The list A in the code is exactly the output of the previous step. Running it produces an xls file for every page we need. (I don't know how to use a module that converts HTML tables to Excel, but I found that if the HTML table is saved as a local HTML file and renamed to .xls, Excel will open it.)
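
    If you actually want a real Excel file rather than a renamed HTML page, pandas can do the conversion. The following is only a rough sketch of that idea (my own addition, not from the original post); it assumes pandas is installed together with an HTML parser such as lxml and an Excel writer such as openpyxl, and it reuses the variables s and names from the loop above.

    import io
    import pandas as pd

    # read_html extracts every <table> from the saved #mainNewsContent HTML
    tables = pd.read_html(io.StringIO(s))
    if tables:
        # write the first table to a genuine Excel workbook (needs openpyxl)
        tables[0].to_excel(str(names) + '.xlsx', index=False)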

    4. Merge all the xls files into one multi-sheet workbook

    I grabbed a VBA snippet straight off the web:

    Sub CombineWorkbooks()
        Dim FilesToOpen, ft
        Dim x As Integer
        Dim wk As Workbook
        Application.ScreenUpdating = False
        On Error GoTo errhandler
        ' let the user pick the xls files produced in step 3
        FilesToOpen = Application.GetOpenFilename _
            (FileFilter:="Microsoft Excel files (*.xls), *.xls", _
            MultiSelect:=True, Title:="Files to merge")
        If TypeName(FilesToOpen) = "Boolean" Then
            MsgBox "No files selected"
            GoTo errhandler
        End If
        x = 1
        ' open each selected workbook and move its sheets into this workbook
        While x <= UBound(FilesToOpen)
            Set wk = Workbooks.Open(Filename:=FilesToOpen(x))
            wk.Sheets().Move after:=ThisWorkbook.Sheets _
                (ThisWorkbook.Sheets.Count)
            x = x + 1
        Wend
        MsgBox "Merge finished!"
    errhandler:
        Application.ScreenUpdating = True
        'MsgBox Err.Description
    End Sub
    

    Paste it into the VBA editor under Excel's Developer tab and run it. It works quite well.
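
    If you would rather stay in Python for this step as well, the same merge can be sketched with pandas: parse the table out of each saved file and write every one to its own sheet of a single workbook. This is only a rough idea of mine (assuming pandas, lxml and openpyxl are available), not something from the original post.

    import glob
    import pandas as pd

    # every per-day file produced in step 3 is HTML inside, despite the .xls extension
    with pd.ExcelWriter('merged.xlsx') as writer:
        for path in glob.glob('*.xls'):
            tables = pd.read_html(path)  # parses the <table> elements in the file
            if tables:
                # Excel sheet names are limited to 31 characters
                tables[0].to_excel(writer, sheet_name=path[:-4][:31], index=False)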

    Summary

    As a beginner who hasn't even finished an introductory Python tutorial, it took me two or three days to get all of the above working, and I nearly lost my mind. The whole thing may well be a made-up requirement; I treat it as learning by doing.
    Steps 2 and 4 above are a beginner's workaround, frankly a bit of a joke. When I'm able to, I'll see whether the whole thing can be merged into a single Python program.

    That's all.

    Update, Oct 24

    The Python parts have been combined into a single script:

    import requests
    import time
    from bs4 import BeautifulSoup

    C = []   # will hold the href of every bulletin found on the listing pages
    time_start=time.time()
    
    for j in range(20,30):
        url = 'http://www.sxmwr.gov.cn/List.action?classfyid=61&currentPage=' + str(j) + '&turnHref=a0909d3e2591f4bad185dcc317ed1305ecd1ce6e561ebba08b16ce55831d5ab8a287dfa4a19beda0c6bc579659449f995714cf0f1d6601099d111aa8b2a942c122565fccc10321a12fa3875b48a46949d5c36fb26f106d16e54a688e17199bd5c4e6b68a622d3b2792ba2c781a2d4e17fffe1f9e8c4d6cdf6348d9a80dbcf0bdaea67d6bcc745b348c230d59c63a6576131bcee30514c0527ad244d7662c1922'
        res = requests.get(url)
        res.encoding = 'utf-8'
        html_sample = res.text
        soup = BeautifulSoup(html_sample, 'html.parser')
        # collect the href of every bulletin listed on this page
        for link in soup.select('.articlelist a'):
            C.append(link['href'])
    
    def get_xls(url_1):
        # fetch one bulletin page and pull out its title and content block
        result = {}
        res = requests.get(url_1)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        result['title'] = soup.select('#mainNewsTitles')[0].text.strip()
        result['artical'] = soup.select('#mainNewsContent')[0]
        return result
    
    for k in C:
        dic = get_xls('http://www.sxmwr.gov.cn/' + str(k))
        s = str(dic['artical'])
        names = dic['title']
        # the content is HTML, but saving it with an .xls extension lets Excel open it
        with open(str(names)+'.xls', 'w+', encoding="utf-8") as f:
            f.write(s)
    
    time_end=time.time()
    print('Done, total time: ' + str(time_end - time_start) + ' seconds')
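
    One small thing worth adding to the combined script: the site gets one request per listing page and one per bulletin, so a short pause and some basic error handling keep the run from hammering the server or dying on a single bad response. A minimal sketch of the idea (my own addition; get_xls could call this helper instead of requests.get and skip pages where it returns None):

    import time
    import requests

    def fetch(url, retries=3, delay=1.0):
        # try a few times, sleeping between attempts, before giving up
        for attempt in range(retries):
            try:
                res = requests.get(url, timeout=10)
                res.raise_for_status()
                res.encoding = 'utf-8'
                return res
            except requests.RequestException:
                time.sleep(delay)
        return None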
    
