Scraping industry report information from 数据局 (shujuju.cn) with Python and the Requests library

Author: SirKay92 | Published 2019-07-28 23:07

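As a beginner who has grown very fond of Python, I firmly believe that practice is the only way to master a tool, so I have long wanted to build small Python projects that make everyday tasks at work and in life more efficient.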

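At work I inevitably need to collect all kinds of industry research reports, and finding and downloading them through a search engine alone costs a lot of time and energy. Later I discovered a wonderfully generous site, 数据局 (http://shujuju.cn), which regularly posts new reports and has helped me out of a tight spot more than once.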

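With a treasure trove like this, I decided to scrape the report titles and links. Downloading the files requires logging in, and the login involves an image CAPTCHA that turned out to be tricky, so this post stops at collecting the links. A later post will analyze the reports themselves (see you next time~).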

The overall approach is quite simple, because the site itself has little that is complicated. The steps are:

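Fetch the first page of the report list with requests and parse it with lxml's etree to get the total number of pages (in the code, this is the number of the newest report).
Because paging only appends a number to "http://www.shujuju.cn/lecture/detail/", use that pattern to build every URL that needs to be crawled.
Then keep using requests and lxml to record each report's title and the URL of the page the report is on.
Then crawl the report detail pages one by one and save the report link attached at the end of each article.
Finally, save all the data locally with csv.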
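Steps 1 and 2 come down to pulling a number out of the newest report's link and counting up to it. Here is a minimal sketch of that idea, assuming the newest href ends in the report number. The sample href and the helper names `extract_report_number` and `build_detail_urls` are made up for illustration, and the sketch matches trailing digits rather than the fixed four-digit pattern used in the full code.

```python
import re

BASE_DETAIL_URL = "http://www.shujuju.cn/lecture/detail/"


def extract_report_number(href):
    """Return the trailing number of a detail link as an int (hypothetical helper)."""
    match = re.search(r"(\d+)/?$", href)
    if match is None:
        raise ValueError("no report number found in: " + href)
    return int(match.group(1))


def build_detail_urls(latest_number):
    """Build one detail-page URL for every report number from 1 to latest_number."""
    return [BASE_DETAIL_URL + str(n) for n in range(1, latest_number + 1)]


# Assumed example href; the real value would come from the list page.
urls = build_detail_urls(extract_report_number("/lecture/detail/3"))
print(urls)
```

Returning an int from the helper keeps the later `range` call free of repeated `int(...)` conversions.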

The complete code is as follows:

# -*- coding: utf-8 -*-
import requests
import re
import time
import csv
from lxml import etree

# Build the request headers
headers = {
    "Accept": "application/json",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
}

# Get the number of the newest report
url = "http://www.shujuju.cn/lecture/browe"
response = requests.get(url, headers=headers)
datas = etree.HTML(response.text)
results = datas.xpath('//div[@class="textdescription-small-info"]/h3/a/@href')
# Take the four-digit report number from the first link (kept as a string)
last_report_num = re.search(r'\d{4}', results[0]).group(0)

# Build the list of report page URLs
report_page_urls = []
for report_num in range(1, int(last_report_num) + 1):
    report_page_url = "http://www.shujuju.cn/lecture/detail/" + str(report_num)
    report_page_urls.append(report_page_url)

# Fetch the content of each report page

report_info = []

for i, target_url in enumerate(report_page_urls):
    response = requests.get(target_url, headers=headers)
    page_datas = etree.HTML(response.text)
    # Get the title
    title = page_datas.xpath('//h1[@class="title"]/text()')
    title = title[0]

    # Get the download link
    report_download_ul = page_datas.xpath('//div[@class="report-article"]/ul/li/a/@href')
    try:
        report_download_ul = report_download_ul[0]
        report_download_ul = "http://www.shujuju.cn" + report_download_ul
    except IndexError:
        print(title + " has no download link on its report page")
        report_download_ul = ''
    new_report_info = {'report_title': title, 'download_url': report_download_ul, 'page_url': target_url}
    report_info.append(new_report_info)
    print("Processed page {} of {}".format(i + 1, last_report_num))
    # Pause between requests to avoid hammering the site
    time.sleep(1)

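# Save the data to a CSV file
# newline='' stops the csv module from writing blank lines between rows on Windows
with open('E:/report/report.csv', 'w', newline='') as csvfile:
    fieldnames = report_info[0].keys()
    f_csv = csv.DictWriter(csvfile, fieldnames=fieldnames)
    f_csv.writeheader()
    for index, data in enumerate(report_info, start=1):
        try:
            f_csv.writerow(data)
        except UnicodeEncodeError:
            print('The title of report ' + str(index) + ' contains special characters; saving it failed')
    print("File saved; processing finished")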
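
The download-link handling in the loop above can also be pulled into a small helper. This is only a sketch: `first_download_url` is a hypothetical name, the sample hrefs are invented, and it swaps the string concatenation for the standard library's `urljoin`, which also copes with hrefs that are already absolute.

```python
from urllib.parse import urljoin

SITE_ROOT = "http://www.shujuju.cn"


def first_download_url(hrefs, site_root=SITE_ROOT):
    """Return the first href as an absolute URL, or '' when the list is empty."""
    if not hrefs:
        return ""
    return urljoin(site_root, hrefs[0])


# Invented sample values; real hrefs would come from the xpath query.
print(first_download_url(["/files/report.pdf"]))
print(repr(first_download_url([])))
```

Checking for an empty list up front avoids using an exception for what is an expected case on pages without an attachment.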

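Two problems came up along the way. First, some pages introduce a report without attaching a download link; in that case the report title and the page URL are still saved, just without a download link. Second, some report titles contain special characters that could not be written to the CSV file, so those reports are not recorded for now.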
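
One likely cause of the second problem, though the post does not confirm it, is that `open()` without an `encoding` argument falls back to the platform default, which on a Chinese-language Windows system is typically GBK and cannot represent every character. The sketch below writes the rows with an explicit encoding instead. The function name `save_reports`, the sample rows, and the temporary file path are all assumptions made for illustration.

```python
import csv
import os
import tempfile

# Made-up sample rows; the second title contains an emoji, which GBK cannot encode.
sample_rows = [
    {"report_title": "2019 行业报告", "download_url": "",
     "page_url": "http://www.shujuju.cn/lecture/detail/1"},
    {"report_title": "Market report \U0001F4C8", "download_url": "",
     "page_url": "http://www.shujuju.cn/lecture/detail/2"},
]


def save_reports(rows, path):
    """Write rows to a CSV file encoded as UTF-8 with a byte order mark."""
    fieldnames = ["report_title", "download_url", "page_url"]
    # newline='' lets the csv module control line endings;
    # utf-8-sig writes a BOM, which helps Excel detect the encoding.
    with open(path, "w", newline="", encoding="utf-8-sig") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


path = os.path.join(tempfile.gettempdir(), "report_sketch.csv")
save_reports(sample_rows, path)

# Read the file back to confirm that both titles survived the round trip.
with open(path, newline="", encoding="utf-8-sig") as csvfile:
    loaded = list(csv.DictReader(csvfile))
print(len(loaded))
```

With an explicit UTF-8 encoding, the `UnicodeEncodeError` branch should no longer be needed for ordinary titles.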

The final saved result looks like this: