Python Scrapy: Crawling Data from the PAT Website (1.0 Crawling Problem

Author: AlexSun1995 | Published: 2017-06-05 11:12


Preface

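To understand HTTP and web crawlers, I decided to build a crawler myself and do some simple data analysis on top of it. There were two options: write a simple crawler entirely by hand with Python's urllib plus an HTML parsing library (BeautifulSoup or similar), or learn a mature, powerful framework such as Scrapy. After weighing both, I chose the latter: a wheel I reinvent myself will certainly not be as good as one built by others, and if I really need a crawler in the future, Scrapy will be the more dependable choice.
What to crawl? For a first crawling exercise I wanted a site whose data is well structured and clean, ideally one record per row, which made me think of the PAT problem set.
GitHub address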

My understanding of crawlers

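Put simply, when we browse a web page we are actually sending a request (Request) to the page's server. After receiving the request, the server returns data (Response), which includes HTML (the earliest version of HTTP could only return HTML, which is of course no longer the case). The browser then renders this HTML, and that is the page we see. A crawler bypasses the browser: it sends requests automatically and collects the server's response data.
Building a complex real-world crawler is of course not this simple. You also have to deal with cookies, anti-crawling measures, simulated login, and so on. This project is only an introduction, though, so there is no rush; those topics can be learned gradually with more exposure.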
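
The request/response cycle described above can be tried without a browser and without touching any real site. The following is only a minimal sketch using the standard library: DemoHandler, fetch and run_demo are hypothetical names made up for this illustration, and the tiny local server stands in for a real web server.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a web server: answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><h1>hello crawler</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(url):
    """Send one GET request (no browser involved) and return (status, decoded body)."""
    # An empty ProxyHandler bypasses any system proxy for this local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.status, response.read().decode("utf-8")


def run_demo():
    """Start the local server on a free port, fetch one page, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: the OS picks a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return fetch("http://127.0.0.1:%d/" % server.server_port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, html = run_demo()
    print(status)
    print(html)
```

A framework such as Scrapy performs the same exchange, and adds scheduling, cookie handling and parsing helpers on top of it.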

Using Scrapy

I won't repeat how to install Scrapy or what it is; there are plenty of excellent resources online.

 scrapy startproject patSpider

This command creates a Scrapy project called patSpider. Running tree on it shows the project structure:

tree: project structure

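In the spiders folder, create a Python file containing a class that inherits from CrawlSpider; that class is a spider. (Note that a Scrapy project can hold more than one spider. Each spider has a unique name to tell it apart, and running scrapy crawl followed by the spider's name from the project directory starts that spider.)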

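First, inspect the network data of the PAT login page (using Chrome's developer tools). We need to simulate a login, and logging in simply means submitting the data the server expects (username, password, and so on) in the request's form. Note that there is also an authenticity_token field. We extract this value from the first response and submit it with the next request. (Copying the value directly would also work, but the code would lose its reusability: what if the server changes the value after some time?)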
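
To make the token step concrete, here is a small sketch that pulls hidden inputs out of a login form and merges them with the credentials. It uses only the standard library's html.parser rather than Scrapy's XPath selectors, and SAMPLE_LOGIN_PAGE, HiddenInputParser, build_login_form and the token value are all hypothetical, invented for illustration; the real PAT page may look different.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical login page; the token value is invented for this sketch.
SAMPLE_LOGIN_PAGE = """
<form action="/users/sign_in" method="post">
  <input type="hidden" name="authenticity_token" value="abc123==">
  <input type="text" name="user[handle]">
  <input type="password" name="user[password]">
</form>
"""


class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of every <input type="hidden">."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            self.hidden[attrs["name"]] = attrs.get("value") or ""


def build_login_form(html, handle, password):
    """Return the form fields to submit: server-issued hidden fields plus credentials."""
    parser = HiddenInputParser()
    parser.feed(html)
    form = dict(parser.hidden)
    form["user[handle]"] = handle
    form["user[password]"] = password
    return form


if __name__ == "__main__":
    form = build_login_form(SAMPLE_LOGIN_PAGE, "alice", "secret")
    # This is the url-encoded body a form POST would carry.
    print(urlencode(form))
```

Because the token is read from the page on every run, the login keeps working even if the server issues a different value next time.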

Screenshot from 2017-06-04 20-08-11.png

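Look at the fields under Form Data: these are all the fields we need to submit.
Next, look at the HTML format of the PAT Advanced Level problem set we want to crawl, because that is the format we will parse. Under each <tr>, the six <td> cells hold the information for one problem (whether it has been passed, problem ID, problem title, number of submissions, number of passes, pass rate). We will parse the HTML according to this pattern.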
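
The "six cells per problem" rule can be sketched independently of Scrapy. In the snippet below, group_cells and FIELDS are hypothetical names, the field order simply follows the list in the paragraph above and should be checked against the real page, and the counts in SAMPLE_CELLS are made up.

```python
# Field order assumed from the description above; verify it against the real table.
FIELDS = ("does_pass", "id", "title", "submit_times", "pass_times", "pass_rate")

# Made-up cell texts: two complete rows followed by one stray cell.
SAMPLE_CELLS = [
    "Y", "1001", "A+B Format", "100", "40", "0.40",
    "", "1002", "A+B for Polynomials", "200", "50", "0.25",
    "stray",
]


def group_cells(cells, fields=FIELDS):
    """Split a flat list of cell texts into one dict per problem.

    A trailing group with fewer cells than fields is ignored instead of
    raising an IndexError.
    """
    width = len(fields)
    rows = []
    for start in range(0, len(cells) - width + 1, width):
        rows.append(dict(zip(fields, cells[start:start + width])))
    return rows


if __name__ == "__main__":
    for row in group_cells(SAMPLE_CELLS):
        print(row)
```

Stepping through the list in fixed-size chunks and skipping an incomplete tail keeps the parser from crashing if the page contains extra cells, for example in a header or footer row.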

image.png

patSpider/patSpider/spiders/problem_info_spider.py

from scrapy import FormRequest
from scrapy import Request
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider
from patSpider.items import *
import pickle
from patSpider.pipelines import *

class pat_Spider(CrawlSpider):
    name = "pat"
    items = []
    call_times = 0
    # allowed_domains = []
    # URLs the spider needs to crawl; there are only two pages, so the second page's URL is listed directly
    start_urls = ["https://www.patest.cn/contests/pat-a-practise",
                  "https://www.patest.cn/contests/pat-a-practise?page=2"
                  ]
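    # Send requests to the site. These methods never need to be called explicitly;
    # Scrapy calls them automatically once the spider starts.
    # post_login is used as a callback to submit the form data. A request callback
    # is the function that is called with the response once a request has fetched
    # (downloaded) it.
    # See: callback https://doc.scrapy.org/en/1.3/topics/request-response.html#topics-request-response-ref-request-callback-arguments
    # start_requests(self) overrides the method inherited from CrawlSpider. It runs
    # automatically, and in this code it is the first of these methods to run.
    # The three methods work together as follows: request the login page first; once
    # the first response arrives, submit the form data, at which point we hold the
    # site's cookies; then pass the cookiejar along with every later request so the
    # logged-in state is kept.
    # This article gives a good introduction to cookie-based login: http://www.jianshu.com/p/887af1ab4200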
    def start_requests(self):
        return [Request("https://www.patest.cn/users/sign_in", meta={'cookiejar': 1}, callback=self.post_login)]

    def post_login(self, response):
        post_headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "Content-Type": "application/x-www-form-urlencoded",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36",
            "Referer": "https://www.patest.cn/users/sign_in",
            "Upgrade-Insecure-Requests": 1

        }
        authenticity_token = response.xpath('//input[@name="authenticity_token"]/@value').extract()[0]
        # print(authenticity_token)
        return [FormRequest.from_response(response,
                                          url="https://www.patest.cn/users/sign_in",
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          headers=post_headers,
                                          formdata={
                                              'utf8': '✓',
                                              'authenticity_token': authenticity_token,
                                              'user[handle]': 'suncun',
                                              # password hidden
                                              'user[password]': '********',
                                              'user[remember_me]': '0',
                                              'commit': "登录"
                                          },
                                          callback=self.after_login,
                                          dont_filter=True
                                          )]


    def after_login(self, response):
        for url in self.start_urls:
            yield Request(url, meta={'cookiejar': response.meta['cookiejar']})

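    # Note: this method is called automatically; parse generally runs once per requested URL.
    # By the time execution reaches this point, we already have a response returned after logging in.
    # Extracting data from this response gives us the results we need.
    # Pay particular attention to XPath syntax and to Selector vs. SelectorList objects.
    def parse(self, response):
        print(response.body)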
        self.call_times += 1
        data_selector = response.xpath('//tr/td')
        i = 0
        while i < len(data_selector):
            six_lines = data_selector[i:i + 6]
            i += 6
            item = PatspiderItem()
            if len(six_lines[0].xpath('.//span/text()').extract()) == 0:
                item['does_pass'] = 'Not submit'
            else:
                item['does_pass'] = six_lines[0].xpath('.//span/text()').extract()[0]
            item['id'] = six_lines[1].xpath('.//a/text()').extract()[0]
            item['title'] = six_lines[2].xpath('.//a/text()').extract()[0]
            item['pass_times'] = six_lines[3].xpath('./text()').extract()[0]
            item['submit_times'] = six_lines[4].xpath('./text()').extract()[0]
            item['pass_rate'] = six_lines[5].xpath('./text()').extract()[0]
            self.items.append(item)
            # Use yield rather than return: each item is handed to the item
            # pipelines while the spider is still running, so collecting and
            # processing data happen at the same time.
            yield item
        # On the last call to parse(), pickle the collected items so the data analysis can load them later
        if self.call_times == len(self.start_urls):
            with open('items_list', 'wb') as tmp_f:
                pickle.dump(self.items, tmp_f)
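
The hand-off between the spider and the analysis script is a pickle file. The sketch below shows that round trip in isolation, assuming plain dicts in place of PatspiderItem objects; save_items, load_items and round_trip are hypothetical helper names, and a temporary file replaces items_list.

```python
import os
import pickle
import tempfile


def save_items(items, path):
    """Serialize the list of items; pickle files must be opened in binary mode."""
    with open(path, "wb") as f:
        pickle.dump(items, f)


def load_items(path):
    """Load the list back, again in binary mode."""
    with open(path, "rb") as f:
        return pickle.load(f)


def round_trip(items):
    """Write items to a temporary file, read them back, and clean up."""
    fd, path = tempfile.mkstemp(suffix=".pkl")
    os.close(fd)
    try:
        save_items(items, path)
        return load_items(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Made-up sample record standing in for one scraped problem.
    sample = [{"id": "1001", "title": "A+B Format", "pass_rate": "0.40"}]
    print(round_trip(sample) == sample)
```

One thing to keep in mind: when real item objects are pickled, the script that loads them must be able to import the class they were created from, and a pickle file should only be loaded if it comes from a trusted source.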

Simple data analysis

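I analyzed the hardest problems (those with the lowest pass rate), how many problems I have passed, how many I have not attempted, and so on.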

import json
import matplotlib.pyplot as plt
import pickle

def total_submit_data(items):
    '''
    :param items: all the data of pat type:list of dic
    :return: (cnt_submit, cnt_pass)
    '''
    cnt_submit = 0
    cnt_pass = 0
    for item in items:
        cnt_submit += int(item['submit_times'])
        cnt_pass += int(item['pass_times'])
    print('total submit times: %d, total pass times: %d' % (cnt_submit, cnt_pass))
    print('rate: %f' % (cnt_pass * 1.0 / cnt_submit))
    return cnt_submit, cnt_pass

def top_k_hard(items, k):
    '''
    :param items: all the data of pat, type: list of dic
    :param k: number of problems to return, e.g. k = 10 returns
    the 10 hardest problems (those with the lowest pass rate)
    :return: list(dic)
    '''
    size = len(items)
    if k > size:
        k = size
        print('k is larger than the number of problems; reduced k to %d' % k)
    new_items = sorted(items, key=lambda x:float(x['pass_rate']))
    # print new_items[0:k]
    return new_items[0:k]

def self_practice_data(items):
    '''
    user: suncun (myself)
    pass_word: ***********
    Shows the number of problems I've passed, the number tried but
    not yet passed, and the number never tried.
    :param items: all the data of pat, type: list of dic
    :return: None
    '''
    print(items)
    cnt_pass = 0
    cnt_not_try = 0
    cnt_not_pass = 0
    total_problems = len(items)
    for item in items:
        situation = item['does_pass']
        if situation == 'Not submit':
            cnt_not_try += 1
        elif situation == 'Y':
            cnt_pass += 1
        else:
            cnt_not_pass += 1
    print('there are %d problems in total, and I\'ve passed %d of them' % (total_problems, cnt_pass))
    print('tried but not passed %d problems, still %d problems not tried yet' % (cnt_not_pass, cnt_not_try))


if __name__ == '__main__':
    # pickle data must be read in binary mode
    with open('../items_list', 'rb') as f:
        items = pickle.load(f)
    # total_submit_data(items)
    # print top_k_hard(items, 10)
    self_practice_data(items)

Screenshot of part of the analysis results:

image.png
