Scrapy 400 and 415 Errors

Author: 不听不听乾乾念经 | Published 2018-05-07 17:52

    While crawling a finance app with the Scrapy framework today, I ran into a couple of small problems that cost me a whole day. Writing them down here.

    Status 415: the request was sent without headers


    First I used Charles to capture the app's request packet:

    [Screenshot: the captured request packet]

    It is a POST request that submits form data, so I built the request with scrapy.FormRequest. The spider code:

    import json

    import scrapy
    from scrapy import Spider


    class yilicai(Spider):
        name = "yilicai"
        urls = "http://api.yilicai.cn/product/all5"
        base_url = "https://www.yilicai.cn"
        DOWNLOAD_DELAY = 0
        count = 0
        appmc = "壹理财"

        def start_requests(self):
            formdata = {
                'page': '1',
                'sType': '0',
                'sTerm': '0',
                'sRate': '0',
                'sRecover': '0',
                'sStart': '0'
            }
            # FormRequest POSTs the dict as application/x-www-form-urlencoded
            yield scrapy.FormRequest(self.urls, callback=self.parse, formdata=formdata)

        def parse(self, response):
            datas = json.loads(response.body)
            print(json.dumps(datas, sort_keys=True, indent=2))
            
    

    Running the spider produced a 415 error:

    2018-05-07 17:00:20 [scrapy.core.engine] DEBUG: Crawled (415) <POST http://api.yilicai.cn/product/all5> (referer: None)
    2018-05-07 17:00:20 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <415 http://api.yilicai.cn/product/all5>: HTTP status code is not handled or not allowed
    2018-05-07 17:00:21 [scrapy.core.engine] INFO: Closing spider (finished)
    

    I looked up what HTTP status 415 means:

    415 Unsupported Media Type: the server cannot process the media format of the request body.
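    While debugging errors like this, note that Scrapy's HttpError middleware drops non-2xx responses before they reach your callback (the "HTTP status code is not handled or not allowed" lines in the log above). One way to inspect the error response body instead is to let those codes through, either globally in settings or per spider. A minimal sketch (the status codes listed are just the ones from this post):

    ```python
    # settings.py -- let 400/415 responses reach parse() so that
    # response.status / response.body can be inspected instead of dropped.
    HTTPERROR_ALLOWED_CODES = [400, 415]

    # Equivalently, as a per-spider attribute:
    # class yilicai(Spider):
    #     handle_httpstatus_list = [400, 415]
    ```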

    It turned out I simply hadn't attached any headers. The code with headers added:

    headers = {
        "Accept-Language": "zh-CN,zh;q=0.8",
        # Note: the header name must be "User-Agent" with no trailing space.
        "User-Agent": "Mozilla/5.0 (Linux; U; Android 6.0; zh-cn; AOSP on HammerHead Build/MRA58K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30",
        "Content-Type": "application/json;charset=utf-8",
        "Host": "api.yilicai.cn",
        "Accept-Encoding": "gzip",
    }
    ...
    ...
    yield scrapy.FormRequest(self.urls, headers=self.headers, callback=self.parse, formdata=formdata)
    

    Status 400: the form data was not serialized to JSON

    Running it again, the 415 was gone, but a new error took its place, a 400:

    2018-05-07 17:11:59 [scrapy.core.engine] DEBUG: Crawled (400) <POST http://api.yilicai.cn/product/all5> (referer: None)
    2018-05-07 17:11:59 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://api.yilicai.cn/product/all5>: HTTP status code is not handled or not allowed
    2018-05-07 17:11:59 [scrapy.core.engine] INFO: Closing spider (finished)
    

    Heartbreaking. This 400 is the one that cost me the whole day. First I looked up what status 400 means:

    400 Bad Request: the server could not understand the request.

    It turned out this endpoint strictly requires the submitted body to be JSON, so the form data has to be serialized to JSON before it is sent.
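    The difference between the two body formats is easy to see with the standard library alone: FormRequest url-encodes the dict, while this API wants the JSON serialization of it. A small sketch (using two of the fields from the capture above):

    ```python
    import json
    from urllib.parse import urlencode

    formdata = {'page': '1', 'sType': '0'}

    # What FormRequest sends (application/x-www-form-urlencoded):
    print(urlencode(formdata))   # page=1&sType=0

    # What this API expects (application/json):
    print(json.dumps(formdata))  # {"page": "1", "sType": "0"}
    ```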

    Because passing formdata=json.dumps(formdata) to scrapy.FormRequest raises an error (formdata expects a dict or an iterable of key/value pairs, not a string), I switched to scrapy.Request and set the body myself:
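    As an aside: newer Scrapy releases (1.8+) ship scrapy.http.JsonRequest, which serializes a data dict to a JSON body, sets the Content-Type: application/json header, and defaults to POST, so it can stand in for the manual json.dumps approach. A sketch, assuming Scrapy 1.8+ is available:

    ```python
    from scrapy.http import JsonRequest  # Scrapy >= 1.8

    # JsonRequest serializes `data` to a JSON body, defaults the method
    # to POST, and sets Content-Type: application/json automatically.
    req = JsonRequest(
        url="http://api.yilicai.cn/product/all5",
        data={'page': '1', 'sType': '0', 'sTerm': '0',
              'sRate': '0', 'sRecover': '0', 'sStart': '0'},
    )
    print(req.method)                    # POST
    print(req.headers['Content-Type'])   # b'application/json'
    ```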

    import json

    import scrapy
    from scrapy import Spider


    class yilicai(Spider):
        name = "yilicai"
        urls = "http://api.yilicai.cn/product/all5"
        base_url = "https://www.yilicai.cn"
        DOWNLOAD_DELAY = 0
        count = 0
        appmc = "壹理财"

        headers = {
            "Accept-Language": "zh-CN,zh;q=0.8",
            "User-Agent": "Mozilla/5.0 (Linux; U; Android 6.0; zh-cn; AOSP on HammerHead Build/MRA58K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30",
            "Content-Type": "application/json;charset=utf-8",
            "Host": "api.yilicai.cn",
            "Accept-Encoding": "gzip",
        }

        def start_requests(self):
            formdata = {
                'page': '1',
                'sType': '0',
                'sTerm': '0',
                'sRate': '0',
                'sRecover': '0',
                'sStart': '0'
            }
            # Serialize the dict to a JSON string and send it as the raw body.
            temp = json.dumps(formdata)
            # Note: scrapy.Request defaults to GET; pass method='POST' if the
            # endpoint requires POST (the Charles capture showed a POST).
            yield scrapy.Request(self.urls, body=temp, headers=self.headers, callback=self.parse)

        def parse(self, response):
            datas = json.loads(response.body)
            print(json.dumps(datas, sort_keys=True, indent=2))
    

    At last the response came back successfully, and I could happily get on with the data analysis:

    2018-05-07 17:47:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://api.yilicai.cn/product/all5> (referer: None)
    {
      "base_url": "https://www.yilicai.cn", 
      "current_page": "1", 
      "new_hand": 1, 
      "pager": "1", 
      "pagerParam": {
        "count": 16063, 
        "maxPage": 1607, 
        "perPage": 10
      }, 
      "product_list": [
        {
        ......
        }
      ], 
      "sid": null, 
      "status": "0"
    }
    2018-05-07 17:47:16 [scrapy.core.engine] INFO: Closing spider (finished)
    2018-05-07 17:47:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    ......
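    The pagerParam block in the response also tells you how far to paginate. A small sketch of the arithmetic, using the numbers from the response above; incrementing the 'page' field in formdata up to maxPage would walk every page:

    ```python
    import math

    # Values from the pagerParam object in the response above.
    pager = {"count": 16063, "maxPage": 1607, "perPage": 10}

    # maxPage is just the record count divided by the page size, rounded up.
    pages = math.ceil(pager["count"] / pager["perPage"])
    print(pages)  # 1607
    ```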
    
