Fixing HTTP 304 Responses in a Crawler

Author: 会爬虫的小蟒蛇 | Published: 2022-06-03 02:23

Target URL: http://www.ts.gov.cn/col/col1300641/index.html

The spider file is as follows:

import scrapy


class TaishungovSpider(scrapy.Spider):
    name = 'TaiShunGov'
    # allowed_domains = ['www.ts.gov.cn']
    # start_urls = ['http://www.ts.gov.cn/']

    def start_requests(self):
        # Only page 1 is requested while debugging;
        # 29 is presumably the upper bound for the full crawl.
        for index in range(1, 2):  # 29
            yield scrapy.Request(
                url='http://www.ts.gov.cn/col/col1300641/index.html?uid=4365453&pageNum={}'.format(index),
                callback=self.parse,
                dont_filter=True)

    def parse(self, response):
        # Dump the raw body to confirm that a real page came back.
        print(response.body)

The request headers are configured as follows:

DEFAULT_REQUEST_HEADERS = {
      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
      "Accept-Encoding": "gzip, deflate",
      "Accept-Language": "en-US,en;q=0.9",
      "Cache-Control": "max-age=0",
      "Connection": "keep-alive",
      "Cookie": "_gscu_899679477=54252423c3ye2110; TSSESSIONID=7cbc5c00-5074-45e9-b959-a5ca0e79365d; _gscbrs_899679477=1; _gscs_899679477=54252423j8ahp610|pv:4; SERVERID=b2ba659a0bf802d127f2ffc5234eeeba|1654254468|1654254464",
      "Host": "www.ts.gov.cn",
      'If-Modified-Since': 'Thu, 02 Jun 2022 01:13:12 GMT',
      'If-None-Match': 'W/"62980ea8-eea6"',
      "Upgrade-Insecure-Requests": "1",
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
}

Running the spider at this point produces the following log message:

INFO: Ignoring response <304 http://www.ts.gov.cn/col/col1300641/index.html?uid=4365453&pageNum=1>: HTTP status code is not handled or not allowed

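The log points to the request headers as the cause: the server answered 304, and Scrapy ignored that response because the status code is not handled. Yet I had copied the headers verbatim, without leaving any out, and nothing looked wrong.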

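After some investigation, I found that the two headers If-Modified-Since and If-None-Match are what the server uses to decide whether the client's cached copy is still the latest version of the resource.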

If the server decides that you have already cached the latest version, it does not send the data again and returns HTTP 304 Not Modified instead.
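The server-side decision described above can be sketched in a few lines of Python. This is only an illustration of how conditional requests are generally evaluated, not the actual logic running on www.ts.gov.cn. The function name `conditional_status` and the assumption that the server's current ETag and Last-Modified values equal the ones in the copied headers are hypothetical.

```python
from email.utils import parsedate_to_datetime

# Assumed current validators of the resource (hypothetical: taken from the copied headers).
ETAG = 'W/"62980ea8-eea6"'
LAST_MODIFIED = "Thu, 02 Jun 2022 01:13:12 GMT"


def strip_weak(etag):
    """Drop the weak-validator prefix so two tags can be compared loosely."""
    etag = etag.strip()
    return etag[2:] if etag.startswith("W/") else etag


def conditional_status(request_headers, etag=ETAG, last_modified=LAST_MODIFIED):
    """Return 304 if the client's cached copy looks current, otherwise 200."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since when both are sent.
        candidates = [strip_weak(tag) for tag in if_none_match.split(",")]
        if "*" in candidates or strip_weak(etag) in candidates:
            return 304
        return 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        # Not modified if the resource has not changed since the client's date.
        if parsedate_to_datetime(last_modified) <= parsedate_to_datetime(if_modified_since):
            return 304

    # No conditional headers: the server has to send the full body.
    return 200


with_cache_headers = {
    "If-Modified-Since": "Thu, 02 Jun 2022 01:13:12 GMT",
    "If-None-Match": 'W/"62980ea8-eea6"',
}
print(conditional_status(with_cache_headers))  # conditional request
print(conditional_status({}))                  # unconditional request
```

With the two cache headers present the sketch reports "not modified", and without them it falls through to a full response, which matches the behaviour seen in the log.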

Solution

Delete these two headers so that the server knows you do not hold the latest data.

The request headers after the deletion:

DEFAULT_REQUEST_HEADERS = {
      "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
      "Accept-Encoding": "gzip, deflate",
      "Accept-Language": "en-US,en;q=0.9",
      "Cache-Control": "max-age=0",
      "Connection": "keep-alive",
      "Cookie": "_gscu_899679477=54252423c3ye2110; TSSESSIONID=7cbc5c00-5074-45e9-b959-a5ca0e79365d; _gscbrs_899679477=1; _gscs_899679477=54252423j8ahp610|pv:4; SERVERID=b2ba659a0bf802d127f2ffc5234eeeba|1654254468|1654254464",
      "Host": "www.ts.gov.cn",
      "Upgrade-Insecure-Requests": "1",
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
}

Run the spider again and it receives a normal response.
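When headers are pasted from a browser regularly, the same cleanup could also be done in code rather than by hand. The following is a minimal sketch under that assumption; the helper name `strip_conditional_headers` is hypothetical and not part of Scrapy, and the sample dictionary is a shortened copy of the headers used in this post.

```python
# Header names that turn a request into a conditional (cache-validating) one.
CONDITIONAL_HEADERS = {"if-modified-since", "if-none-match"}


def strip_conditional_headers(headers):
    """Return a copy of headers without the cache-validation fields.

    The comparison is case-insensitive because HTTP header names are.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in CONDITIONAL_HEADERS
    }


# Shortened sample of headers copied from the browser.
browser_headers = {
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.ts.gov.cn",
    "If-Modified-Since": "Thu, 02 Jun 2022 01:13:12 GMT",
    "If-None-Match": 'W/"62980ea8-eea6"',
    "Upgrade-Insecure-Requests": "1",
}

cleaned = strip_conditional_headers(browser_headers)
print(sorted(cleaned))
```

The resulting dictionary could then be assigned to DEFAULT_REQUEST_HEADERS in the project settings, so the two cache headers never reach the server.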

Article link: https://www.haomeiwen.com/subject/imyhmrtx.html