Target URL: http://www.ts.gov.cn/col/col1300641/index.html
The spider file is as follows:
import scrapy


class TaishungovSpider(scrapy.Spider):
    name = 'TaiShunGov'
    # allowed_domains = ['www.ts.gov.cn']
    # start_urls = ['http://www.ts.gov.cn/']

    def start_requests(self):
        # Only page 1 is requested here; the original note on this line was "29".
        for index in range(1, 2):
            yield scrapy.Request(
                url='http://www.ts.gov.cn/col/col1300641/index.html?uid=4365453&pageNum={}'.format(index),
                callback=self.parse,
                dont_filter=True)

    def parse(self, response):
        # Dump the raw body to confirm that the page was actually returned.
        print(response.body)
The request headers are configured as follows:
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Cookie": "_gscu_899679477=54252423c3ye2110; TSSESSIONID=7cbc5c00-5074-45e9-b959-a5ca0e79365d; _gscbrs_899679477=1; _gscs_899679477=54252423j8ahp610|pv:4; SERVERID=b2ba659a0bf802d127f2ffc5234eeeba|1654254468|1654254464",
    "Host": "www.ts.gov.cn",
    "If-Modified-Since": "Thu, 02 Jun 2022 01:13:12 GMT",
    "If-None-Match": 'W/"62980ea8-eea6"',
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
}
Running the spider at this point produces the following log message:
INFO: Ignoring response <304 http://www.ts.gov.cn/col/col1300641/index.html?uid=4365453&pageNum=1>: HTTP status code is not handled or not allowed
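At first glance the log suggests that something is wrong with the request headers, but I had copied them verbatim from the browser, not one missing, and they looked fine.
After some digging I found that If-Modified-Since and If-None-Match are the two fields the server uses to check whether the client's cached copy is still the latest version of the resource.
If the server concludes that you already have the latest version cached, it does not send the data again and returns HTTP 304 Not Modified instead.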
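The check described above can be sketched roughly as follows. This is a simplified illustration of how a server might evaluate the two fields, not the actual logic running on www.ts.gov.cn: the function names conditional_status and _strip_weak are hypothetical, the ETag and date are the values from the headers above, and the server's current state is assumed to match them.

```python
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop the weak-validator prefix so W/"x" and "x" compare equal."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(request_headers, current_etag, last_modified):
    """Return 304 if the client's cached copy looks current, else 200.

    Simplified: the "*" wildcard and malformed dates are not handled.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When If-None-Match is present it decides the outcome on its own.
        client_tags = {_strip_weak(t) for t in if_none_match.split(",")}
        return 304 if _strip_weak(current_etag) in client_tags else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        # Not modified if the resource is no newer than the client's copy.
        if parsedate_to_datetime(last_modified) <= parsedate_to_datetime(if_modified_since):
            return 304
    return 200


# Assumed server state: identical to what the browser had cached.
ETAG = 'W/"62980ea8-eea6"'
MODIFIED = "Thu, 02 Jun 2022 01:13:12 GMT"

browser_headers = {"If-None-Match": ETAG, "If-Modified-Since": MODIFIED}
print(conditional_status(browser_headers, ETAG, MODIFIED))  # 304
print(conditional_status({}, ETAG, MODIFIED))               # 200
```

With both fields present and matching, the sketch answers 304; with neither field, it always answers 200, which is the behaviour the fix relies on.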
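Solution
Delete these two fields so that the server knows you do not hold the latest data.
After deleting them, the request headers look like this: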
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Cookie": "_gscu_899679477=54252423c3ye2110; TSSESSIONID=7cbc5c00-5074-45e9-b959-a5ca0e79365d; _gscbrs_899679477=1; _gscs_899679477=54252423j8ahp610|pv:4; SERVERID=b2ba659a0bf802d127f2ffc5234eeeba|1654254468|1654254464",
    "Host": "www.ts.gov.cn",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
}
Run the spider again at this point and it receives a normal response.
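If headers are pasted from the browser often, the same cleanup can be done in code instead of by hand. The helper below is a minimal sketch; the names CONDITIONAL_HEADERS and strip_conditional_headers are hypothetical and not part of Scrapy, and the header dict is shortened for illustration.

```python
# Header names (lowercase) that make the server validate a cached copy.
CONDITIONAL_HEADERS = {"if-modified-since", "if-none-match"}


def strip_conditional_headers(headers):
    """Return a copy of headers without the cache-validation fields."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in CONDITIONAL_HEADERS
    }


# Shortened version of the headers copied from the browser.
copied_from_browser = {
    "Host": "www.ts.gov.cn",
    "If-Modified-Since": "Thu, 02 Jun 2022 01:13:12 GMT",
    "If-None-Match": 'W/"62980ea8-eea6"',
    "Upgrade-Insecure-Requests": "1",
}

cleaned = strip_conditional_headers(copied_from_browser)
print(sorted(cleaned))  # ['Host', 'Upgrade-Insecure-Requests']
```

The result could then be assigned to DEFAULT_REQUEST_HEADERS in settings.py. The name comparison is case-insensitive because HTTP header names are case-insensitive.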