- Printing response.text shows garbled text.
- Printing response.encoding shows utf-8.
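When writing a crawler in Python, some websites run checks to block crawlers, so we add request headers to make the request look like a normal browser visit. For example, when writing a crawler with scrapy, we set DEFAULT_REQUEST_HEADERS in settings.py, and in it we set Accept-Encoding to "gzip, deflate, br". The site may then compress its response with br (Brotli). If the Brotli library is not installed in the Python interpreter the project uses (for example, the one configured in PyCharm), the response body cannot be decompressed and the text comes out garbled.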
DEFAULT_REQUEST_HEADERS = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
# br is listed here, so the response may come back garbled; the interpreter needs Brotli installed: pip install Brotli
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "zh-CN,zh;q=0.9",
"Cache-Control": "max-age=0",
"Cookie": "resolution=1080*1920; Hm_lvt_c826b0776d05b85d834c5936296dc1d5=1686822404; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%22188228516787b0-0579a4d65a09f9-1c525634-2073600-188228516791df7%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22identities%22%3A%22eyIkaWRlbnRpdHlfY29va2llX2lkIjoiMTg4MjI4NTE2Nzg3YjAtMDU3OWE0ZDY1YTA5ZjktMWM1MjU2MzQtMjA3MzYwMC0xODgyMjg1MTY3OTFkZjcifQ%3D%3D%22%2C%22history_login_id%22%3A%7B%22name%22%3A%22%22%2C%22value%22%3A%22%22%7D%2C%22%24device_id%22%3A%22188228516787b0-0579a4d65a09f9-1c525634-2073600-188228516791df7%22%7D; kk_s_t=1687169079723",
"If-None-Match": "27733-wHpibHGyRBeG+tUml+dq3EKDpIc",
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": "macOS",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
}
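Before changing the headers, it can help to confirm whether a Brotli decoder is actually installed in the interpreter that runs the crawler. The snippet below is a minimal sketch, not part of the original project: the helper name `brotli_available` is hypothetical, and the assumption is that the decoder is provided by either the `brotli` module (installed with `pip install Brotli`) or the `brotlicffi` module.

```python
import importlib.util


def brotli_available() -> bool:
    """Return True if a Brotli decoder module can be found by this interpreter.

    Assumption: the decoder comes from the `brotli` module (pip install Brotli)
    or from `brotlicffi`; the function only looks them up, it does not import them.
    """
    candidates = ("brotli", "brotlicffi")
    return any(importlib.util.find_spec(name) is not None for name in candidates)


if __name__ == "__main__":
    if brotli_available():
        print("A Brotli decoder is installed; br responses can be decompressed.")
    else:
        print("No Brotli decoder found; install Brotli or drop br from Accept-Encoding.")
```

Run it with the same interpreter that runs the crawler, since the project interpreter selected in PyCharm may differ from the system one.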
Summary:
1. Remove br from Accept-Encoding.
2. Install the Brotli library (pip install Brotli).
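The first fix can be sketched as a small helper that strips one encoding from an Accept-Encoding value instead of editing the string by hand. This is only an illustration under stated assumptions: the function name `drop_encoding` is hypothetical, and it assumes the header is a plain comma-separated list, optionally with `;q=` weights.

```python
def drop_encoding(accept_encoding: str, unwanted: str = "br") -> str:
    """Return the Accept-Encoding value without the unwanted encoding.

    Hypothetical helper: entries are compared case-insensitively and any
    ";q=..." weight on the remaining entries is preserved.
    """
    kept = []
    for item in accept_encoding.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty entries such as a trailing comma
        name = item.split(";", 1)[0].strip().lower()
        if name != unwanted.lower():
            kept.append(item)
    return ", ".join(kept)


# Example with the value used in this post.
headers = {"Accept-Encoding": "gzip, deflate, br"}
headers["Accept-Encoding"] = drop_encoding(headers["Accept-Encoding"])
print(headers["Accept-Encoding"])
```

With br removed, the server should fall back to gzip, deflate, or no compression, so the Brotli library is no longer needed.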