A pitfall worth noting: encoding problems in requests

Author: Gambler_194b | Published 2018-09-23 19:13

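Two days ago, while writing a crawler for CNR (央广网, cnr.cn), I ran into a strange problem: my XPath queries kept returning nothing, even though the data was plainly in the page source and was not injected by JavaScript. I was stuck on it for a long time before finding out that it was an encoding problem.
Python has a character-encoding detection library called chardet. requests can also detect the encoding on its own, but its guess is sometimes inaccurate, and calling chardet directly works fairly well.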
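
To make the failure mode concrete, here is a minimal sketch that uses only the standard library. The sample string is made up for illustration (it is not copied from the site), and it is encoded as gb18030 on the assumption that the page is served in a GB-family encoding. Decoding those bytes as ISO-8859-1, which is what requests falls back to for text responses whose Content-Type header carries no charset, never raises but produces garbage, while a strict UTF-8 decode fails loudly.

```python
# Hypothetical page fragment, used only for illustration.
sample_text = '<div class="TRS_Editor"><p>央广网新闻正文</p></div>'

# Assumption: the server sends the page in a GB-family encoding.
raw = sample_text.encode('gb18030')

# ISO-8859-1 maps every byte to a character, so this never raises,
# but the Chinese characters come out as mojibake.
wrong = raw.decode('iso-8859-1')

# Decoding with the encoding that was actually used restores the text.
right = raw.decode('gb18030')


def is_valid_utf8(data):
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


print(wrong == sample_text)   # False
print(right == sample_text)   # True
print(is_valid_utf8(raw))     # False
```

A silent mis-decode like `wrong` is the dangerous case: nothing raises, yet the text the parser sees no longer matches what is in the page source.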

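    import chardet
    import requests
    from lxml import etree


    def test():
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
        }
        # A detail page picked at random from cnr.cn for testing
        url = 'http://news.cnr.cn/theory/gc/20180914/t20180914_524360107.shtml'
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 200:
            html = update_code(resp.content)
            try:
                tree = etree.HTML(html)
            except ValueError:
                # lxml rejects a str that carries an XML encoding declaration,
                # so fall back to parsing the raw bytes
                tree = etree.HTML(resp.content)

            article = ''.join(tree.xpath('//div[@class="TRS_Editor"]/p//text()'))
            # Without the update_code() call above, this prints an empty string
            print(article)


    # Can be kept as a reusable module; which encodings to try depends on the site
    def update_code(content):
        # Try strict UTF-8 first, then gb18030 (a superset of GBK and GB2312)
        for encoding in ('utf-8', 'gb18030'):
            try:
                return content.decode(encoding)
            except UnicodeDecodeError:
                pass
        # Both failed: ask chardet, which may return None or an unknown codec
        guess = chardet.detect(content)['encoding']
        if guess:
            try:
                return content.decode(guess)
            except (LookupError, UnicodeDecodeError):
                pass
        # Last resort: decode as gb18030 and drop the undecodable bytes
        return content.decode('gb18030', errors='ignore')


    test()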

Only with this decoding step in place does the XPath query return the data.
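
Since the right encoding depends on the site, one possible variation is to pass the candidate encodings in as a parameter. The sketch below is standard library only and leaves chardet out. The name `decode_with_candidates` and its `candidates` parameter are hypothetical, chosen for this example. It also returns the codec that succeeded, which helps when checking a new site.

```python
def decode_with_candidates(content, candidates=('utf-8', 'gb18030')):
    """Decode bytes with the first candidate codec that accepts them strictly.

    Returns (text, encoding). If no candidate decodes cleanly, the last
    candidate is used with errors='replace' so the caller still gets text.
    """
    for encoding in candidates:
        try:
            return content.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    fallback = candidates[-1]
    return content.decode(fallback, errors='replace'), fallback


title = '央广网新闻'
print(decode_with_candidates(title.encode('utf-8')))    # ('央广网新闻', 'utf-8')
print(decode_with_candidates(title.encode('gb18030')))  # ('央广网新闻', 'gb18030')
```

Putting utf-8 first is deliberate: text in a GB encoding almost never forms valid UTF-8, so a strict UTF-8 attempt fails fast on GB pages, whereas trying gb18030 first can occasionally accept UTF-8 bytes and return the wrong characters.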