
Summary of Problems Encountered While Web Scraping

Author: 狂浪的心 | Published 2018-09-10 22:55

    1. Chinese character encoding problems in the fetched HTML

    import requests
    from bs4 import BeautifulSoup

    newsurl = "http://news.sina.com.cn/china"
    res = requests.get(newsurl)
    # Parsing res.text directly: requests has guessed the wrong encoding, so the text is garbled
    soup = BeautifulSoup(res.text, "lxml")
    news_item = soup.select(".news-item")
    print(news_item[0].select("h2")[0].text)
    

    Result:

    ����������� �止�
    

    Solution: when the response headers do not declare a charset, requests typically falls back to ISO-8859-1, so res.text comes out garbled. Re-encoding that text with the guessed encoding and decoding the resulting bytes as UTF-8 recovers the correct Chinese characters:

    import requests
    from bs4 import BeautifulSoup

    newsurl = "http://news.sina.com.cn/china"
    res = requests.get(newsurl)
    # Re-encode the mis-decoded text with the guessed encoding, then decode the bytes as UTF-8
    soup = BeautifulSoup(res.text.encode(res.encoding).decode('utf-8'), "lxml")
    news_item = soup.select(".news-item")
    print(news_item[0].select("h2")[0].text)
    

    Result:

    半月谈:政务公开渠道多干货少 各地无统一标准
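
    A commonly used alternative, not from the original post, is to hand BeautifulSoup the raw bytes (res.content) and let it detect the encoding itself, or to set res.encoding explicitly before reading res.text:

    import requests
    from bs4 import BeautifulSoup

    newsurl = "http://news.sina.com.cn/china"
    res = requests.get(newsurl)
    # res.content is the undecoded byte string; BeautifulSoup sniffs the charset on its own.
    # Alternatively: res.encoding = res.apparent_encoding, then parse res.text as before.
    soup = BeautifulSoup(res.content, "lxml")
    news_item = soup.select(".news-item")
    print(news_item[0].select("h2")[0].text)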
    

    2. Errors after the crawler has been running for a long time

    urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
    

    Solution 1: set a browser-style User-Agent request header. Some sites reset connections from clients that keep sending the default python-requests User-Agent, so spoofing a browser UA can reduce these errors:

    # Start from requests' default headers and override only the User-Agent
    headers = requests.utils.default_headers()
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
    #headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36'
    r = requests.get('https://academic.oup.com/journals', headers=headers)
    

    Solution 2: change the outgoing IP address, for example by sending requests through a proxy, as sketched below.
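
    A minimal sketch of routing requests through a proxy; the proxy addresses below are placeholders, not from the original post, and should come from your own proxy pool or provider:

    import requests

    # Placeholder proxy endpoints -- substitute real ones from a proxy pool
    proxies = {
        "http": "http://10.10.1.10:3128",
        "https": "http://10.10.1.10:1080",
    }

    r = requests.get("http://news.sina.com.cn/china", proxies=proxies, timeout=10)
    print(r.status_code)

    Rotating through several such proxies (picking a different entry per request) spreads the traffic across IPs, which is the usual way to apply this fix in a long-running crawler.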
