HTTP Error 403 when fetching Jianshu's robots.txt

Author: pppppkun | Posted 2019-04-17 20:53

    First, here is the original code, copied straight from the book:

    from urllib.robotparser import RobotFileParser
    from urllib.request import urlopen

    # Fetch robots.txt and feed its lines to the parser
    rp = RobotFileParser()
    rp.parse(urlopen('http://www.jianshu.com/robots.txt').read().decode('utf-8').split('\n'))

    # Check whether any crawler ('*') may fetch these two URLs
    print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
    print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))
    

    PyCharm then reported the following error:

    Traceback (most recent call last):
      File "E:/PythonProject/PaChong/first.py", line 15, in <module>
        rp.parse((urlopen('http://www.jianshu.com/robots.txt').read().decode('utf-8').split('\n')))
      File "E:\Python\lib\urllib\request.py", line 222, in urlopen
        return opener.open(url, data, timeout)
      File "E:\Python\lib\urllib\request.py", line 531, in open
        response = meth(req, response)
      File "E:\Python\lib\urllib\request.py", line 641, in http_response
        'http', request, response, code, msg, hdrs)
      File "E:\Python\lib\urllib\request.py", line 563, in error
        result = self._call_chain(*args)
      File "E:\Python\lib\urllib\request.py", line 503, in _call_chain
        result = func(*args)
      File "E:\Python\lib\urllib\request.py", line 755, in http_error_302
        return self.parent.open(new, timeout=req.timeout)
      File "E:\Python\lib\urllib\request.py", line 531, in open
        response = meth(req, response)
      File "E:\Python\lib\urllib\request.py", line 641, in http_response
        'http', request, response, code, msg, hdrs)
      File "E:\Python\lib\urllib\request.py", line 569, in error
        return self._call_chain(*args)
      File "E:\Python\lib\urllib\request.py", line 503, in _call_chain
        result = func(*args)
      File "E:\Python\lib\urllib\request.py", line 649, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 403: Forbidden
    

    Look at the last line: HTTP Error 403: Forbidden.
    After some searching, the cause becomes clear: opening a URL with urllib.request.urlopen sends a bare request for the page, with no information about the browser or operating system behind it. urllib identifies itself only with its default User-Agent, which marks the request as a script rather than a browser, and many sites simply reject such requests.
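    You can see what urllib sends by default by inspecting a freshly built opener's headers (a quick side check, not part of the original fix):

    import urllib.request

    # A fresh opener carries only urllib's default headers --
    # this is everything the server learns about the client.
    opener = urllib.request.build_opener()
    print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.8')]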
    So how do we solve this? Simply add a User-Agent header to the request, as follows:

    from urllib.robotparser import RobotFileParser
    from urllib import request

    rp = RobotFileParser()
    # Identify ourselves as a browser so the server accepts the request
    headers = {
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    }
    url = 'http://www.jianshu.com/robots.txt'
    req = request.Request(url=url, headers=headers)
    response = request.urlopen(req)

    # Feed the fetched robots.txt lines to the parser
    rp.parse(response.read().decode('utf-8').split('\n'))
    print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
    print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))
    

    And with that, the problem is solved.
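    As a side note, RobotFileParser also offers a read() convenience method that fetches the URL itself via urllib's module-level urlopen. Installing a global opener with a browser-like User-Agent therefore makes read() work as well; here is a minimal sketch of that variant (not from the original post):

    from urllib import request
    from urllib.robotparser import RobotFileParser

    # Install a global opener whose User-Agent looks like a browser;
    # urlopen() -- and therefore RobotFileParser.read() -- will use it.
    opener = request.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')]
    request.install_opener(opener)

    rp = RobotFileParser()
    rp.set_url('http://www.jianshu.com/robots.txt')
    rp.read()  # fetch and parse robots.txt in one step
    print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))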
