Common Status Codes

Author: whong736 | Published: 2018-01-30 00:21
    Pose as a browser, request a page, and download it
    
    import urllib.request
    
    URL = "https://www.hao123.com/manhua/detail/176"
    
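    # addheaders expects a list of (name, value) tuples, not a dict
    headers = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Chrome/51.0.2704.63 Safari/537.36')]
    
    opener = urllib.request.build_opener()
    
    # Set the default headers sent with every request made through this opener
    opener.addheaders = headers
    
    # Fetch the page as raw bytes
    data = opener.open(URL).read()
    
    # Save the bytes to a local file; "with" closes the file even if the write fails
    with open("/Users/vincentwen/Downloads/file.html", "wb") as hfile:
        hfile.write(data)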
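
The same browser disguise can also be attached to a single request instead of to an opener. The following is a minimal sketch, assuming that a per-request header is acceptable for this use case; the helper names `build_request` and `download` are hypothetical and not part of the original post. Nothing is fetched when the block runs on its own: it only builds the request and prints the header it would send.

```python
import urllib.request

# Browser-like User-Agent string, same value as in the example above
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36")


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def download(url, path):
    """Fetch url with the spoofed header and write the response body to path."""
    with urllib.request.urlopen(build_request(url)) as response:
        data = response.read()
    with open(path, "wb") as out:
        out.write(data)


if __name__ == "__main__":
    req = build_request("https://www.hao123.com/manhua/detail/176")
    # urllib stores header names via str.capitalize(), hence "User-agent"
    print(req.get_header("User-agent"))
```

Calling `download(url, path)` with a real URL and a writable path would perform the actual fetch; whether a given site accepts this User-Agent is not something the sketch can guarantee.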
    
    
    
    

    Crawl the comic pages linked from the home page of a comic website

    import re
    import urllib.request
    import urllib.error
    
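    # Fetch the home page of the site to crawl
    raw = urllib.request.urlopen("http://www.pufei.net/").read()
    
    # Decode the response bytes to text, ignoring bytes that cannot be decoded
    data = raw.decode("utf-8", "ignore")
    
    # Regular expression that captures the full URL of every link under /manhua/
    # (urlretrieve needs a complete URL, not just the path segment)
    pat = r' href="(http://www\.pufei\.net/manhua/.*?/)"'
    
    # Find every matching URL in the page
    allurl = re.compile(pat).findall(data)
    
    for i, thisurl in enumerate(allurl):
        try:
            print("Fetching page " + str(i))
            file = "/Users/vincentwen/Downloads/" + str(i) + ".html"
            # Download the page and save it to a local file
            urllib.request.urlretrieve(thisurl, file)
            print("------Fetched successfully-----")
    
        except urllib.error.URLError as e:
            # HTTPError (a URLError subclass) carries a status code;
            # a plain URLError only carries a reason
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)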
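
The `except` branch above prints `e.code` as a bare number. As a minimal sketch, the standard library's `http.HTTPStatus` enum can turn such a number into its standard reason phrase; the helper name `describe_status` is hypothetical and the list of codes in the demo is only an illustrative selection.

```python
from http import HTTPStatus


def describe_status(code):
    """Return 'code phrase' for a known HTTP status code, else mark it unknown."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define
        return f"{code} (unknown status code)"
    return f"{code} {status.phrase}"


if __name__ == "__main__":
    for code in (200, 301, 403, 404, 500):
        print(describe_status(code))
```

In the crawler loop, `print(describe_status(e.code))` could stand in for `print(e.code)` to make failures easier to read.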
    
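
The link-matching step can be tried offline before running it against the live site. The sketch below makes two assumptions that go beyond the original post: it captures the whole URL in the regex group so each match can be handed directly to a downloader, and it drops duplicate links. `SAMPLE_HTML` is an invented fragment, not real content from the site, and `extract_links` is a hypothetical helper.

```python
import re

# Hypothetical HTML fragment standing in for the downloaded home page
SAMPLE_HTML = """
<a href="http://www.pufei.net/manhua/20/">A</a>
<a href="http://www.pufei.net/manhua/419/">B</a>
<a href="http://www.pufei.net/about/">About</a>
<a href="http://www.pufei.net/manhua/20/">A again</a>
"""

# Capture the whole URL of every link under /manhua/
LINK_PATTERN = re.compile(r'href="(http://www\.pufei\.net/manhua/.*?/)"')


def extract_links(html):
    """Return matching URLs in page order with duplicates removed."""
    seen = set()
    links = []
    for url in LINK_PATTERN.findall(html):
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML):
        print(link)
```

Deduplicating keeps the crawler from downloading the same page twice when the home page links to it from several places.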
    

        Article link: https://www.haomeiwen.com/subject/offqzxtx.html