Python Crawler Notes (3): Scraping with the requests and lxml libraries

Author: 坐下等雨 | Published 2018-11-03 22:34

    After scraping text, today let's practice on images. The practice site is 居然搞笑网 (zbjuran.com).
    And there's an unexpected bonus: the animated GIFs there are not only funny, but also easy on the eyes~~
    Alright, let's give it a try.

    1. Since the code is quite simple, only a dozen or so lines, let's start with the code itself
    import requests
    from lxml import etree
    import time


    def get_img(url):
        # Fetch one list page and let requests pick the most likely encoding.
        r = requests.get(url, headers=headers)
        r.encoding = r.apparent_encoding
        html = etree.HTML(r.text)
        # Collect the src attribute of every image inside the item text blocks.
        img_urls = html.xpath("//div[@class='item']/div[@class='text']/p/img/@src")
        for img in img_urls:
            img = str(img)
            # The part after the last slash becomes the local file name.
            img_name = img.split('/')[-1]
            pic = requests.get(img, headers=headers)
            # The directory D:\pics\pic_gif must already exist.
            with open('D:\\pics\\pic_gif\\' + img_name, 'wb') as f:
                print('Downloading:', img_name)
                f.write(pic.content)


    if __name__ == '__main__':
        urls = ['https://www.zbjuran.com/dongtai/list_4_{}.html'.format(i) for i in range(1, 2475)]

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
        }
        for url in urls:
            get_img(url)
            time.sleep(0.5)
    
    2. Code explanation
    import requests
    from lxml import etree
    import time
    

    First, import the libraries we need; the time module is used to space out successive requests so the crawler is less likely to get blocked.

    def get_img(url):
        r = requests.get(url, headers=headers)
        r.encoding = r.apparent_encoding
        html = etree.HTML(r.text)
        img_urls = html.xpath("//div[@class='item']/div[@class='text']/p/img/@src")
    

    Define a function that fetches the images: it parses the page returned by requests with XPath, and the resulting img_urls is a list holding the URLs of all images found on that page.
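    To see what that XPath actually selects, here is a small self-contained sketch run against a made-up HTML fragment that mimics the page structure (the fragment is an assumption, not copied from the site):

    from lxml import etree

    # Hypothetical fragment mimicking the layout the XPath above expects.
    sample = """
    <div class="item">
      <div class="text">
        <p><img src="https://www.zbjuran.com/uploads/demo.gif"/></p>
      </div>
    </div>
    """
    html = etree.HTML(sample)
    print(html.xpath("//div[@class='item']/div[@class='text']/p/img/@src"))
    # -> ['https://www.zbjuran.com/uploads/demo.gif']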

        for img in img_urls:
            img = str(img)
            img_name = img.split('/')[-1]
            pic = requests.get(img, headers=headers)
            with open('D:\\pics\\pic_gif\\' + img_name, 'wb') as f:
                print('Downloading:', img_name)
                f.write(pic.content)
    

    Iterate over the image URLs in the list and use split() to take the part after the last slash as the image's file name,
    then open the file with open() in binary mode and write the downloaded image data to it.
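    For example (the image URL below is made up, just to illustrate the pattern):

    img = 'https://www.zbjuran.com/uploads/allimg/demo.gif'  # hypothetical URL
    img_name = img.split('/')[-1]
    print(img_name)  # demo.gif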

    if __name__ == '__main__':
        urls = ['https://www.zbjuran.com/dongtai/list_4_{}.html'.format(i) for i in range(1, 2475)]

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
        }
        for url in urls:
            get_img(url)
            time.sleep(0.5)
    

    The main entry point: a list comprehension builds the URL of every list page, a for loop then feeds each URL to get_img() to download that page's images, and time.sleep() pauses for 0.5 seconds between pages.


    Not sure whether it's the network today or a problem with my code, but the images download really slowly. Looks like it's time to learn how to use multithreading~
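    As a first step in that direction, here is a minimal sketch (my own assumption, not code from this article) that drives the get_img() function above with a thread pool from the standard concurrent.futures module; the worker count and page range are arbitrary illustrative choices:

    from concurrent.futures import ThreadPoolExecutor

    # Minimal sketch: download several list pages in parallel by reusing get_img().
    # max_workers=8 and range(1, 50) are arbitrary; the per-page sleep is dropped here.
    urls = ['https://www.zbjuran.com/dongtai/list_4_{}.html'.format(i) for i in range(1, 50)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        pool.map(get_img, urls)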
