Advanced Python Crawler in Practice: Fast, Efficient Multithreaded Image Scraping

Author: 25岁学Python | Published: 2019-12-29 14:21 | Read 0 times

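    Fast, efficient multithreaded image scraping

    Building on the earlier scraping code, we wrap the logic in functions and add multithreading.

    The earlier code: https://www.cnblogs.com/pythonywy/p/11066842.html

    Module to import: from concurrent import futures

    ex = futures.ThreadPoolExecutor(max_workers=22)  # set the number of worker threads

    ex.submit(function, arguments to pass to the function)  # schedule the call on a worker thread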
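The two calls above can be tried without any network access. The following is a minimal sketch, assuming a stand-in task: `fake_download` and `run_in_pool` are hypothetical names that do not appear in the original script, and the sleep only imitates a slow download.

```python
import time
from concurrent import futures


def fake_download(name, delay=0.1):
    """Hypothetical stand-in for a download: sleep, then report the name."""
    time.sleep(delay)
    return f'{name} done'


def run_in_pool(names, max_workers=4):
    # One executor serves the whole batch; leaving the with-block
    # waits until every submitted task has finished.
    with futures.ThreadPoolExecutor(max_workers=max_workers) as ex:
        # submit(fn, *args) schedules fn(*args) and returns a Future right away.
        pending = [ex.submit(fake_download, name) for name in names]
        # result() blocks until that task is done and re-raises its exception, if any.
        return [f.result() for f in pending]


if __name__ == '__main__':
    print(run_in_pool(['a.jpg', 'b.jpg', 'c.jpg']))
```

Because the results are read in submission order, the returned list lines up with the input names even though the tasks run concurrently.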

    import os
    import requests
    from lxml import etree
    from concurrent import futures  # thread pool for multithreading
    
    url = 'http://www.doutula.com/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',}
    def img_url_lis(url):
        # Return the image URLs found in the data-original attributes of one list page
        response = requests.get(url, headers=headers)
        response.encoding = 'utf8'
        response_html = etree.HTML(response.text)
        urls = response_html.xpath('.//img/@data-original')
        return urls
    
    # Create the image folder next to this script
    img_file_path = os.path.join(os.path.dirname(__file__), 'img')
    if not os.path.exists(img_file_path):  # create the folder if it does not exist yet
        os.mkdir(img_file_path)
    print(img_file_path)
    
    def dump_one_img(url):
        # Download one image and save it under the last path segment of its URL
        name = str(url).split('/')[-1]
        response = requests.get(url, headers=headers)
        img_path = os.path.join(img_file_path, name)
        with open(img_path, 'wb') as fw:
            fw.write(response.content)
    
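    def dump_imgs(urls: list):
        # Use one pool for the whole page rather than a new pool per URL;
        # leaving the with-block waits for every download to finish.
        with futures.ThreadPoolExecutor(max_workers=22) as ex:
            for url in urls:
                ex.submit(dump_one_img, url)  # function, then its argument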
    
    def run():
        count = 1
        while True:
            if count == 10:  # page 10 is skipped
                count += 1
                continue
            lis = img_url_lis(f'http://www.doutula.com/article/list/?page={count}')
            if len(lis) == 0:  # no images found: print the page number and stop
                print(count)
                break
            dump_imgs(lis)
            print(f'Page {count} done')
            count += 1
    
    if __name__ == '__main__':
        run()
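In the script above, the Future objects returned by ex.submit are never inspected, so an exception raised inside dump_one_img (a timeout, a bad URL) goes unnoticed. One possible way to surface failures is sketched below. It is not part of the original code: `download_all` and `flaky` are hypothetical names, and `flaky` only simulates a failing download so the example runs offline.

```python
from concurrent import futures


def download_all(urls, download, max_workers=22):
    """Run download(url) for every URL in one pool; return (ok, failed)."""
    ok, failed = [], []
    with futures.ThreadPoolExecutor(max_workers=max_workers) as ex:
        # Remember which URL each Future belongs to.
        future_to_url = {ex.submit(download, url): url for url in urls}
        # as_completed yields each Future as soon as its task finishes.
        for fut in futures.as_completed(future_to_url):
            url = future_to_url[fut]
            try:
                fut.result()  # re-raises any exception from the worker thread
            except Exception as exc:
                failed.append((url, exc))
            else:
                ok.append(url)
    return ok, failed


if __name__ == '__main__':
    def flaky(url):
        # Hypothetical downloader that fails on one URL.
        if url.endswith('bad.jpg'):
            raise ValueError('simulated failure')

    ok, failed = download_all(['a.jpg', 'bad.jpg', 'c.jpg'], flaky)
    print(sorted(ok), [u for u, _ in failed])
```

In the real script, dump_one_img could be passed as the `download` argument, and the failed list could then be logged or retried.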
    
