
Automatically Download Images from a Web Page

Author: hubert1002 | Published 2017-10-13 18:36

    A simple web scraper that finds the image links in a page and downloads them, written in Python. The .py file can be run directly, but only from the command line, which is not very convenient, so python-script-converter or pyinstaller is used to turn the .py file into an executable that runs on double-click.

    The code, img.py (note that it targets Python 2):

    #encoding:UTF-8
    import sys
    import urllib
    import re
    import os
    from bs4 import BeautifulSoup
    
    def getImg(url):
        html = urllib.urlopen(url)
        page = html.read()
        soup = BeautifulSoup(page, "html.parser")
        imglist = soup.find_all('img')  # collect every <img ...> tag on the page
        length = len(imglist)           # number of image tags found
        path = sys.path[0]
        print(path)
        pathArg = sys.argv[0]
        print(pathArg)
        filePath = os.path.dirname(os.path.realpath(pathArg))  # save images next to the script
        print(filePath)
        for i in range(length):
            try:
                imageUrl = getImageUrl(imglist[i])
                index = i + 1
                print('[{0}-{1}]{2}'.format(index, length, imageUrl))
                if(len(imageUrl) > 0):
                    urllib.urlretrieve(imageUrl, filePath + '/' + '%s.jpg' % index)

            except Exception as e:
                print(e)
    
    def getImageUrl(item):
        imageUrl = ""

        if(item.has_attr('src')):
            imageUrl = item.attrs['src']
        elif(item.has_attr('data-src')):
            imageUrl = item.attrs['data-src']
        else:
            print(item)
            # fall back to any other attribute whose name contains 'src'
            for i in item.attrs:
                if 'src' in i:
                    imageUrl = item.attrs[i]
                    break

        return getRealUrl(imageUrl)
    
    
    def getRealUrl(url):
        # keep the URL only if it is an absolute http/https link
        reg = r'^https?://'
        imgre = re.compile(reg)
        imglist = re.findall(imgre, url)
        totalSize = len(imglist)
        realUrl = ""
        if(totalSize > 0):
            realUrl = url
        return realUrl
    
    
    
    if(len(sys.argv)>1):
        url = sys.argv[1]
        print("url = "+sys.argv[1])
        getImg(url)
    else:
        url = raw_input("please input url:")
        print(url)
        # url = "https://mp.weixin.qq.com/s/SBM1gq5i7ZfrE4GMBzK6dw"
        getImg(url)
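
    The listing above is Python 2 only (urllib.urlopen, urllib.urlretrieve and raw_input no longer exist under those names in Python 3). As a rough sketch of how the same logic could look on Python 3 with urllib.request, not part of the original script:

    # img3.py -- hypothetical Python 3 adaptation of img.py (illustration only)
    import os
    import re
    import sys
    import urllib.request

    from bs4 import BeautifulSoup


    def get_image_url(tag):
        # prefer src, then data-src, then any other attribute whose name contains 'src'
        for attr in ('src', 'data-src'):
            if tag.has_attr(attr):
                return tag[attr]
        for name, value in tag.attrs.items():
            if 'src' in name:
                return value
        return ""


    def get_img(url):
        page = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(page, "html.parser")
        imgs = soup.find_all('img')                                # every <img> tag on the page
        out_dir = os.path.dirname(os.path.realpath(sys.argv[0]))  # save next to the script
        for index, tag in enumerate(imgs, start=1):
            image_url = get_image_url(tag)
            print('[{0}-{1}]{2}'.format(index, len(imgs), image_url))
            if re.match(r'https?://', image_url):                  # skip relative or empty links
                try:
                    urllib.request.urlretrieve(image_url, os.path.join(out_dir, '%s.jpg' % index))
                except Exception as e:
                    print(e)


    if __name__ == '__main__':
        url = sys.argv[1] if len(sys.argv) > 1 else input("please input url:")
        get_img(url)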
    
    
    

    Usage

    1. Create a new folder
    2. Copy img.py into the new folder
    3. Run the command below (xxx is the page URL); the images are downloaded into the current folder (a full example follows this list)
      python img.py xxx
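
    Putting the steps together, a typical run looks like this (the URL is only a placeholder; use the page you actually want to scrape):

      mkdir imgpage
      cp img.py imgpage/
      cd imgpage
      python img.py https://example.com/some-article
      # the images are saved next to img.py as 1.jpg, 2.jpg, ...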

    Converting the .py file to an executable

    1. python-script-converter
      https://github.com/ZYunH/Python-script-converter/blob/master/Readme-cn.md
    psc test.py 2
    chmod +x img.command
    
    2. pyinstaller
    pyinstaller -F img.py
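
    With a standard pyinstaller setup, -F bundles everything into a single file placed under dist/; a minimal session might look like this:

      pip install pyinstaller
      pyinstaller -F img.py
      ./dist/img    # run (or double-click) the bundled executable; with no argument it prompts for the URL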
    
