A simple crawler in Python: batch-downloading the images on a web page

Author: 十二月的水瓶座 | Published 2017-07-07 11:16


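I am working on a Mac Pro. macOS ships with Python 2.7. I downloaded Python 3.6 and installed it following the installer's steps, but the terminal still defaulted to Python 2.7, so the default has to be switched to Python 3.6.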

Open ~/.bash_profile and add an alias, so that the python command runs 3.6 by default:

alias python="/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6"
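To confirm that the alias took effect, a short check like the one below can be run with the python command from a new terminal session (the alias is only read when the shell loads ~/.bash_profile). This is a minimal sketch; the helper name is_python3 is made up for illustration.

```python
import sys


def is_python3():
    """Return True when the running interpreter is Python 3 or newer."""
    return sys.version_info.major >= 3


# Show which interpreter binary is running and its version number.
print(sys.executable)
print(sys.version_info[:3])
print("Python 3:", is_python3())
```

If the printed path still points at the system Python 2.7, the alias has not been loaded by the current shell.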

Now straight to the code:

````

import os
import re
import ssl
import urllib.request


def getHtml(url):
    # Disable SSL certificate verification globally
    ssl._create_default_https_context = ssl._create_unverified_context
    page = urllib.request.urlopen(url)
    html = page.read()  # read the raw bytes served at the URL
    return html


def getImg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'  # regex that captures .jpg image URLs
    # reg = r'src="(.+?\.jpg)"'
    imgre = re.compile(reg)  # compile the pattern into a regex object
    html = html.decode('utf-8')  # Python 3: decode the bytes into a str as UTF-8
    imglist = re.findall(imgre, html)  # every URL captured by imgre in the page
    print(imglist)
    os.makedirs('./resource', exist_ok=True)  # urlretrieve does not create the folder
    x = 0
    for imgurl in imglist:
        # Download the remote file to a local path. The second argument is the
        # destination; a bare file name saves into the current directory.
        urllib.request.urlretrieve(imgurl, './resource/%s.jpg' % x)
        x += 1
    return imglist


html = getHtml('http://tieba.baidu.com/p/2460150866')
# html = getHtml('http://www.sj33.cn/dphoto/stsy/200908/20652_4.html')
print(getImg(html))

````
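The download loop above stops at the first URL that cannot be fetched, and it assumes every captured link is an absolute URL. A hedged sketch of one way to make that step more forgiving follows; the function names plan_downloads and download_all, the example.com addresses, and the choice to skip failed downloads are all assumptions made for illustration, not part of the original script.

```python
import os
from urllib.parse import urljoin
from urllib.request import urlretrieve


def plan_downloads(page_url, img_srcs, out_dir="./resource"):
    """Pair each image URL with a local file path, resolving relative links."""
    plan = []
    for index, src in enumerate(img_srcs):
        absolute = urljoin(page_url, src)  # absolute URLs pass through unchanged
        plan.append((absolute, os.path.join(out_dir, "%s.jpg" % index)))
    return plan


def download_all(plan):
    """Download every planned file and return the paths that were saved."""
    saved = []
    for url, path in plan:
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        try:
            urlretrieve(url, path)
            saved.append(path)
        except (OSError, ValueError) as err:  # URLError is a subclass of OSError
            print("skipped %s: %s" % (url, err))
    return saved


# Hypothetical inputs; nothing is downloaded until download_all(plan) is called.
plan = plan_downloads(
    "http://example.com/a/page.html",
    ["/img/1.jpg", "http://example.com/2.jpg"],
)
print(plan)
```

Keeping the planning step separate from the network step makes it possible to inspect the list of URLs before anything is fetched.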

A note on the regular expression:

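When crawling the .jpg images on http://tieba.baidu.com/p/2460150866, the regular expression is reg = r'src="(.+?\.jpg)" pic_ext'. Without pic_ext the correct data does not come back: with reg = r'src="(.+?\.jpg)"' the script fails with an error message.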

With reg = r'src="(.+?\.jpg)" pic_ext', however, the data comes back correctly.

Looking at the page source, each image link is immediately followed by pic_ext="bmp".

When crawling the .jpg images on http://www.sj33.cn/dphoto/stsy/200908/20652_4.html, the regular expression reg = r'src="(.+?\.jpg)"' returns the correct data without pic_ext.

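In principle my regex only needs to match strings that end in .jpg, so I cannot work out why the pattern for http://tieba.baidu.com/p/2460150866 must include 'pic_ext' before it returns the data: reg = r'src="(.+?\.jpg)" pic_ext'. What is the mechanism behind this?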

I hope someone who knows can clear this up.
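One possible explanation, sketched on a made-up snippet: in a Python regex, . matches any character except a newline, and that includes the double quote. So .+? starts at the first src=" on a line and keeps extending, across attributes and tags, until some .jpg" turns up. If an earlier src=" on the same line belongs to a script or another element, the captured text is not a usable URL and the download fails. Requiring pic_ext after the closing quote happens to rule those lines out. The HTML below is an assumption for illustration, not the real source of either page, and the stricter pattern with [^"] is a suggested alternative rather than the original author's method.

```python
import re

# Made-up HTML for illustration; NOT the real source of either page.
sample = (
    '<script src="/static/app.js"></script>'
    '<img class="icon" data-thumb="/static/icon.jpg">\n'
    '<img class="BDE_Image" src="http://example.com/photo/1.jpg" pic_ext="bmp">\n'
)

# Original loose pattern: "." also matches quotes, so the first match runs
# from the script's src=" all the way to the icon's .jpg" on the same line.
loose = re.findall(r'src="(.+?\.jpg)"', sample)

# Pattern with pic_ext: only an image tag followed by pic_ext can match.
anchored = re.findall(r'src="(.+?\.jpg)" pic_ext', sample)

# Stricter alternative: [^"] keeps the capture inside a single attribute value.
strict = re.findall(r'src="([^"]+?\.jpg)"', sample)

print(loose)
print(anchored)
print(strict)
```

On this sample the loose pattern returns two items, the first of which is a run of markup rather than a URL, while the other two patterns return only the image link. The [^"] version does not depend on pic_ext, so under the same assumption it should also work on pages such as the sj33.cn one.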