Downloading HD images from 千图网 (58pic) with regular expressions and requests (non-member images only)
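The site's pages follow a consistent pattern, and it becomes obvious after browsing a few of them. The listing URL is simply
html_url = "http://www.58pic.com/tupian/jianshen-0-0-{}.html".format(n), with n increasing from page to page. The site makes no distinction between 03 and 3, so pagination can be handled by building the URLs directly: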
for n in range(0, 11):
    print("******** Downloading images from page %s ********" % (n+1))
    html_url = "http://www.58pic.com/tupian/jianshen-0-0-{}.html".format(n)
    PicThousand().run(html_url)
To find an image's URL, right-click the image and choose Inspect. Opening that URL in a new tab shows a preview of the image.
![](https://img.haomeiwen.com/i7415868/b703d98b5d0df227.png)
![](https://img.haomeiwen.com/i7415868/88ef2185560bd8d0.png)
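The URL does not end in a plain .jpg the way image URLs usually do, though. The extra suffix (!qt324 in the code below) presumably tells the server to return a processed version of the image, so removing it should give the high-resolution image we want.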
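However, the page returns error code 40310014. As some readers will already know, this means the request violated the site's anti-hotlinking rules, and the English message after the code says as much. Going back to inspect the headers of the image request, we find that the Referer is empty when the image is requested directly.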
![](https://img.haomeiwen.com/i7415868/b7ad6d6c1855b40b.png)
The simple fix is to assign the current URL to the Referer entry in headers before each download:
url = url_list[index].replace('!qt324', '')
self.getHtmlHeaders['Referer'] = url
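The two lines above can be pulled into a small helper, which makes the step easy to check without touching the network. This is only a sketch: the function name `prepare_image_request`, its `suffix` parameter, and the example.com URL are made up for illustration, and it assumes the thumbnail suffix is always `!qt324` as in the snippet above.

```python
def prepare_image_request(thumb_url, base_headers, suffix="!qt324"):
    """Return (full_url, headers) for one image download, without doing any I/O."""
    # Drop the thumbnail suffix so the URL points at the full-size image.
    full_url = thumb_url.replace(suffix, "")
    # Copy the shared headers so each request carries its own Referer
    # and the original dict is left unchanged.
    headers = dict(base_headers)
    headers["Referer"] = full_url
    return full_url, headers


# Hypothetical URL, used only to show the transformation.
base = {"User-Agent": "Mozilla/5.0"}
url, headers = prepare_image_request("http://example.com/pics/12345.jpg!qt324", base)
print(url)
print(headers["Referer"])
```

Copying the headers instead of mutating a shared dict is a design choice of this sketch, not something the original code does. It keeps the Referer of one download from leaking into the next request.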
Fetching the page source and pulling the image URLs out of it with a regular expression is straightforward:
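def getHtml(self, url):
    # Pass headers by keyword; the second positional argument of requests.get is params.
    response = requests.get(url, headers=self.getHtmlHeaders).text
    return response
def getUrl(self, text):
    # The thumbnail URLs are stored in the data-original attribute.
    image_urls = re.compile('data-original="(.*?)"', re.S).findall(text)
    return image_urls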
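To see what the pattern captures, it can be run against a small piece of markup. The HTML fragment below is invented for illustration (the real page markup may differ), and `extract_image_urls` is a hypothetical stand-alone version of getUrl.

```python
import re

# Hypothetical fragment of a listing page; the real markup may differ.
SAMPLE_HTML = '''
<img class="lazy" data-original="http://example.com/pics/a.jpg!qt324" alt="a">
<img class="lazy" data-original="http://example.com/pics/b.jpg!qt324" alt="b">
'''

# Non-greedy group: capture everything up to the next double quote.
IMAGE_URL_PATTERN = re.compile(r'data-original="(.*?)"', re.S)


def extract_image_urls(html):
    """Return every data-original attribute value found in html, in page order."""
    return IMAGE_URL_PATTERN.findall(html)


urls = extract_image_urls(SAMPLE_HTML)
print(urls)
```

The non-greedy `(.*?)` matters here: a greedy `(.*)` combined with `re.S` would run to the last double quote in the document and return one long, useless match.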
Full code
#!/usr/bin/env python
# -*- coding:utf-8 -*-
'''
@author: maya
@software: Pycharm
@file: pic.py
@time: 2018/12/19 16:13
@desc:
'''
import os
import re
import time

import requests


class PicThousand():
    def __init__(self):
        self.getHtmlHeaders = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9,en-GB;q=0.8,en;q=0.7',
        }

    def getHtml(self, url):
        # Pass headers by keyword; the second positional argument of requests.get is params.
        response = requests.get(url, headers=self.getHtmlHeaders).text
        return response

    def getUrl(self, text):
        # The thumbnail URLs are stored in the data-original attribute.
        image_urls = re.compile('data-original="(.*?)"', re.S).findall(text)
        return image_urls

    def img_Download(self, url_list):
        # Make sure the output directory exists before writing any files.
        os.makedirs('img', exist_ok=True)
        for index in range(len(url_list)):
            # Strip the thumbnail suffix to get the full-size image URL.
            url = url_list[index].replace('!qt324', '')
            # Set the Referer so the request passes the anti-hotlinking check.
            self.getHtmlHeaders['Referer'] = url
            file_name = url.split('/')[-1]
            print("Downloading image %s: %s" % (index + 1, file_name))
            response = requests.get(url, headers=self.getHtmlHeaders)
            with open('img/' + file_name, 'wb') as f:
                f.write(response.content)

    def run(self, url):
        text = self.getHtml(url)
        url_list = self.getUrl(text)
        self.img_Download(url_list)


if __name__ == '__main__':
    start = time.time()
    for n in range(0, 11):
        print("******** Downloading images from page %s ********" % (n+1))
        html_url = "http://www.58pic.com/tupian/jianshen-0-0-{}.html".format(n)
        PicThousand().run(html_url)
    end = time.time()
    print("******** Download finished, total time %.2f ********" % (end-start))
- For more crawler code, see GitHub