Hi everyone, I'm DannyWu. I started learning Python web scraping not long ago and have been looking online for fun scrapers to practice on; among them I found some that scrape photo galleries from mzitu.com. Those programs are well written, so I wanted to write my own version and learn from the process. I'm sharing it below; if you know a better way to implement it, feel free to discuss it in the comments.
My blog: DannyWu博客
WeChat official account: DannyWu博客
My GitHub: DannyWu
1. Installing the required libraries
'''
author:DannyWu
site:www.idannywu.com
'''
pip install requests
pip install bs4
pip install lxml
Note: os, pathlib, and multiprocessing are part of Python's standard library, so they do not need to be installed with pip; lxml is listed because BeautifulSoup is used with the 'lxml' parser in the code below.
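For reference, here is a sketch of the imports the functions in the following sections rely on (everything except requests and bs4 comes from the standard library):

import os
from pathlib import Path
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup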
2. Site analysis
First, open the mzitu.com homepage and click the "Latest" menu. On inspection, "Latest" is simply every gallery on the site sorted by publication date, and the listing pages follow the pattern mzitu.com/page/1, mzitu.com/page/2, and so on, with the trailing number increasing page by page. So once we crawl all the images under "Latest", the whole site is covered.
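A minimal sketch of building those listing-page URLs from the pattern above (page_num here is just an assumed example value):

base_url = 'http://www.mzitu.com/page/{}/'
page_num = 3  # assumed example: fetch the first three listing pages
listing_urls = [base_url.format(i + 1) for i in range(page_num)]
print(listing_urls)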
3. Building the request header
After stepping into a few pitfalls, I found that the request must carry a Referer header, or the image cannot be fetched. The header is built as follows.
def get_header(referer):
    header = {
        'cookie': 'Hm_lvt_dbc355aef238b6c32b43eacbbf161c3c=1536981553; Hm_lpvt_dbc355aef238b6c32b43eacbbf161c3c=1536986863',
        'referer': referer,
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
    }
    return header
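A quick usage check (the gallery URL below is a made-up placeholder, purely for illustration):

header = get_header('http://www.mzitu.com/12345')  # hypothetical gallery URL
resp = requests.get('http://www.mzitu.com/12345', headers=header)
print(resp.status_code)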
4. Downloading the images
def download_pics(pic_page_url):
    # Each image page is used as its own Referer (see section 3).
    header = get_header(pic_page_url)
    try:
        page_data = requests.get(pic_page_url, headers=header)
        soup_data = BeautifulSoup(page_data.text, 'lxml')
        img = soup_data.select('.main-image p a img')[0]
        img_link = img.get('src')
        img_alt = img.get('alt')
        print("img_link : ", img_link)
        pic_name = img_link.split('/')[-1]
        pic_save_path = "mzitu\\" + img_alt + "\\"
        path = Path(pic_save_path)
        if not path.exists():
            path.mkdir()
    except Exception as e:
        # Without a parsed image link there is nothing to download.
        print("Failed to parse image page:", e)
        return
    try:
        pic_data = requests.get(img_link, headers=header)
    except Exception as e:
        print("Failed to download image:", e)
        return
    if os.path.isfile(pic_save_path + pic_name):
        print("######## image already downloaded ########")
    else:
        with open(pic_save_path + pic_name, 'wb') as f:
            f.write(pic_data.content)
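A standalone test of download_pics (the image-page URL below is hypothetical, shown only to illustrate how the function is called):

download_pics('http://www.mzitu.com/12345/3')  # hypothetical image-page URL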
5. Getting every image-page link in a gallery
def get_pic_page_for_one_group(url):
    header = get_header(url)
    current_folder_path = os.getcwd()
    folder_path = str(current_folder_path) + '\\mzitu'
    path_is_exist = Path(folder_path)
    pages_link = []
    # Create the top-level "mzitu" folder if it does not exist yet.
    if not path_is_exist.exists():
        path_is_exist.mkdir()
    try:
        web_data = requests.get(url, headers=header)
        soup = BeautifulSoup(web_data.text, 'lxml')
        title = soup.select('.main-title')[0].text
        save_path = str(folder_path) + '\\' + title
        print("Saving this gallery to:", save_path)
        # The second-to-last entry of the page navigation holds the total page count.
        pages_total = int(soup.select('.pagenavi a span')[-2].text)
        print("Number of images in this gallery:", pages_total)
        # Each image page is the gallery URL followed by the page index.
        for i in range(pages_total):
            pages_link.append(url + str(i + 1))
    except Exception as e:
        print("Failed to parse gallery page:", e)
    return pages_link
6. Using multiprocessing to download every gallery on a listing page
def download_pics_for_one_page(url, header, pool_num):
    try:
        web_data = requests.get(url, headers=header).text
        soup = BeautifulSoup(web_data, 'lxml')
        # Each gallery on the listing page sits under the #pins element.
        pages_url = soup.select('#pins li span a')
        for page_url in pages_url:
            print('=============== starting download:', page_url.text + " ==============")
            print("gallery link:", page_url.get('href'))
            url_list = get_pic_page_for_one_group(page_url.get('href') + "/")
            # Hand this gallery's image pages to a pool of worker processes.
            pool = Pool(pool_num)
            pool.map(download_pics, url_list)
            pool.close()
            pool.join()
            print("====================== download finished ======================")
            print("")
    except Exception as e:
        print("Failed to process listing page:", e)
7. Downloading the whole site
if __name__ == '__main__':
    hello = (" |----------------------------------|\n"
             " | Headless multiprocess downloader |\n"
             " | Target site: mzitu.com           |\n"
             " | Author: DannyWu                  |\n"
             " |   (mydannywu@gmail.com)          |\n"
             " | Blog: www.idannywu.com           |\n"
             " | For personal study only, please  |\n"
             " | do not use it commercially.      |\n"
             " | If it infringes any rights,      |\n"
             " | contact me and it will be removed|\n"
             " |----------------------------------|")
    print(hello)
    page_num = int(input('How many listing pages to download: '))
    pool_num = int(input('How many worker processes to start: '))
    start_tip = " The downloader is starting... "
    print(start_tip)
    # Listing pages do not need a real referer, so a placeholder is enough.
    header = get_header("referer")
    try:
        base_url = 'http://www.mzitu.com/page/{}/'
        start = "################ page {} start ################"
        end = "################ page {} end ################"
        for i in range(page_num):
            print(start.format(i + 1))
            url = base_url.format(i + 1)
            download_pics_for_one_page(url, header, pool_num)
            print(end.format(i + 1))
    except Exception as e:
        print("Error:", e)
    print("")
    print("################## all downloads finished! ##################")
(Screenshot of the downloader running: 2018-09-19_212105.png)
And that's everything. The full source code is on my GitHub: DannyWu
Disclaimer: this project is just a small exercise from my own Python learning. Please do not use it for commercial purposes; I accept no responsibility for any such use. If it infringes on any rights, contact me and it will be removed promptly.
If you repost this article, please credit it with a link to the original; otherwise it will be treated as infringement.