Crawler Bonus, Part 2: Batch-Downloading Photos from 妹子图网 (mzitu)

Author: Python芸芸 | Published: 2020-03-26 18:18

After reading this article you will hopefully be keen to learn web scraping. Good luck with your studies!

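Target site: 妹子图网 (mzitu.com)

Environment: Python 3.x

Third-party modules: requests, beautifulsoup4

Note: to test, just set the variable path in the code to the directory on your system where the images should be saved, then run the script with python xxx.py or from an IDE.

The complete source code: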

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import os
 
all_url = 'https://www.mzitu.com'
 
# HTTP request headers for ordinary page requests
Hostreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://www.mzitu.com'
}
# This Referer gets image requests past the site's hotlink protection
Picreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://i.meizitu.net'
}
 
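# Request the mzitu home page (all_url) and keep the returned HTML for parsing
start_html = requests.get(all_url, headers=Hostreferer)

# Save location on Linux
# path = '/home/Nick/Desktop/mzitu/'

# Save location on Windows
path = 'E:/mzitu/'

# Get the maximum page number; this assumes the last pagination link is
# "next page", so the second-to-last link holds the final page number
soup = BeautifulSoup(start_html.text, "html.parser")
page = soup.find_all('a', class_='page-numbers')
max_page = page[-2].text


# same_url = 'http://www.mzitu.com/page/'   # the home page lists the newest galleries by default
# Base URL of the category to scrape
same_url = 'https://www.mzitu.com/mm/page/'     # another category, e.g. the qingchun series, can be used instead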
 
for n in range(1, int(max_page) + 1):
    # Build the URL of page n in this category
    ul = same_url + str(n)

    # Request this listing page (first level)
    start_html = requests.get(ul, headers=Hostreferer)

    # Collect the links to every gallery on the page
    soup = BeautifulSoup(start_html.text, "html.parser")
    postlist = soup.find('div', class_='postlist')
    if postlist is None:
        # max_page came from the home page, so the category may have fewer pages
        break
    all_a = postlist.find_all('a', target='_blank')

    # Loop over the gallery links
    for a in all_a:
        # Use the link text (the gallery title) as the folder name
        title = a.get_text()
        if title != '':
            print("Preparing to scrape: " + title)

            # Windows cannot create a directory whose name contains '?', so strip it
            folder = path + title.strip().replace('?', '')
            if os.path.exists(folder):
                # print('Directory already exists')
                flag = 1
            else:
                os.makedirs(folder)
                flag = 0
            # Switch into the gallery's directory
            os.chdir(folder)

            # Take the gallery URL from the link and request it (second level)
            href = a['href']
            html = requests.get(href, headers=Hostreferer)
            mess = BeautifulSoup(html.text, "html.parser")

            # Get the number of pages in the gallery (one image per page)
            pic_max = mess.find_all('span')
            pic_max = pic_max[9].text
            if flag == 1 and len(os.listdir(folder)) >= int(pic_max):
                print('Already fully saved, skipping')
                continue
 
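            # Visit each page of the gallery
            for num in range(1, int(pic_max) + 1):
                # Build the URL of page num
                pic = href + '/' + str(num)

                # Request the page and locate the image whose alt text matches the title
                html = requests.get(pic, headers=Hostreferer)
                mess = BeautifulSoup(html.text, "html.parser")
                pic_url = mess.find('img', alt=title)
                if pic_url is None:
                    # No matching <img> on this page; skip it rather than crash
                    continue
                print(pic_url['src'])
                html = requests.get(pic_url['src'], headers=Picreferer)

                # Use the last URL segment as the file name
                file_name = pic_url['src'].split(r'/')[-1]

                # Save the image (binary mode, written into the current directory)
                with open(file_name, 'wb') as f:
                    f.write(html.content)
            print('Done')
    print('Page', n, 'done')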
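
The script above only strips '?' from gallery titles, but Windows rejects several other characters in directory names as well. Below is a minimal sketch of a more general cleanup. The name safe_dirname is a hypothetical helper, not part of the original script.

```python
import re

# Characters Windows does not allow in file or directory names
_INVALID = re.compile(r'[\\/:*?"<>|]')


def safe_dirname(title):
    """Return title with Windows-invalid characters removed (hypothetical helper)."""
    cleaned = _INVALID.sub('', title).strip()
    # Windows also rejects names that end with a dot or a space
    return cleaned.rstrip('. ')


print(safe_dirname(' What? A "title": part 1/2 '))
```

In the script, path + safe_dirname(title) could take the place of path + title.strip().replace('?', '').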

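Step-by-step analysis (for anyone interested):

1. Fetch the page source

Open the mzitu site and press F12 in the browser to see the page's requests and source code.

The code for this step: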


#coding=utf-8
 
import requests
 
url = 'http://www.mzitu.com'
 
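# Set headers: the site uses them to identify your browser and OS, and many sites refuse requests that lack them
header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}

# Send a GET request to url together with the headers
html = requests.get(url,headers = header)

# Print the result; .text is the response body as text, i.e. the page source
print(html.text)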

If nothing goes wrong, the response looks something like the following. This is the page's source code.

<html>
<body>
 
......
 
        $("#index_banner_load").find("div").appendTo("#index_banner");
        $("#index_banner").css("height", 90);
        $("#index_banner_load").remove();
});
</script>
</body>
</html>
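
Every request in this article passes the same headers by hand. As an optional sketch, a requests.Session can carry them for all requests instead. The name make_session is a hypothetical helper, and the Referer value is the one used in the complete script above. Nothing is sent over the network here: the request is only prepared so that its headers can be inspected.

```python
import requests

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36')


def make_session(referer='http://www.mzitu.com'):
    """Build a Session whose default headers go out with every request."""
    session = requests.Session()
    session.headers.update({'User-Agent': USER_AGENT, 'Referer': referer})
    return session


session = make_session()
# Prepare (but do not send) a request to see which headers it would carry
prepared = session.prepare_request(requests.Request('GET', 'http://www.mzitu.com'))
print(prepared.headers['User-Agent'])
print(prepared.headers['Referer'])
```

With such a session, session.get(url) would replace requests.get(url, headers=header) in the steps that follow.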

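2. Extract the information we need

Convert the fetched source into a BeautifulSoup object.

Use find to search for the data we want and store the results in a container.

The code for this step: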

#coding=utf-8
 
import requests
from bs4 import BeautifulSoup
 
url = 'http://www.mzitu.com'
header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
 
html = requests.get(url,headers = header)
 
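# Parse with the built-in html.parser: slower, but works everywhere without extra installs
soup = BeautifulSoup(html.text,'html.parser')

# What we want are all the a tags inside the first div with class = 'postlist'
all_a = soup.find('div',class_='postlist').find_all('a',target='_blank')

for a in all_a:
    title = a.get_text() # extract the link text
    print(title)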

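This gives us the titles of all galleries on the current page.

Note: BeautifulSoup() returns an object of type <class 'bs4.BeautifulSoup'>
find() returns <class 'bs4.element.Tag'> (or None when nothing matches)
find_all() returns <class 'bs4.element.ResultSet'>
A <class 'bs4.element.ResultSet'> cannot itself be used for further find/find_all calls; iterate over it and call them on each Tag.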
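
The note above can be checked with a small experiment. The HTML string below is made-up markup standing in for the listing page, not the real site's source.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the listing page (made-up markup, not the real site)
html = '''
<div class="postlist">
  <a href="/1" target="_blank">first</a>
  <a href="/2" target="_blank">second</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='postlist')      # a single Tag
links = div.find_all('a', target='_blank')     # a ResultSet (list-like)

print(type(div).__name__, type(links).__name__)

# Calling find_all on the ResultSet itself fails...
try:
    links.find_all('a')
except AttributeError as exc:
    print('AttributeError:', exc)

# ...so iterate and work with each Tag instead
titles = [a.get_text() for a in links]
print(titles)
```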

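3. Enter the second-level page and download

After clicking into a gallery, we find that each page shows a single image, so we need to know the total number of pages. For example, http://www.mzitu.com/26685 is the first page of a gallery, and later pages append / and a number: http://www.mzitu.com/26685/2 is page two. So all we need is the total page count, and then a loop can build every page URL.

The code for this step: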


#coding=utf-8
 
import requests
from bs4 import BeautifulSoup
 
url = 'http://www.mzitu.com/26685'
header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
 
html = requests.get(url,headers = header)
soup = BeautifulSoup(html.text,'html.parser')
 
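# The page count is read from the span at index 10; this is position-based and depends
# on the page layout (the complete script above uses index 9)
pic_max = soup.find_all('span')[10].text
print(pic_max)

# Print the URL of each image page
for i in range(1,int(pic_max) + 1):
    href = url+'/'+str(i)
    print(href)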
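
Reading the page count from a fixed span index is fragile: the complete script uses index 9 while this step uses index 10. Here is a sketch of a position-independent alternative, assuming the texts of the pagination elements (and only those) have already been collected into a list. The name max_page_number and the sample labels are illustrative and not taken from the site.

```python
def max_page_number(labels):
    """Return the largest purely numeric label, or None if there is none."""
    numbers = [int(text) for text in (label.strip() for label in labels)
               if text.isdecimal()]
    return max(numbers) if numbers else None


# Made-up pagination labels: previous/next links, an ellipsis, and page numbers
labels = ['« prev', '1', '2', '3', '…', '46', 'next »']
pic_max = max_page_number(labels)
print(pic_max)

# Build each page URL the same way as the loop above
url = 'http://www.mzitu.com/26685'
pages = [url + '/' + str(i) for i in range(1, pic_max + 1)]
print(pages[0], pages[-1])
```

If other numbers appear among the labels (view counts, dates), this would pick the wrong value, so the list should come from the pagination block alone.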

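Next we find the image address and save the image. Right-click the image and choose Inspect to find an element like this:

<img src="https://i5.meizitu.net/2019/07/01b56.jpg" alt="xxxxxxxxxxxxxxxxxxxxxxxxx" width="728" height="485">

The src attribute above is the actual address of the image; we just need to save it.

The code for this step: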

#coding=utf-8
 
import requests
from bs4 import BeautifulSoup
 
url = 'http://www.mzitu.com/26685'
header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
 
html = requests.get(url,headers = header)
soup = BeautifulSoup(html.text,'html.parser')
 
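# The page count is read from the span at index 10 (position-based; depends on the page layout)
pic_max = soup.find_all('span')[10].text

# Find the title
title = soup.find('h2',class_='main-title').text

# Visit each image page
for i in range(1,int(pic_max) + 1):
    href = url+'/'+str(i)
    html = requests.get(href,headers = header)
    mess = BeautifulSoup(html.text,"html.parser")


    # The image is the img tag whose alt attribute equals the title
    pic_url = mess.find('img',alt = title)

    # As in the complete script, send the Referer that gets past hotlink protection
    pic_header = dict(header, Referer='http://i.meizitu.net')
    html = requests.get(pic_url['src'],headers = pic_header)

    # Use the last URL segment as the file name
    file_name = pic_url['src'].split(r'/')[-1]

    # An image is not a text file, so write it in binary mode using html.content
    with open(file_name,'wb') as f:
        f.write(html.content)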
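
Taking the last '/'-separated piece of the URL works for the address shown earlier, but it would keep any query string as part of the file name. This small sketch uses the standard library to derive the name from the URL path only. The name filename_from_url is a hypothetical helper, and the second URL is a made-up variant for illustration.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_url(url, default='image.jpg'):
    """Return the last path segment of url, ignoring query and fragment."""
    path = unquote(urlsplit(url).path)
    name = posixpath.basename(path)
    return name or default


print(filename_from_url('https://i5.meizitu.net/2019/07/01b56.jpg'))
print(filename_from_url('https://i5.meizitu.net/2019/07/01b56.jpg?size=large'))
```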
