A Simple Crawler

Author: A黄橙橙 | Published: 2018-08-19 21:43

Background

The Requests library

import requests

1. Sending a request
Sending a network request with Requests is very simple.

r = requests.get("https://www.jianshu.com/u/3307ac591285")

Here r is a Response object.

2. Passing URL parameters
(Not used here, and I don't understand it yet.)
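As a minimal sketch of what passing URL parameters means: Requests accepts a dict through the params argument and appends it to the URL as a query string. The /search path and the parameter names below are assumptions made up for illustration. The request is only prepared, not sent, so the resulting URL can be inspected offline.

```python
import requests

# Hypothetical query parameters and path, for illustration only.
params = {"page": 2, "q": "crawler"}

# Build the request without sending it, so the final URL can be inspected.
prepared = requests.Request(
    "GET", "https://www.jianshu.com/search", params=params
).prepare()

print(prepared.url)
```

To actually send it, the same dict can be passed as requests.get(url, params=params).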

3. Response content

r.text

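Requests automatically decodes content from the server; most Unicode character sets are decoded seamlessly.
After a request is sent, Requests makes an educated guess about the response encoding based on the HTTP headers. When you access r.text, Requests uses the encoding it guessed.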
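A rough illustration of why the guessed encoding matters, using only the standard library: r.text is essentially r.content decoded with r.encoding, so a wrong guess produces garbled text. The byte string below merely stands in for r.content and is an assumption for the example.

```python
# Stand-in for r.content: the raw bytes a server might send (assumed UTF-8 here).
body = "简单的Crawler".encode("utf-8")

# Decoding with the wrong charset yields garbled text instead of an error.
wrong = body.decode("iso-8859-1")

# Decoding with the right charset recovers the original string.
right = body.decode("utf-8")

print(repr(wrong))
print(right)
```

With a real response, if r.text looks garbled, setting r.encoding = 'utf-8' (or whatever the page really uses) before reading r.text is the usual remedy.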

4. Binary response content
For non-text requests, you can also access the response body as bytes.

r.content

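Requests automatically decodes gzip and deflate transfer-encoded response data for you.
For example, to create an image from the binary data returned by a request, you can use the following code.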

from PIL import Image
from io import BytesIO

i = Image.open(BytesIO(r.content))
# Display the image directly; the viewer opens it as a temporary .BMP file
i.show()
# Save the image in .jpg format
i.save('G:/1.jpg')

5. ...
(The rest isn't needed yet, so I'll skip it for now.)

The above is quoted from the Requests quickstart guide.

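The BeautifulSoup4 library

Besides its built-in HTML parser, BeautifulSoup also supports third-party parsers such as html5lib and lxml.
It usually handles two kinds of HTML input: pages fetched online and then parsed, and local files parsed directly.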

import requests
from bs4 import BeautifulSoup
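# Fetch the page with requests
html = requests.get("https://www.jianshu.com/u/3307ac591285")
# Name the parser explicitly; omitting it triggers a warning
soup = BeautifulSoup(html.text, 'html.parser')
# Parse a local file (assumes test.html is UTF-8 encoded)
with open('test.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')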

1. Traversing the document tree
(I haven't learned HTML yet and can't follow how the nodes are divided, so I won't paste this part.)
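For readers in the same position, here is a small sketch of what traversing the tree looks like. The HTML string is hand-written for illustration and is not taken from any real page. Each tag is a node, and attributes such as .parent and .children move between nodes.

```python
from bs4 import BeautifulSoup

# Small hand-written document, used only for illustration.
html_doc = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='story'>Hello <b>world</b></p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title               # first <title> tag in the document
parent_name = title_tag.parent.name  # name of the tag that contains it
paragraph = soup.body.p              # walk down: <body>, then its first <p>
children = list(paragraph.children)  # direct children: the text "Hello " and the <b> tag

print(title_tag.string, parent_name, len(children), paragraph.get_text())
```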
2. Searching the document tree
find_all(name, attrs, recursive, text, **kwargs)
1) The name parameter finds all tags whose name is name
Find all <b> tags
A. A string

soup.find_all('b')

B. A regular expression

import re
# Matches every tag whose name starts with "b", e.g. <body> and <b>
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)

C. A list (I don't fully understand this one; it matches any tag whose name appears in the list)

soup.find_all(['a', 'b'])

D. A function (beyond me for now)
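A short sketch of the list and function forms, on a hand-written HTML snippet (an assumption for illustration). A list matches a tag whose name equals any item in the list. A function receives each tag and should return True for the tags to keep. The helper name has_href_and_id is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written snippet, for illustration only.
html_doc = (
    '<p><a href="/1" id="link1">one</a>'
    '<a id="link2">two</a>'
    '<b class="x">bold</b></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# List form: every <a> tag and every <b> tag, in document order.
names = [tag.name for tag in soup.find_all(["a", "b"])]

def has_href_and_id(tag):
    """Hypothetical filter: keep only tags that have both an href and an id."""
    return tag.has_attr("href") and tag.has_attr("id")

# Function form: find_all calls the filter once per tag.
ids = [tag["id"] for tag in soup.find_all(has_href_and_id)]

print(names, ids)
```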

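2) Keyword arguments

# Find the tag whose id is link2
soup.find_all(id = 'link2')
# Pass an href argument to search each tag's 'href' attribute
soup.find_all(href = re.compile('elseid'))
# Several named arguments can be combined to filter on multiple attributes at once
soup.find_all(href = re.compile('elseid'),id = 'link2')
# Note that class is a reserved word in Python, so use class_
soup.find_all('a',class_ = 'sister')
# Some attribute names cannot be used as keyword arguments, such as HTML5 data-* attributes; pass a dict through the attrs parameter of find_all() to search for tags with such attributes
soup.find_all(attrs={'data-foo': 'value'})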

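3) The text parameter
The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, a list, or True.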
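A small sketch of searching by string content, on a hand-written snippet (an assumption for illustration). Note that since Beautiful Soup 4.4 this argument is called string, with text being the older name for the same thing, so the sketch uses string.

```python
import re
from bs4 import BeautifulSoup

# Hand-written snippet, for illustration only.
html_doc = "<p><a>Elsie</a>, <a>Lacie</a> and <a>Tillie</a></p>"
soup = BeautifulSoup(html_doc, "html.parser")

# Exact string: returns the matching text nodes, not the tags.
exact = [str(s) for s in soup.find_all(string="Elsie")]

# Regular expression: text nodes ending in "ie".
by_regex = [str(s) for s in soup.find_all(string=re.compile("ie$"))]

# List: text nodes equal to any item in the list.
by_list = [str(s) for s in soup.find_all(string=["Elsie", "Tillie"])]

# Combined with a tag name: returns the <a> tags whose text is "Lacie".
tags = soup.find_all("a", string="Lacie")

print(exact, by_regex, by_list, tags)
```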
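4) The limit parameter
The limit parameter works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and the results are returned.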

soup.find_all('a',limit = 2)

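5) The recursive parameter
When you call a tag's find_all() method, BeautifulSoup searches all of that tag's descendants. To search only the tag's direct children, pass recursive = False.

# Why add .html??? (soup.html is the <html> tag; with recursive = False only its direct children are searched)
soup.html.find_all('title', recursive = False)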

find( name , attrs , recursive , text , **kwargs )
The only difference from find_all() is that find_all() returns a list (containing a single element when there is one match), while find() returns the result itself.
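The difference can be seen on a tiny hand-written document (an assumption for illustration). It also shows what each method returns when nothing matches, which is worth knowing before indexing into the result.

```python
from bs4 import BeautifulSoup

# Hand-written document, for illustration only.
html_doc = "<html><head><title>Demo</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

all_titles = soup.find_all("title")   # a list with one <title> tag
first_title = soup.find("title")      # the <title> tag itself

missing_all = soup.find_all("table")  # empty list when nothing matches
missing_one = soup.find("table")      # None when nothing matches

print(all_titles, first_title, missing_all, missing_one)
```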

Quoted from dreams512

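4. CSS selectors
In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. Elements can be filtered in a similar way with soup.select(), which returns a list.
(I don't understand the examples either, so I won't paste them.)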
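A minimal sketch of the three selector forms just described, plus a child selector, on a hand-written snippet (an assumption for illustration; the class and id names are made up).

```python
from bs4 import BeautifulSoup

# Hand-written snippet, for illustration only.
html_doc = (
    '<p class="title"><b>Story</b></p>'
    '<p class="story">'
    '<a class="sister" id="link1" href="/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="/lacie">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

by_tag = soup.select("b")              # tag name, written bare
by_class = soup.select(".sister")      # class name, prefixed with a dot
by_id = soup.select("#link2")          # id, prefixed with #
children = soup.select("p.story > a")  # <a> tags directly inside <p class="story">

print(len(by_tag), len(by_class), by_id[0].get_text(), len(children))
```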

Code borrowed from someone else

Figure 1

Figure 2

This code still leaves me with five open questions [pretend this is in red]

#coding=utf-8
import requests
from bs4 import BeautifulSoup
import os
from multiprocessing import Pool

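# HTTP request headers
# These headers imitate a browser. Sites use them to identify your browser and operating system, and many sites refuse requests that lack them.
Hostreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://www.mzitu.com'
}
# This header gets around hotlink protection --------------①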
Picreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://i.meizitu.net'
}

all_url = 'http://www.mzitu.com'

# Save location
path = 'G:/mzitu/'

def Download(href,title):
    html = requests.get(href,headers = Hostreferer)
    soup = BeautifulSoup(html.text,'html.parser')
    pic_max = soup.find_all('span')
    # What does this 10 mean? ------------------------------④
    # Figure 1
    pic_max = pic_max[10].text  # maximum page number
    if(os.path.exists(path+title.strip().replace('?','')) and len(os.listdir(path+title.strip().replace('?',''))) >= int(pic_max)):
        print('Already complete, skipping '+title)
        return 1
    print("Start scraping: " + title)
    # exist_ok=True avoids FileExistsError when a partially downloaded folder already exists
    os.makedirs(path+title.strip().replace('?',''), exist_ok=True)
    os.chdir(path + title.strip().replace('?',''))

    for num in range(1,int(pic_max)+1):
        # Figure 2
        pic = href+'/'+str(num)

        html = requests.get(pic,headers = Hostreferer)

        mess = BeautifulSoup(html.text,"html.parser")

        pic_url = mess.find('img',alt = title)
        html = requests.get(pic_url['src'],headers = Picreferer)

        # What does the -1 mean? -------------------------------⑤
        file_name = pic_url['src'].split(r'/')[-1]

        f = open(file_name,'wb')
        f.write(html.content)
        f.close()
  #   print('Finished '+title)


if __name__=='__main__':
    start_html = requests.get(all_url, headers=Hostreferer)

    # Find the maximum page number
    soup = BeautifulSoup(start_html.text, "html.parser")
    page = soup.find_all('a', class_='page-numbers')
    #----------------------------②
    max_page = page[-2].text

    same_url = 'http://www.mzitu.com/page/'
    # Process pool with 15 workers (multiprocessing.Pool starts processes, not threads)
    pool = Pool(15)
    for n in range(1, int(max_page) + 1):
        # URL pattern found by comparing the URLs of different page numbers
        ul = same_url + str(n)
        # Visit each listing page
        start_html = requests.get(ul, headers=Hostreferer)
        soup = BeautifulSoup(start_html.text, "html.parser")
        all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')
        for a in all_a:
            print("Contents of a:")
            print(a)
            #<a href="http://www.mzitu.com/137726" target="_blank"><img alt="性感女神易阳elly无圣光美图 超级木瓜奶无比诱人" class="lazy" data-original="http://i.meizitu.net/thumbs/2018/06/137726_07a23_236.jpg" height="354" src="http://i.meizitu.net/pfiles/img/lazy.png" width="236"/></a>
            title = a.get_text()  # extract the text
            if (title != ''):
                print("get_text of a:")
                print(title)
                #性感女神易阳elly无圣光美图 超级木瓜奶无比诱人
                href = a['href']
                print('href:')
                print(href)
                #http://www.mzitu.com/137726
                # This is how the pool is given work, but this way of passing arguments -----------------------③
                pool.apply_async(Download,args=(href,title))
    pool.close()
    pool.join()
    print('All images have been downloaded')
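Three of the open questions (②, ③ and ⑤) come down to ordinary Python behaviour and can be sketched without touching the network. The pagination texts below are assumptions about what the site's page links might look like, and describe is a hypothetical stand-in for Download. ThreadPool is swapped in for Pool only so the snippet runs anywhere without the __main__ guard; it exposes the same apply_async call. Questions ① and ④ depend on the site itself: the Referer header appears to be what the image host checks, and index 10 is presumably just where the page-count span happened to sit in that page's list of span tags.

```python
from multiprocessing.pool import ThreadPool

# (2) Hypothetical link texts from a pagination bar; the last entry is the "next page" link,
# so the second-to-last one holds the highest page number.
page_texts = ["1", "2", "3", "...", "198", "Next"]
max_page = page_texts[-2]

# (5) split('/') cuts the URL at every slash; index -1 is the last piece, i.e. the file name.
src = "http://i.meizitu.net/2018/08/07b08.jpg"
file_name = src.split("/")[-1]

def describe(href, title):
    """Hypothetical stand-in for Download(href, title): just formats its two arguments."""
    return title + " -> " + href

# (3) apply_async takes the function and its positional arguments as one tuple via args=(...).
with ThreadPool(2) as pool:
    result = pool.apply_async(describe, args=("http://www.mzitu.com/137726", "demo"))
    text = result.get()  # wait for the worker and fetch the return value

print(max_page, file_name, text)
```

Calling result.get() also re-raises any exception from the worker, whereas the crawler above discards the result objects, so errors inside Download would pass silently.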

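Afterword:

After getting a rough understanding of this code, I tried to crawl images from pexels and failed gloriously.
While modifying the code, though, I found that sites differ in how hard they are to crawl. For pexels, in theory, once you have searched you only need to parse the current page to find the URLs where the images are stored.
At the time, Image.open(BytesIO(r.content)) could not open the downloaded image and raised OSError: cannot identify image file <_io.BytesIO object at 0x00000153910C5620>. The fixes I found on Baidu were wrong; later I experimented with several sites and found the problem.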

import requests
from PIL import Image
from io import BytesIO

if __name__=='__main__':
    right200_url = 'https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1535279402&di=6594fb2bd0922a9f0d304818509723ea&imgtype=jpg&er=1&src=http%3A%2F%2Fa.hiphotos.baidu.com%2Fimage%2Fpic%2Fitem%2F0824ab18972bd4077557733177899e510eb3096d.jpg'
    wrong503_url = 'https://images.pexels.com/photos/157967/portrait-woman-girl-blond-157967.jpeg'
    wrong403_url = 'http://i.meizitu.net/2018/08/07b08.jpg'
    right200_url2 = 'https://visualhunt.com/photos/1/portrait-of-young-woman-wearing-sunglasses-and-holding-wildflowers-in-meadow.jpg'
    r = requests.get(wrong503_url)
    print(r)
    i = Image.open(BytesIO(r.content))
    i.show()
    i.save('G:/111.jpeg','JPEG')

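It turned out the server may have refused the request. The variable names above record the status code each server returned (200, 503 or 403); when the status is not 200, r.content is not image data and Image.open fails.
Meaning of the status codes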
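One way to avoid the OSError is to check the response before handing it to PIL. The helper below is a hypothetical sketch, not part of Requests or PIL: it accepts a status code and a body, and only reports success for a 200 response whose bytes start with a JPEG or PNG signature.

```python
# File signatures ("magic bytes") that JPEG and PNG files start with.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def looks_like_image(status_code, body):
    """Hypothetical helper: True only for a 200 response whose body starts like a JPEG or PNG."""
    if status_code != 200:
        return False
    return body.startswith(JPEG_MAGIC) or body.startswith(PNG_MAGIC)

# Simulated responses: a refused request returns an HTML error page, not image bytes.
refused = looks_like_image(403, b"<html>403 Forbidden</html>")
accepted = looks_like_image(200, JPEG_MAGIC + b"\x00\x10JFIF")
html_with_200 = looks_like_image(200, b"<html>please log in</html>")

print(refused, accepted, html_with_200)
```

With a real response this would be called as looks_like_image(r.status_code, r.content) before Image.open(BytesIO(r.content)).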

An error that appeared after many requests

#requests.exceptions.SSLError: hostname 'requestb.in' doesn't match either of '*.herokuapp.com', 'herokuapp.com'
requests.get('https://requestb.in')

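This is an SSL certificate verification problem. SSL verification is enabled by default, and Requests raises SSLError when certificate verification fails.
It can be worked around by turning verification off, e.g. requests.get('https://github.com', verify=False).
I ran into this error too, but the cause seemed to be something else. I can no longer find it in my search history, so I'll fill in this gap the next time I hit it.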

Appendix
The official (cheeky) Chinese documentation for Requests
User-Agent strings of various browsers
