MiniSpider: A Python 3 Crawler Tool

Author: ZhangYunHao | Published 2017-05-22 20:51

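    A Python 3 crawler tool: build your crawler with just 3 commands!

    1. Preface

    Mini-Spider is a practical crawler tool. Its purpose is to get you the resources you want quickly, without having to deal with crawler construction, data storage, network conditions, language implementation, and so on. With just a few simple commands you can create a crawler and finish your task!

    GitHub: Mini-Spider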

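    2. Introduction

    For most developers who are just starting out with Python, building a crawler is not easy. Generally, even a complete small crawler needs at least the following:

    • A correct extraction routine (regular expressions and Python libraries for parsing HTML pages)
    • Sensible error handling (crawlers regularly hit errors such as SSL verification failures, timeouts, and header errors)
    • Data persistence (you need to save the relevant files and the extracted links; for resource files you may also need a downloader)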
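    To make the three requirements above concrete, here is a minimal hand-written sketch using only the standard library. It is not Mini-Spider's code: the pattern, the function names, and the resource table are all assumptions chosen for illustration.

```python
import re
import sqlite3
import urllib.request

# Hypothetical pattern: absolute .jpg URLs anywhere in the page source.
JPG_PATTERN = re.compile(r"""https?://[^\s"'<>]+\.jpg""")


def extract_jpg_urls(html):
    """Extraction: return the unique .jpg URLs in first-seen order."""
    seen = []
    for url in JPG_PATTERN.findall(html):
        if url not in seen:
            seen.append(url)
    return seen


def fetch(url, timeout=2.0):
    """Error handling: return the page text, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError) as exc:
        # OSError covers URLError, SSL errors, and timeouts;
        # ValueError covers malformed URLs.
        print("skipped {}: {}".format(url, exc))
        return None


def save_urls(connection, urls):
    """Persistence: store each URL once in a SQLite table (assumed schema)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS resource (url TEXT PRIMARY KEY)"
    )
    connection.executemany(
        "INSERT OR IGNORE INTO resource (url) VALUES (?)",
        [(url,) for url in urls],
    )
    connection.commit()
```

    Even this stripped-down version needs a pattern tuned to the site, a decision about which errors to swallow, and a storage schema, which is the work the tool aims to take off your hands.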

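    With Mini-Spider you can ignore all of these chores.

    Unlike an ordinary crawler, Mini-Spider does not require you to write extraction rules, error handling, or data storage yourself.

    All you need to do is tell Mini-Spider what you want to extract and where those resources are!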

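    3. Installation

    Installing Mini-Spider takes a single command in the terminal (cmd.exe on Windows, Terminal on Linux and macOS).

    Note: Mini-Spider only supports Python 3.x. If you have not installed Python 3 yet, download it from the official Python website.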

    $ pip3 install mini-spider
    

    Then run the following command in the terminal to confirm the installation.

    $ mini-spider
    
    optional arguments:
      -h, --help            show this help message and exit
      -a [URL] [[URL] ...]  Analysis a URL.
      -st [float]           Set similarity_threshold,default = 0.6
      -c [num] [[num] ...]  Choose block make extractor.
      -time [float]         Set timeout.(default: 2)
      -to [{u,r}]           Choose match data.(default: u)
      -n [name]             Name your extractor.it can be ignored.
      -start [URL]          Start spider to get url and resource.
      -download [Path]      Download all url from database.
      -m [RE] [[RE] ...]    Make extractor by user.
      -export [FileName]    Export url from database.
      -import [FileName]    Import url into database.
      -list  [ ...]         List url in next_url or resource.options: "u" or "r"
      -classify []          Enable classification function in -download.
      -reset [{u,r}]        Reset database stats = 1.(default: u)
    

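    With that, you have successfully installed Mini-Spider.

    4. Usage

    Mini-Spider is not yet complete, but for simple needs it is often very efficient.

    For example, extracting the images from a forum, or finding certain files among all of an organization's notices, takes only a few commands with Mini-Spider.

    Here we use extracting images from Fengniao (fengniao.com) as an example of how to use Mini-Spider.

    The crawler in this example actually works on any Fengniao thread, because the format to be extracted is the same. This means crawlers created with Mini-Spider can often be reused, even though creating them takes only two commands.

    Example URL: http://bbs.fengniao.com/forum/9373824.html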

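    [Screenshot: the example thread]

    This is the example thread. Now let's extract all of its images with a few simple commands.

    1) Enter the following in the terminal

    The -a command analyzes the page and looks for html and jpg resources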

    $ mini-spider -a http://bbs.fengniao.com/forum/9373824.html jpg html
    

    This produces the following output

    [0]:
    ---(0)http://bbs.qn.img-space.com/g3/M00/02/3E/Cg-40lggN4GIcjIsAAbtzNgksRcAADfQwP38xwABu3k334.jpg
    ---(1)http://bbs.qn.img-space.com/g3/M00/02/3E/Cg-40lggN8KIC5XmAAPNyGCOY3kAADfRAL7-zgAA83g069.jpg
    ---(2)http://bbs.qn.img-space.com/g3/M00/02/3E/Cg-40lggNLuITlKGAAaIKdnN16YAADfPAIfryoABohB596.jpg
    ---(3)http://bbs.qn.img-space.com/g3/M00/02/3E/Cg-40lggNo2IdlT0AAiEl8en1M0AADfQQKQklQACISv087.jpg
    ---(4)http://bbs.qn.img-space.com/g3/M00/02/3E/Cg-77VggNSKISexrAAvLkt1pk-sAADfPgMOPsEAC8uq536.jpg
    ---(5)http://bbs.qn.img-space.com/g3/M00/02/3E/Cg-77VggNWmIVXTwAAP2OrqqXKIAADfPwG44UYAA_ZS536.jpg
    ---(6)http://bbs.qn.img-space.com/g3/M00/02/3E/Cg-77VggNemIXgi7AAX2dyHXWx0AADfQAEwyIAABfaP695.jpg
    ---(7)http://bbs.qn.img-space.com/g3/M00/02/3E/Cg-77VggNuyIAbvnAATIreqNbX8AADfQgHQjtcABMjF884.jpg
    ---(8)http://bbs.qn.img-space.com/g3/M00/02/3E/Cg-77VggNyaIfQoWAAmYVML4iy8AADfQwCfj8IACZhs124.jpg
    [1]:
    ---(0)http://icon.fengniao.com/forum/images/complain_close.jpg
    ---(1)http://icon.fengniao.com/index/2016/images/jiaoquan/qrcode-b.jpg
    ---(2)http://icon.fengniao.com/index/2016/images/jiaoquan/qrcode-bk.jpg
    ---(3)http://icon.fengniao.com/index/2016/images/jiaoquan/qrcode-s.jpg
    ---(4)http://icon.fengniao.com/index/images/qrcode-weixin.jpg
    [2]:
    ---(0)http://image3.fengniao.com/head/1185/80/1184655_0.jpg
    ---(1)http://image3.fengniao.com/head/129/80/128323_0.jpg
    ---(2)http://image3.fengniao.com/head/484/80/483356_0.jpg
    ---(3)http://image3.fengniao.com/head/7729/80/7728915_3.jpg
    ---(4)http://image3.fengniao.com/head/7814/80/7813945_6.jpg
    ---(5)http://image3.fengniao.com/head/7952/80/7951762_2.jpg
    ---(6)http://image3.fengniao.com/head/8022/80/8021124_1.jpg
    ---(7)http://image3.fengniao.com/head/8354/80/8353033_2.jpg
    ---(8)http://image3.fengniao.com/head/8364/80/8363606_9.jpg
    ---(9)http://image3.fengniao.com/head/8573/80/8572862_1.jpg
    ---(10)http://image3.fengniao.com/head/8606/80/8605542_0.jpg
    ---(11)http://image3.fengniao.com/head/8642/80/8641927_1.jpg
    ---(12)http://image3.fengniao.com/head/932/80/931748_12.jpg
    [3]:
    ---(0)http://img2.fengniao.com/290_module_images/240/592269008dae8.jpg
    ---(1)http://img2.fengniao.com/290_module_images/240/59226916a3c2e.jpg
    ---(2)http://img2.fengniao.com/290_module_images/240/59226935644e7.jpg
    ---(3)http://img2.fengniao.com/290_module_images/240/592269483791a.jpg
    ---(4)http://img2.fengniao.com/290_module_images/240/5922696b605cb.jpg
    ---(5)http://img2.fengniao.com/290_module_images/240/592269811b123.jpg
    [4]:
    ---(0)http://pic.fengniao.com/201705/fn3924bbs100060_tj-green_0518.jpg
    [5]:
    ---(0)http://test.svn.fengniao.com/frontend_svn/fengniao/common-pic/photo.jpg
    [6]:
    ---(0)http://2.fengniao.com/price/112-0-0-0-0-0-def-1_1.html
    ---(1)http://2.fengniao.com/price/114-1049-0-0-0-0-def-1_1.html
    ---(2)http://2.fengniao.com/price/114-1050-0-0-0-0-def-1_1.html
    ---(3)http://2.fengniao.com/price/115-0-0-0-0-0-def-1_1.html
    ---(4)http://2.fengniao.com/price/115-1041-0-0-0-0-def-1_1.html
    ---(5)http://2.fengniao.com/price/118-0-0-0-0-0-def-1_1.html
    [7]:
    ---(0)http://bbs.fengniao.com/forum/8964936.html
    ---(1)http://bbs.fengniao.com/forum/8965729.html
    ---(2)http://bbs.fengniao.com/forum/8968456.html
    ---(3)http://bbs.fengniao.com/forum/8975854.html
    ---(4)http://bbs.fengniao.com/forum/9373824.html
    ---(5)http://bbs.fengniao.com/forum/9606084.html
    ---(6)http://bbs.fengniao.com/forum/9606091.html
    ---(7)http://bbs.fengniao.com/forum/9606309.html
    ---(8)http://bbs.fengniao.com/forum/9606664.html
    ---(9)http://bbs.fengniao.com/forum/9606677.html
    ---(10)http://bbs.fengniao.com/forum/9606920.html
    ---(11)http://bbs.fengniao.com/forum/9606961.html
    ---(12)http://bbs.fengniao.com/forum/9606987.html
    ---(13)http://bbs.fengniao.com/forum/9607050.html
    ---(14)http://bbs.fengniao.com/forum/9607081.html
    ---(15)http://bbs.fengniao.com/forum/9607097.html
    ---(16)http://bbs.fengniao.com/forum/9607112.html
    ---(17)http://bbs.fengniao.com/forum/9607113.html
    ---(18)http://bbs.fengniao.com/forum/9607117.html
    ---(19)http://bbs.fengniao.com/forum/9607759.html
    ---(20)http://bbs.fengniao.com/forum/forum_101.html
    ---(21)http://bbs.fengniao.com/forum/forum_11.html
    ---(22)http://bbs.fengniao.com/forum/forum_115.html
    ---(23)http://bbs.fengniao.com/forum/forum_125.html
    ---(24)http://bbs.fengniao.com/forum/forum_16.html
    ---(25)http://bbs.fengniao.com/forum/forum_167.html
    ---(26)http://bbs.fengniao.com/forum/forum_168.html
    ---(27)http://bbs.fengniao.com/forum/forum_20.html
    ---(28)http://bbs.fengniao.com/forum/forum_23.html
    ---(29)http://bbs.fengniao.com/forum/forum_250.html
    ---(30)http://bbs.fengniao.com/forum/forum_27.html
    ---(31)http://bbs.fengniao.com/forum/forum_38.html
    ---(32)http://bbs.fengniao.com/forum/forum_75.html
    ---(33)http://bbs.fengniao.com/jinghua-16.html
    ---(34)http://m.fengniao.com/thread/9373824.html
    ---(35)http://pic.fengniao.com/201703/iframe_auto_11006.html
    [8]:
    ---(0)http://pic.fengniao.com/201704/iframe_auto_7757.html
    ---(1)http://print.fengniao.com/album/6.html
    ---(2)http://print.fengniao.com/album/7.html
    ---(3)http://product.fengniao.com/accessories.html
    ---(4)http://product.fengniao.com/camcorder.html
    ---(5)http://product.fengniao.com/camera.html
    [9]:
    ---(0)http://product.fengniao.com/digital_camera_index/subcate15_232_list_1.html
    ---(1)http://product.fengniao.com/filmcamera.html
    ---(2)http://product.fengniao.com/lens.html
    ---(3)http://product.fengniao.com/lens_index/subcate268_232_list_1.html
    ---(4)http://product.fengniao.com/others.html
    [10]:
    ---(0)http://www.fengniao.com/about.html
    ---(1)http://www.fengniao.com/appstore_chuping.html
    ---(2)http://www.fengniao.com/appstore_kacha.html
    ---(3)http://www.fengniao.com/appstore_meitu.html
    ---(4)http://www.fengniao.com/appstore_sheying.html
    ---(5)http://www.fengniao.com/contact.html
    ---(6)http://www.fengniao.com/copyright.html
    ---(7)http://www.fengniao.com/jiaoquan-apps.html
    ---(8)http://www.fengniao.com/law.html
    ---(9)http://www.fengniao.com/pe/0_222162.html
    ---(10)http://www.fengniao.com/pe/0_256396.html
    ---(11)http://www.fengniao.com/pe/0_265674.html
    ---(12)http://www.fengniao.com/pe/0_286239.html
    ---(13)http://www.fengniao.com/pe/0_363866.html
    ---(14)http://www.fengniao.com/pe/0_404279.html
    ---(15)http://www.fengniao.com/pe/0_512244.html
    ---(16)http://www.fengniao.com/pe/0_563110.html
    ---(17)http://www.fengniao.com/pe/0_703221.html
    ---(18)http://www.fengniao.com/pe/0_772471.html
    ---(19)http://www.fengniao.com/pe/0_846166.html
    ---(20)http://www.fengniao.com/shengming.html
    ---(21)http://www.fengniao.com/topic/5335828.html
    ---(22)http://www.fengniao.com/topic/5344513.html
    ---(23)http://www.fengniao.com/zhaopin.html
    [11]:
    ---(0)http://bbs.fengniao.com/forum/2879928.html
    ---(1)http://bbs.fengniao.com/forum/9373824.html
    ---(2)http://bbs.fengniao.com/forum/9373824_2.html
    ---(3)http://bbs.fengniao.com/forum/forum_11.html
    ---(4)http://bbs.fengniao.com/forum/forum_16.html
    ---(5)http://bbs.fengniao.com/forum/pic/slide_16_9373824_83214719.html
    ---(6)http://bbs.fengniao.com/forum/pic/slide_16_9373824_83214726.html
    ---(7)http://bbs.fengniao.com/forum/pic/slide_16_9373824_83214802.html
    ---(8)http://bbs.fengniao.com/forum/pic/slide_16_9373824_83214824.html
    ---(9)http://bbs.fengniao.com/forum/pic/slide_16_9373824_83214858.html
    [12]:
    ---(0)http://bbs.fengniao.com/forum/pic/slide_16_9373824_83214904.html
    ---(1)http://bbs.fengniao.com/forum/pic/slide_16_9373824_83214927.html
    ---(2)http://bbs.fengniao.com/forum/pic/slide_16_9373824_83214935.html
    ---(3)http://bbs.fengniao.com/forum/pic/slide_16_9373824_83214944.html
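
    The help text lists a similarity_threshold option (-st, default 0.6), which suggests that -a groups URLs by how alike they are. How Mini-Spider actually measures similarity is not stated in this article. The sketch below swaps in difflib.SequenceMatcher from the standard library as a stand-in, and group_by_similarity is a hypothetical name.

```python
from difflib import SequenceMatcher


def group_by_similarity(urls, threshold=0.6):
    """Put each URL into the first group whose first member is similar enough.

    Assumption: similarity is the SequenceMatcher ratio (0.0 to 1.0);
    the real tool may use a different measure.
    """
    groups = []
    for url in sorted(urls):
        for group in groups:
            if SequenceMatcher(None, group[0], url).ratio() >= threshold:
                group.append(url)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([url])
    return groups
```

    Under this reading, raising the threshold would split the output into more, tighter groups, and lowering it would merge groups.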
    

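    2) From the output above we can see that

    • All elements of group [0] are the moderator's images that we want to extract
    • Element (2) of group [11] is the address of the next page

    Now let's create the extractors for them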

    $ mini-spider -c 0 -to r
    http://bbs.qn.img-space.com/[a-z][0-9]/[A-Z][0-9][0-9]/[0-9][0-9]/[0-9][A-Z]/[A-Z][a-z]-\S*\.jpg
    The extractor was created successfully!
    

    -c 0 creates an extractor for all elements of group [0]

    -to r marks the data this extractor yields as resource data (that is, the images we want to extract)
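
    The pattern printed by -c 0 -to r is an ordinary regular expression, so you can check it by hand with Python's re module. In this sketch the pattern is copied from the output above, while find_resources and the sample HTML in the usage note are illustrative assumptions.

```python
import re

# Pattern copied verbatim from the `mini-spider -c 0 -to r` output.
RESOURCE_PATTERN = re.compile(
    r"http://bbs.qn.img-space.com/[a-z][0-9]/[A-Z][0-9][0-9]/[0-9][0-9]"
    r"/[0-9][A-Z]/[A-Z][a-z]-\S*\.jpg"
)


def find_resources(html):
    """Return every substring of `html` that matches the resource pattern."""
    # Note: \S* is greedy, so two URLs with no whitespace between them
    # would be returned as one match.
    return RESOURCE_PATTERN.findall(html)
```

    Given a page fragment containing one image from group [0] and one icon from group [1], separated by whitespace, find_resources returns only the group [0] URL, because the icon lives on a different host.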

    $ mini-spider -c 11 2 -to u
    Host:http://bbs.fengniao.com
    href="(/[a-z][a-z][a-z][a-z][a-z]/[0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9]\S*\.html)\S*"
    The extractor was created successfully!
    

    -c 11 2 creates an extractor for element (2) of group [11]

    -to u marks the data this extractor yields as the address of the next page (that is, the next crawl target)
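
    The URL extractor differs from the resource extractor in that it captures a relative path and reports a Host separately. A plausible way to combine the two, sketched here with urllib.parse.urljoin, is shown below. The pattern and host are copied from the output above; find_next_urls is a hypothetical helper, not part of Mini-Spider.

```python
import re
from urllib.parse import urljoin

# Host and pattern copied from the `mini-spider -c 11 2 -to u` output.
HOST = "http://bbs.fengniao.com"
NEXT_URL_PATTERN = re.compile(
    r'href="(/[a-z][a-z][a-z][a-z][a-z]'
    r'/[0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9]\S*\.html)\S*"'
)


def find_next_urls(html):
    """Return absolute URLs for every captured relative path in `html`."""
    return [urljoin(HOST, path) for path in NEXT_URL_PATTERN.findall(html)]
```

    Because the pattern requires an underscore and a digit after the seven-digit thread id, it matches follow-up pages such as /forum/9373824_2.html but not the first page /forum/9373824.html.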

    3) Start extracting

    The crawler is now complete!

    A single command starts the crawl

    $ mini-spider -start http://bbs.fengniao.com/forum/9373824.html
    url: 1/1||resource: 9/9
    url: 1/2||resource: 19/19
    url: 1/3||resource: 21/21
    url: 1/4||resource: 21/21
    url: 1/5||resource: 21/21
    url: 1/6||resource: 21/21
    url: 1/7||resource: 21/21
    url: 1/8||resource: 21/21
    url: 1/9||resource: 21/21
    url: 1/10||resource: 21/21
    url: 1/11||resource: 21/21
    url: 1/12||resource: 21/21
    url: 1/13||resource: 21/21
    url: 1/14||resource: 21/21
    url: 1/15||resource: 21/21
    url: 1/16||resource: 21/21
    url: 1/17||resource: 21/21
    url: 1/18||resource: 21/21
    url: 1/19||resource: 21/21
    url: 1/20||resource: 21/21
    url: 1/21||resource: 21/21
    url: 1/22||resource: 21/21
    url: 1/23||resource: 21/21
    url: 1/24||resource: 21/21
    url: 1/25||resource: 21/21
    url: 1/26||resource: 21/21
    url: 1/27||resource: 21/21
    url: 1/28||resource: 21/21
    url: 1/29||resource: 21/21
    url: 1/30||resource: 21/21
    url: 1/31||resource: 21/21
    url: 1/32||resource: 21/21
    url: 0/32||resource: 21/21
    

    -start http://bbs.fengniao.com/forum/9373824.html starts crawling from that URL

    As you can see, the crawler traversed every page of the thread and extracted 21 images
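
    Conceptually, a crawl like this alternates between the two extractors: one yields pages still to visit, the other yields resources to collect. The loop below is a minimal sketch of that idea, not Mini-Spider's implementation. The three callables (fetch, find_next_urls, find_resources) are assumed to be supplied by the caller.

```python
from collections import deque


def crawl(start_url, fetch, find_next_urls, find_resources):
    """Visit pages breadth-first, collecting resource URLs without duplicates.

    fetch(url) returns the page text or None on failure;
    find_next_urls(html) and find_resources(html) return lists of URLs.
    """
    pending = deque([start_url])
    seen_urls = {start_url}
    resources = []
    while pending:
        url = pending.popleft()
        html = fetch(url)
        if html is None:  # the request failed, so skip this page
            continue
        for resource in find_resources(html):
            if resource not in resources:
                resources.append(resource)
        for next_url in find_next_urls(html):
            if next_url not in seen_urls:
                seen_urls.add(next_url)
                pending.append(next_url)
    return sorted(seen_urls), resources
```

    Tracking seen_urls is what keeps the loop from revisiting a page when later pages link back to earlier ones.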

    4) Download

    Now let's download these images to our computer

    $ mini-spider -download hereresult
    resource:21/21
    Cg-40lggNLuITlKGAAaIKdnN16YAADfPAIfryoABohB596.jpg completed            
    resource:20/21
    Cg-77VggNWmIVXTwAAP2OrqqXKIAADfPwG44UYAA_ZS536.jpg completed            
    resource:19/21
    Cg-77VggNemIXgi7AAX2dyHXWx0AADfQAEwyIAABfaP695.jpg completed            
    resource:18/21
    Cg-40lggNo2IdlT0AAiEl8en1M0AADfQQKQklQACISv087.jpg completed            
    resource:17/21
    Cg-77VggNuyIAbvnAATIreqNbX8AADfQgHQjtcABMjF884.jpg completed            
    resource:16/21
    Cg-77VggNyaIfQoWAAmYVML4iy8AADfQwCfj8IACZhs124.jpg completed            
    resource:15/21
    Cg-40lggN4GIcjIsAAbtzNgksRcAADfQwP38xwABu3k334.jpg completed            
    resource:14/21
    Cg-40lggN8KIC5XmAAPNyGCOY3kAADfRAL7-zgAA83g069.jpg completed            
    resource:13/21
    Cg-77VggOAGIFmdBAAciD1FTywcAADfRQEVL1AAByIn767.jpg completed            
    resource:12/21
    Cg-40lggOEyIQ710AAj28Zlt3rIAADfRgInuygACPcJ793.jpg completed            
    resource:11/21
    Cg-77VggOKaID-4NABEBYKnrDZwAADfRwJ97EkAEQF4509.jpg completed            
    resource:10/21
    Cg-40lggONSIUt2rAAYtIZQkSw0AADfRwONsU8ABi05620.jpg completed            
    resource:9/21
    Cg-40lggOY-IBUB_AAgM8w9mNu4AADfSgHyDMAACA0L024.jpg completed            
    resource:8/21
    Cg-40lggOeCIUuB5AAJqcbeeULYAADfSwBQnw8AAmqJ230.jpg completed            
    resource:7/21
    Cg-77VggOjmIQMU1AAXFjcmbnZUAADfSwN_SygABcWl171.jpg completed            
    resource:6/21
    Cg-40lggOmqIQOMvAALMYvEfffwAADfTABus20AAsx6110.jpg completed            
    resource:5/21
    Cg-40lggOqKICQeBAAUQ49PRBCcAADfTAHsc9QABRD7959.jpg completed            
    resource:4/21
    Cg-77VggOzaIMFjWAALpHLe8s2UAADfTwKq-dkAAuk0821.jpg completed            
    resource:3/21
    Cg-40lggO2eIePbkAAHNtqlYkW4AADfUAMcWDsAAc3O540.jpg completed            
    resource:2/21
    Cg-77VggO4mIW-RQAAM1PHkRm2QAADfUAPhx_gAAzVU397.jpg completed            
    resource:1/21
    Cg-77VggNSKISexrAAvLkt1pk-sAADfPgMOPsEAC8uq536.jpg completed            
    resource:0/21
    

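    -download hereresult downloads everything at the resource addresses in the database into the result folder (here stands for the directory you run the command from; a single-level directory needs no slash or backslash, and for nested directories, for example on Unix, you can use hereresult/first)

    We can now see all the images in the result folder under the current path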
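
    The file names in the download log match the last path segment of each resource URL. A hand-written downloader that follows the same naming convention might look like the sketch below. Both function names are hypothetical, and the 2-second timeout simply mirrors the default shown for -time in the help text.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlsplit


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    return posixpath.basename(urlsplit(url).path)


def download(url, folder, timeout=2.0):
    """Save one resource into `folder`; return the path, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = response.read()
    except (OSError, ValueError) as exc:
        # Timeouts and connection errors are reported and skipped.
        print(exc)
        return None
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, filename_from_url(url))
    with open(target, "wb") as handle:
        handle.write(data)
    return target
```

    Returning None instead of raising lets a caller keep going after a failure and retry the skipped URLs later.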

    [Screenshot: the downloaded images in the result folder]

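    5. Reusing your crawler

    The crawler you created is really just two extractor files:

    [Screenshot: the two extractor files]

    As long as you have these two extractor files, you can use the crawler anytime, anywhere.

    For example, to extract any Fengniao thread, you can do this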

    $ mini-spider -start [URL of the thread you want]
    

    Then download them

    $ mini-spider -download hereresult
    

    (hereresult can be omitted. If you omit it, the files are downloaded into the current directory, which may clutter that directory with too many files.)
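
    If you reuse the extractors on many threads, the two commands can be scripted. The sketch below only assembles and runs the -start and -download commands shown in this article. It assumes that mini-spider is on your PATH and that the database simply accumulates entries across several -start runs, which this article does not confirm. build_commands and run_all are hypothetical names.

```python
import subprocess


def build_commands(thread_urls, target="hereresult"):
    """Return one -start command per thread, followed by a single -download."""
    commands = [["mini-spider", "-start", url] for url in thread_urls]
    commands.append(["mini-spider", "-download", target])
    return commands


def run_all(thread_urls, target="hereresult"):
    """Run the commands in order, stopping at the first one that fails."""
    for command in build_commands(thread_urls, target):
        subprocess.run(command, check=True)
```

    Passing each command as a list rather than a single string avoids shell quoting problems when a URL contains characters such as & or ?.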

    Now let's try extracting another thread

    http://bbs.fengniao.com/forum/9602611.html

    $ mini-spider -start http://bbs.fengniao.com/forum/9602611.html
    url: 1/33||resource: 5/26
    url: 1/34||resource: 5/26
    url: 0/34||resource: 5/26
    

    Then download

    $ mini-spider -download
    Cg-77VkbpZaIV5j0API0EM8exZ4AAIxIAGwaLMA8jQo240.jpg completed            
    resource:4/26
    Cg-40lkbpZ2AJZZ6AQbsOwJnKWA724.jpg completed                            
    resource:3/26
    Cg-40lkbpaSAYg7HATl3oor8G-o077.jpg completed                            
    resource:2/26
    timed out
    resource:2/26
    Cg-77Vkbpa-IYRpiAOWTf8_3oQEAAIxIAKinNsA5ZOX197.jpg completed            
    resource:1/26
    Cg-77VkbpauAHLUMARDERN2NFpI525.jpg completed                            
    resource:0/26
    

    Done!

    The rest is for you to explore.

    For more details, see the Mini-Spider GitHub repository, which includes a tutorial.


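      Reader comments

      • 无名小卒陶然: First comment. Could you tell me how to crawl the text and images on http://sxy.ncvt.net/xyxw1/xyxw.htm?
        ZhangYunHao: @无名小卒陶然 Extracting text currently requires writing your own regular expression, but this feature is being updated, and before long such extractors can be created automatically.
        ZhangYunHao: @无名小卒陶然 Just use Mini-Spider to analyze the file type you need, for example jpg, then look for the next extraction target, which is usually an html page. It should not differ much from the example.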
