[Scrapy] A Simple Spider: Scraping Vulnerabilities from Anquanke (Part 1)

Author: 是Jonathan | Published: 2017-03-25 22:11 | Read 118 times


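0x01 Create the project
scrapy startproject YOUR_PROJECT_NAME

This creates the spider project. The key files it generates are:

• items.py: defines the model for the fields to be scraped.
• settings.py: holds settings such as the user agent and the crawl delay.
• spiders/: the directory that holds the actual spider code.
In addition, Scrapy uses scrapy.cfg for project configuration and pipelines.py to process the scraped fields; neither file needs to be changed for now.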
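
To illustrate the kind of values settings.py holds, the fragment below sets a user agent and a crawl delay. USER_AGENT, DOWNLOAD_DELAY and ROBOTSTXT_OBEY are standard Scrapy setting names; the values are placeholders chosen for this sketch and are not the author's configuration.

```python
# settings.py (fragment): illustrative values only, not the author's settings

# Identify the crawler to the server (placeholder string).
USER_AGENT = 'example-bot (+http://www.example.com)'

# Wait this many seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2

# Respect the rules in the site's robots.txt.
ROBOTSTXT_OBEY = True
```

A delay of a second or two is usually enough to keep a small crawl polite; tune it to the target site.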

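0x02 Define the model
By default, the generated items.py contains an item class (Exam123Item in this project) that declares the fields the spider will scrape.

Exam123Item is a template: replace its contents with the fields you want the spider to store at run time.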

0x03 Create the spider file

scrapy genspider SPIDER_NAME SPIDER_DOMAIN
For example: scrapy genspider bobao bobao.360.cn

0x03 Complete code example
Scraping vulnerabilities from 360 Bobao (bobao.360.cn)

The spider, bobao.py:

# -*- coding: utf-8 -*-
import scrapy


class BobaoSpider(scrapy.Spider):
    # A plain Spider is enough here: no link-following rules are defined, and
    # CrawlSpider reserves parse() for its own rule handling.
    name = 'bobao'
    allowed_domains = ['bobao.360.cn']
    start_urls = ['http://bobao.360.cn/vul/index']

    def parse(self, response):
        # Each <li> in the listing is one vulnerability entry.
        vuls = response.xpath('/html/body/div[2]/div[2]/div[2]/div[1]/div/div[3]/ul/li')

        for vul in vuls:
            yield {
                # The href is site-relative, so resolve it against the page URL.
                'url': response.urljoin(vul.xpath('.//div/div/a/@href').extract_first()),
                'title': vul.xpath('.//div/div/a/text()').extract_first(),
                # Not every entry carries a severity label; fall back to "null".
                'label_danger': vul.xpath('.//div/div/span/text()').extract_first(default='null'),
                'ori': vul.xpath('.//div/span[2]/text()').extract_first(),
            }
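
The spider needs two small pieces of logic for each entry: a fallback value when an XPath matches nothing, and an absolute URL built from a site-relative href. The sketch below shows both with the standard library only, so it can be tried without a crawl. The helper names first_or_default and absolute_url and the sample path are hypothetical. Scrapy offers the same behaviour through extract_first(default=...) on selectors and urljoin() on responses.

```python
from urllib.parse import urljoin


def first_or_default(values, default="null"):
    """Return the first element of values, or default when the list is empty."""
    return values[0] if values else default


def absolute_url(page_url, href):
    """Resolve a possibly relative href against the URL of the page it came from."""
    return urljoin(page_url, href)


# Hypothetical extraction results for one list entry.
print(first_or_default(["High"]))  # High
print(first_or_default([]))        # null
print(absolute_url("http://bobao.360.cn/vul/index", "/vul/detail/id/1.html"))
```

Resolving with urljoin rather than string concatenation also handles hrefs that are already absolute.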

items.py


import scrapy


class ExampleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    
    url = scrapy.Field()
    title = scrapy.Field()
    label_danger = scrapy.Field()
    ori = scrapy.Field()

[Figure: run result]

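0x04 Fetching content with the scrapy shell
Some sites' anti-scraping measures block requests that identify themselves as Scrapy. Overriding the user agent on the command line is one way around this restriction:
scrapy shell -s USER_AGENT='custom user agent' 'http://www.example.com'
Handling Chinese text

# Python 2: f is a byte string that holds the literal escape sequences
f='\u53eb\u6211'
print f                              # prints the raw \uXXXX escapes
print(f.decode('unicode-escape'))    # prints the decoded Chinese characters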
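
The snippet above relies on Python 2 string semantics. A rough Python 3 equivalent is sketched below, assuming the escaped text arrives as raw bytes (for example, read from a file). In Python 3 a plain '\u53eb\u6211' literal is already the decoded text, so the decode step is only needed for bytes.

```python
# Python 3 sketch: the escape sequences are held as raw bytes.
raw = b'\\u53eb\\u6211'

# The bytes contain literal backslashes, 12 bytes in total.
print(len(raw))

# Decoding with 'unicode_escape' turns each \uXXXX sequence into one character.
decoded = raw.decode('unicode_escape')
print(len(decoded))

# The result equals the ordinary Python 3 string literal.
assert decoded == '\u53eb\u6211'
```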

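Modify pipelines.py:

import json
import codecs

class ExamplePipeline(object):
    def __init__(self):
        # Python 2: open the output file for UTF-8 encoded text
        self.file = codecs.open('vul.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        # json.dumps escapes non-ASCII characters as \uXXXX by default ...
        line = json.dumps(dict(item)) + '\n'
        # ... so decode the escapes back into readable characters before writing
        self.file.write(line.decode("unicode_escape"))
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()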
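
The pipeline above is Python 2 code. Under Python 3, str has no decode method, and decoding with unicode_escape would also unescape sequences such as \n inside the JSON. A minimal Python 3 sketch of the same idea follows. It swaps in json.dumps(..., ensure_ascii=False) in place of the decode step. The class name JsonLinesPipeline and the path parameter are hypothetical additions for this sketch; open_spider, close_spider and process_item are the standard Scrapy pipeline hooks.

```python
import json


class JsonLinesPipeline:
    """Hypothetical Python 3 variant: writes one JSON object per line."""

    def __init__(self, path='vul.json'):
        # path is a parameter added for this sketch; with no arguments the
        # pipeline writes to vul.json, as in the original.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes; flushes and closes the file.
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII characters readable instead of \uXXXX.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```

Opening the file in open_spider rather than __init__ means the file is created only when a crawl actually runs.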

Then edit settings.py and uncomment the ITEM_PIPELINES setting.
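
Once uncommented, the setting would look roughly like the fragment below. The module path example.pipelines is an assumption inferred from the ExamplePipeline class name; substitute your own project's package. The number sets the run order among pipelines (lower values run first, conventionally in the 0 to 1000 range).

```python
# settings.py (fragment). 'example' is an assumed project package name.
ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
}
```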

[Figure: final result]


Article link: https://www.haomeiwen.com/subject/hithottx.html
