scrapy--demo （3）

Author: 周周周__ | Published 2020-04-08 13:42

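scrapy startproject mytest  # create the project folder
cd mytest
scrapy genspider myczw czvv.com  # create a spider from the default template
These commands generate the basic project skeleton,
so we can start writing our own spider.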

1、./spiders/myczw.py

# -*- coding: utf-8 -*-
import scrapy
from ..items import CzwItem  # relative import from the project package; CzwItem is defined in section 3

class MyczwSpider(scrapy.Spider):
    name = 'myczw'
    allowed_domains = ['www.czvv.com']
    start_urls = [
        'http://www.czvv.com/320500/jiagong/',
    ]

    def parse(self, response):
        # Category pages contain a "分类筛选" (category filter) block
        fenlei = response.xpath('//div[contains(text(),"分类筛选")]')
        if fenlei:
            links = response.xpath('//ul[@id="catalogDIV"]//a[@class="vul"]/@href')
            for link in links:
                yield response.follow(link, self.parse)
        # Renamed from "next" so the builtin next() is not shadowed
        next_page = response.xpath('//*[@class="nextpage"]/@href').extract_first()
        if next_page and not fenlei:  # there is a next page and no category filter: keep paginating
            yield response.follow(next_page, self.parse)

        # Otherwise look for the company list on the page
        links = response.xpath('//*[@class="company-mesage"]/div[1]/a/@href')
        # There is a company list, no category filter, and no "很抱歉" (sorry) message
        if links and not fenlei and '很抱歉' not in response.text:
            for link in links:
                yield response.follow(link, self.parse)
        else:
            # Company detail pages contain an "企业简介" (company profile) block
            is_have = response.xpath('//*[@id="aboutbox"]/div[contains(text(),"企业简介")]')
            if is_have:
                url = response.url
                html = response.text
                item = {}
                item['url'] = url
                item['html'] = html
                # item = CzwItem(url=url, html=html)
                yield item

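A freshly generated spider inherits from scrapy.Spider:
name: the spider's name
allowed_domains: the domains the spider is allowed to crawl
start_urls: the URLs the crawl starts from
parse: the callback we override; it parses each downloaded page
Inside parse we handle two things: pages that need further parsing and data that needs to be stored.
When a page contains URLs that need further parsing, we use yield response.follow or yield scrapy.Request. The difference: response.follow accepts a relative href (or a selector) and resolves it against the current page's URL, while scrapy.Request needs an absolute URL, so we have to join it ourselves, for example with response.urljoin.
Data that should be saved is handed to the pipelines with yield item.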
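
How a relative href becomes an absolute URL can be tried without Scrapy. As far as I know, response.urljoin (and the resolution that response.follow does for us) builds on the standard library's urllib.parse.urljoin, so the sketch below uses that directly. The helper name resolve_links and the sample hrefs are made up for illustration and are not taken from the real site.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Turn hrefs found on a page into absolute URLs (hypothetical helper)."""
    return [urljoin(page_url, href) for href in hrefs]


page_url = "http://www.czvv.com/320500/jiagong/"
# Illustrative hrefs: root-relative, page-relative, and already absolute.
hrefs = ["/320500/jiagong/2/", "3/", "http://www.czvv.com/huangye/1.html"]

for absolute in resolve_links(page_url, hrefs):
    print(absolute)
```

With scrapy.Request this joining step is our job; with response.follow the raw href can be passed as-is.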

To store the data, we need to define how the pipeline handles each item.

2、pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo


class CzwPipeline(object):
    def __init__(self):
        host = '127.0.0.1'
        port = 27017
        dbname = 'chuanzhong'
        sheetname = 'test6'
        # Create the MongoDB connection
        self.client = pymongo.MongoClient(host=host, port=port)
        # Select the database
        mydb = self.client[dbname]
        # Collection that stores the scraped data
        self.post = mydb[sheetname]

    def open_spider(self, spider):  # called when the spider is opened
        print("Spider started:", spider.name)

    def process_item(self, item, spider):
        print(type(item))
        data = dict(item)
        # insert() was removed in PyMongo 4; insert_one() replaces it
        self.post.insert_one(data)
        return item

    def close_spider(self, spider):  # called when the spider is closed
        print("Spider finished")
        self.client.close()

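Here I use MongoDB for storage.
To enable the pipeline, we need to edit settings.py (the dotted path has to match your own project package, which is czw here):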

ITEM_PIPELINES = {  # uncomment this block
    'czw.pipelines.CzwPipeline': 300,
}
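
The pipeline above hard-codes the MongoDB host, database and collection. One possible variation, sketched here under assumptions, reads them from settings.py through Scrapy's from_crawler hook. The setting names MONGO_URI, MONGO_DATABASE and MONGO_COLLECTION, the class name, and the client_factory attribute (added only so the class can be exercised without a running MongoDB) are all hypothetical and not part of the original project.

```python
class MongoSettingsPipeline:
    """Sketch of a pipeline configured from settings.py (names are assumptions)."""

    def __init__(self, uri, db_name, collection_name):
        self.uri = uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client_factory = None  # hypothetical hook; None means "use pymongo"

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook (when present) to build the pipeline.
        settings = crawler.settings
        return cls(
            settings.get("MONGO_URI", "mongodb://127.0.0.1:27017"),
            settings.get("MONGO_DATABASE", "chuanzhong"),
            settings.get("MONGO_COLLECTION", "test6"),
        )

    def open_spider(self, spider):
        factory = self.client_factory
        if factory is None:
            import pymongo  # imported lazily so the class loads without pymongo

            factory = pymongo.MongoClient
        self.client = factory(self.uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
```

It would be registered in ITEM_PIPELINES the same way as CzwPipeline, with the three settings added to settings.py if the defaults do not fit.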

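Now every item yielded by the spider is passed to the pipeline for storage. Note, however, that items.py is not used yet: when we run the spider, the item type printed is still dict.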

3、Defining items.py

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class CzwItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    html = scrapy.Field()

With items.py defined, and CzwItem already imported in the spider above, the item can be built like this:

#item = {}
# item['url'] = url
#item['html'] = html
item = CzwItem(url=url, html=html)

Once the spider is modified this way, the printed item type is CzwItem, as intended.

4、CrawlSpider

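Scrapy is a tool that tries to do as much of the work for us as possible. scrapy.Spider is the most basic spider; we should also get to know CrawlSpider, which extracts and follows links for us automatically.
First, create the project:
scrapy startproject my_test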
cd my_test
scrapy genspider -t crawl mycz czvv.com

The newly generated spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import CzwItem  # needed by parse_item; assumes the items.py from section 3
class MyczSpider(CrawlSpider):
    name = 'mycz'
    allowed_domains = ['czvv.com']
    start_urls = ['http://www.czvv.com/320500/jiagong/']
    rules = (
        Rule(LinkExtractor(allow=r'.+/320500/jiagong/\d+'),  follow=True),
        Rule(LinkExtractor(allow=r'.+huangye/.+\.html'), callback='parse_item', follow=False),
    )
    def parse_item(self, response):
        is_have = response.xpath('//*[@id="aboutbox"]/div[contains(text(),"企业简介")]')
        if is_have:
            url = response.url
            html = response.text
            item = CzwItem(url=url, html=html)
            yield item

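rules holds the rules we define. A rule for links that only need to be crawled further has no callback; a rule with a callback passes each matching page to that method for parsing. follow controls whether links found on the matched pages are followed in turn.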
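
To check which URLs each rule would pick up, the two allow patterns can be tried with the re module alone. This sketch assumes that LinkExtractor applies allow patterns as a regex search over the absolute URL, and that a link matched by several rules is handled by the first one. The function name which_rule and the sample URLs are made up for illustration.

```python
import re

# The same patterns as in rules, in the same order.
LISTING = re.compile(r'.+/320500/jiagong/\d+')  # paginated listing pages
DETAIL = re.compile(r'.+huangye/.+\.html')      # company detail pages


def which_rule(url):
    """Return what the rules above would do with a URL (hypothetical helper)."""
    if LISTING.search(url):
        return "follow"      # first rule: no callback, follow=True
    if DETAIL.search(url):
        return "parse_item"  # second rule: callback='parse_item'
    return "ignored"         # no rule matches, the link is not extracted


# Illustrative URLs, not taken from the real site.
samples = [
    "http://www.czvv.com/320500/jiagong/2/",
    "http://www.czvv.com/huangye/12345.html",
    "http://www.czvv.com/about.html",
]
for url in samples:
    print(url, "->", which_rule(url))
```

Trying patterns this way before a crawl helps catch a rule that is too broad or that never matches.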

Article link: https://www.haomeiwen.com/subject/olqtmhtx.html
