scrapy

Author: 082e63dc752b | Published 2019-05-18 10:38


Dependencies

1. wheel
pip install wheel
2. lxml
http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
3. PyOpenSSL
https://pypi.python.org/pypi/pyOpenSSL#downloads
4. Twisted
http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
5. Pywin32
https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/
6. Scrapy
pip install scrapy

Once wheel is installed, pip can install the .whl packages downloaded from the links above.
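
For example, a downloaded lxml wheel for 64-bit Python 3.7 would be installed like this (the filename below is hypothetical; use the one you actually downloaded):

    pip install lxml-4.2.5-cp37-cp37m-win_amd64.whl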

PS

If you see the following prompt:

    You are using pip version 18.1, however version 19.1.1 is available.
    You should consider upgrading via the 'python -m pip install --upgrade pip' command.

The list above was for the older setup. On Python 3.7, after upgrading to pip 19.1, almost everything can be installed directly with pip install:

1. wheel
pip install wheel
2. lxml
pip install lxml
3. PyOpenSSL
pip install PyOpenSSL
4. Twisted
http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
This one is still downloaded and installed manually: grab the cp37 build (for Python 3.7) and install it with pip, as in the wheel example above.
5. Pywin32
pip install Pywin32
6. Scrapy
pip install scrapy

After everything is installed, test it:

scrapy startproject hello
cd hello
scrapy genspider baidu www.baidu.com
scrapy crawl baidu
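
genspider writes a skeleton into hello/spiders/baidu.py, roughly like this (the exact template varies slightly between Scrapy versions):

    # -*- coding: utf-8 -*-
    import scrapy


    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        allowed_domains = ['www.baidu.com']
        start_urls = ['http://www.baidu.com/']

        def parse(self, response):
            pass

Note that new projects default to ROBOTSTXT_OBEY = True, so Baidu's robots.txt may keep this test crawl from actually fetching the page.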

You can also try Anaconda, the scientific-computing distribution, though in my case that route did not install successfully.

Scrapy provides an official practice site for spiders:

http://quotes.toscrape.com/

scrapy startproject hello
cd hello
scrapy genspider first quotes.toscrape.com
scrapy crawl first

Edit first.py:

def parse(self, response):
    quotes = response.css('.quote')
    for quote in quotes:
        # ::text selects the text node; extract_first() (note the parentheses)
        # returns the first match as a string
        text = quote.css('.text::text').extract_first()
        author = quote.css('.author::text').extract_first()
        tags = quote.css('.tags .tag::text').extract()

Scrapy also ships with an interactive shell; from a terminal run:

scrapy shell quotes.toscrape.com
[1]: response
[2]: response.css('.quote')
[3]: quote = response.css('.quote')
[4]: quote[0]
[5]: quote[0].css('.text').extract_first()
......

and you can explore the response interactively.
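
Adding the ::text pseudo-selector and calling extract_first() returns the quote text itself, along these lines (output abridged; the site's content may of course change):

    [6]: quote[0].css('.text::text').extract_first()
    '"The world as we have created it is a process of our thinking. ..."'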

Output

    # export as JSON
    scrapy crawl first -o quotes.json
    # JSON Lines: one JSON object per line
    scrapy crawl first -o quotes.jl
    # export as CSV
    scrapy crawl first -o quotes.csv

    scrapy crawl first -o quotes.xml

    scrapy crawl first -o quotes.pickle

    scrapy crawl first -o quotes.marshal
    # a remote FTP target also works
    scrapy crawl first -o ftp://user:pass@ip/path/quotes.csv
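
The same export can also be configured once in settings.py instead of being passed on the command line; with the feed-export settings of Scrapy 1.x this looks like:

    # settings.py equivalent of `-o quotes.json` (Scrapy 1.x feed exports)
    FEED_FORMAT = 'json'
    FEED_URI = 'quotes.json'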
Fleshing out the code:
    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = HelloItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

PS: in PyCharm, Alt+Enter is the quick-fix shortcut that adds a missing import (such as the HelloItem import) for you.

Adding pagination:

# -*- coding: utf-8 -*-
import scrapy

from hello.items import HelloItem


class FirstSpider(scrapy.Spider):
    name = 'first'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = HelloItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        # the last page has no ".next" link; without this guard,
        # urljoin(None) would raise an error
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
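
urljoin resolves the relative href from the pager against the current page's URL:

    >>> response.urljoin('/page/2/')
    'http://quotes.toscrape.com/page/2/'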

The Item definition:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class HelloItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
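
An Item behaves like a dict and converts cleanly with dict(item), which is what the MongoDB pipeline below relies on:

    item = HelloItem(text='sample', author='someone', tags=['tag1'])
    dict(item)   # {'author': 'someone', 'tags': ['tag1'], 'text': 'sample'}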

Pipelines

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from scrapy.exceptions import DropItem


class TextPipeline(object):
    def __init__(self):
        # truncate quote text beyond this length
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            # DropItem must be raised, not returned
            raise DropItem('Missing Text')

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # the key names must match the entries in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
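
A minimal sketch of the remaining pipeline methods, assuming pymongo and one collection named after the item class:

    def process_item(self, item, spider):
        # store the item as a plain dict; the collection is named
        # after the item class ("HelloItem")
        self.db[item.__class__.__name__].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

For any of this to run, both pipelines and the MongoDB connection details have to be registered in settings.py; the priority numbers and the database name 'hello' below are placeholders:

    ITEM_PIPELINES = {
        'hello.pipelines.TextPipeline': 300,
        'hello.pipelines.MongoPipeline': 400,
    }
    MONGO_URI = 'localhost'
    MONGO_DB = 'hello'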
