scrapy

Author: 082e63dc752b | Published 2019-05-18 10:38


Dependencies

1. wheel
pip install wheel
2. lxml
http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
3. PyOpenSSL
https://pypi.python.org/pypi/pyOpenSSL#downloads
4. Twisted
http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
5. Pywin32
https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/
6. Scrapy
pip install scrapy

Once wheel is installed, pip can install the .whl packages downloaded from the links above.
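
For example, a downloaded lxml wheel for 64-bit Python 3.7 would be installed like this (the filename below is hypothetical; use the one you actually downloaded):

    pip install lxml-4.2.5-cp37-cp37m-win_amd64.whl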

PS

If you see the following prompt:

    You are using pip version 18.1, however version 19.1.1 is available.
    You should consider upgrading via the 'python -m pip install --upgrade pip' command.

The list above was for the older setup. On Python 3.7, after upgrading to pip 19.1, almost everything can be installed directly with pip install:

1. wheel
pip install wheel
2. lxml
pip install lxml
3. PyOpenSSL
pip install PyOpenSSL
4. Twisted
http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
This one is still downloaded and installed manually: grab the cp37 build (for Python 3.7) and install it with pip, as in the wheel example above.
5. Pywin32
pip install Pywin32
6. Scrapy
pip install scrapy

After everything is installed, test it:

scrapy startproject hello
cd hello
scrapy genspider baidu www.baidu.com
scrapy crawl baidu
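
genspider writes a skeleton into hello/spiders/baidu.py, roughly like this (the exact template varies slightly between Scrapy versions):

    # -*- coding: utf-8 -*-
    import scrapy


    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        allowed_domains = ['www.baidu.com']
        start_urls = ['http://www.baidu.com/']

        def parse(self, response):
            pass

Note that new projects default to ROBOTSTXT_OBEY = True, so Baidu's robots.txt may keep this test crawl from actually fetching the page.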

You can also try Anaconda, the scientific-computing distribution, though in my case that route did not install successfully.

Scrapy provides an official practice site for spiders:

http://quotes.toscrape.com/

scrapy startproject hello
cd hello
scrapy genspider first quotes.toscrape.com
scrapy crawl first

Edit first.py:

def parse(self, response):
    quotes = response.css('.quote')
    for quote in quotes:
        # ::text selects the text node; extract_first() (note the parentheses)
        # returns the first match as a string
        text = quote.css('.text::text').extract_first()
        author = quote.css('.author::text').extract_first()
        tags = quote.css('.tags .tag::text').extract()

Scrapy also ships with an interactive shell; from a terminal run:

scrapy shell quotes.toscrape.com
[1]: response
[2]: response.css('.quote')
[3]: quote = response.css('.quote')
[4]: quote[0]
[5]: quote[0].css('.text').extract_first()
......

and you can explore the response interactively.
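
Adding the ::text pseudo-selector and calling extract_first() returns the quote text itself, along these lines (output abridged; the site's content may of course change):

    [6]: quote[0].css('.text::text').extract_first()
    '"The world as we have created it is a process of our thinking. ..."'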

Output

    # export as JSON
    scrapy crawl first -o quotes.json
    # JSON Lines: one JSON object per line
    scrapy crawl first -o quotes.jl
    # export as CSV
    scrapy crawl first -o quotes.csv

    scrapy crawl first -o quotes.xml

    scrapy crawl first -o quotes.pickle

    scrapy crawl first -o quotes.marshal
    # a remote FTP target also works
    scrapy crawl first -o ftp://user:pass@ip/path/quotes.csv
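
The same export can also be configured once in settings.py instead of being passed on the command line; with the feed-export settings of Scrapy 1.x this looks like:

    # settings.py equivalent of `-o quotes.json` (Scrapy 1.x feed exports)
    FEED_FORMAT = 'json'
    FEED_URI = 'quotes.json'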
Fleshing out the code:
    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = HelloItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

PS: in PyCharm, Alt+Enter is the quick-fix shortcut that adds a missing import (such as the HelloItem import) for you.

Adding pagination:

# -*- coding: utf-8 -*-
import scrapy

from hello.items import HelloItem


class FirstSpider(scrapy.Spider):
    name = 'first'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = HelloItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        # the last page has no ".next" link; without this guard,
        # urljoin(None) would raise an error
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
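
urljoin resolves the relative href from the pager against the current page's URL:

    >>> response.urljoin('/page/2/')
    'http://quotes.toscrape.com/page/2/'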

The Item definition:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class HelloItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
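
An Item behaves like a dict and converts cleanly with dict(item), which is what the MongoDB pipeline below relies on:

    item = HelloItem(text='sample', author='someone', tags=['tag1'])
    dict(item)   # {'author': 'someone', 'tags': ['tag1'], 'text': 'sample'}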

Pipelines

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from scrapy.exceptions import DropItem


class TextPipeline(object):
    def __init__(self):
        # truncate quote text beyond this length
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            # DropItem must be raised, not returned
            raise DropItem('Missing Text')

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # the key names must match the entries in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
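
A minimal sketch of the remaining pipeline methods, assuming pymongo and one collection named after the item class:

    def process_item(self, item, spider):
        # store the item as a plain dict; the collection is named
        # after the item class ("HelloItem")
        self.db[item.__class__.__name__].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

For any of this to run, both pipelines and the MongoDB connection details have to be registered in settings.py; the priority numbers and the database name 'hello' below are placeholders:

    ITEM_PIPELINES = {
        'hello.pipelines.TextPipeline': 300,
        'hello.pipelines.MongoPipeline': 400,
    }
    MONGO_URI = 'localhost'
    MONGO_DB = 'hello'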
