Scrapy

Author: 安于然 | Published 2015-12-05 15:58

0. Basics:

1) An introduction to search-engine crawlers --> incremental crawlers and distributed crawlers

http://www.zouxiaoyang.com/archives/386.html

http://docs.pythontab.com/scrapy/scrapy0.24/intro/overview.html


scrapy crawl -s LOG_FILE=./logs/liter.log -s MONGODB_COLLECTION=literature literatureSpider

#http://doc.scrapy.org/en/latest/topics/jobs.html

scrapy crawl douban8590Spider -s JOBDIR=crawls/douban8590Spider -s MONGODB_DB=douban -s MONGODB_COLLECTION=book8590
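
Note: -s overrides any setting from the command line. JOBDIR in particular persists the scheduler queue and the duplicate filter on disk, so after stopping the crawl gracefully (a single Ctrl-C) re-running the exact same command resumes where it left off; see the jobs doc linked above.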

1. Run your spider with the -a option, like:

scrapy crawl myspider -a filename=text.txt

Then read the file in the __init__ method of the spider and define start_urls:

from scrapy.spider import BaseSpider

class MySpider(BaseSpider):

    name = 'myspider'

    def __init__(self, filename=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if filename:
            # One URL per line; strip newlines so Scrapy gets clean URLs
            with open(filename, 'r') as f:
                self.start_urls = [line.strip() for line in f if line.strip()]
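
(BaseSpider is the old class name this snippet was written against; on current Scrapy the same pattern works unchanged with scrapy.Spider.)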

2. Scrapy can be made not to shut down automatically after the crawl finishes, via settings. How?
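
One answer, as a minimal sketch (not from the original notes): there is no single setting for this; the usual mechanism is the spider_idle signal. Raising DontCloseSpider from a handler keeps the spider alive once its request queue drains. The spider name and URL below are placeholders:

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class KeepAliveSpider(scrapy.Spider):

    name = 'keepalive'                    # placeholder name
    start_urls = ['http://example.com']   # placeholder URL

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(KeepAliveSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Get notified whenever the engine runs out of requests to process
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def spider_idle(self, spider):
        # Raising DontCloseSpider vetoes the automatic shutdown; schedule
        # any new requests here before raising, if there is more work
        raise DontCloseSpider

    def parse(self, response):
        pass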

3. Problems that frequently come up with the Kuaidaili (快代理) SVIP proxy service (a retry-settings sketch follows this list):

- TCP connection timed out: 60: Operation timed out.

- Connection was refused by other side: 61: Connection refused.

- An error occurred while connecting: 65: No route to host.

- 504 Gateway Time-out

- 404 Not Found

- 501 Not Implemented
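
One hedged way to soften these failures in settings.py (a sketch; the retry count is an assumption, and retrying 404/501 only makes sense behind a rotating proxy, where such codes are often proxy-induced rather than real). The connection-level errors in the list are already retried by Scrapy's built-in RetryMiddleware:

RETRY_ENABLED = True
RETRY_TIMES = 5                                       # assumption: tune per proxy quality
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 404, 501]  # defaults plus the codes seen above
DOWNLOAD_TIMEOUT = 30                                 # fail faster than the 60 s TCP timeout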

4. AttributeError: 'Response' object has no attribute 'body_as_unicode'

This happens mainly because the site's response headers lack a Content-Type field, so Scrapy cannot tell what kind of page it fetched and falls back to a plain Response object, which has no body_as_unicode. The fix is simple:

just rewrite the parse method slightly:

from scrapy.http import Request
from scrapy.selector import Selector

def parse(self, response):
    # Build a Selector from the raw body directly, sidestepping Scrapy's
    # Content-Type-based response class detection
    hxs = Selector(text=response.body)
    detail_url_list = hxs.xpath('//li[@class="good-list"]/@href').extract()
    for url in detail_url_list:
        if 'goods' in url:
            yield Request(url, callback=self.parse_detail)

# This snippet comes from: http://www.sharejs.com/codes/python/9049

5. Speed up web scraper

Here's a collection of things to try:

- use the latest Scrapy version (if not already)

- check whether non-standard middlewares are used

- try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs; see the settings sketch after this list)

- turn off logging: LOG_ENABLED = False (docs)

- try yielding items in a loop instead of collecting them into a list and returning it

- use a local caching DNS (see this thread)

- check whether the site uses a download threshold that limits your download speed (see this thread)

- log CPU and memory usage during the spider run and see if there are any problems there

- try running the same spider under the scrapyd service

- see if grequests + lxml performs better (ask if you need any help implementing this)

- try running Scrapy on PyPy; see Running Scrapy on PyPy
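
A hedged sketch pulling the tunable items above into settings.py; the numbers are assumptions to adjust against the target site and your bandwidth:

CONCURRENT_REQUESTS = 64              # default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 32   # default is 8
LOG_ENABLED = False                   # logging has measurable overhead at volume
DNSCACHE_ENABLED = True               # in-process DNS cache (on by default)
REACTOR_THREADPOOL_MAXSIZE = 20       # more threads for DNS resolution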

