Scrapy

Author: 安于然 | Published 2015-12-05 15:58

0. Basics:

1) An introduction to search-engine crawlers --> incremental crawlers and distributed crawlers

http://www.zouxiaoyang.com/archives/386.html

http://docs.pythontab.com/scrapy/scrapy0.24/intro/overview.html


scrapy crawl -s LOG_FILE=./logs/liter.log -s MONGODB_COLLECTION=literature literatureSpider

#http://doc.scrapy.org/en/latest/topics/jobs.html

scrapy crawl douban8590Spider -s JOBDIR=crawls/douban8590Spider -s MONGODB_DB=douban -s MONGODB_COLLECTION=book8590
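
Note: -s overrides any setting from the command line. JOBDIR in particular persists the scheduler queue and the duplicate filter on disk, so after stopping the crawl gracefully (a single Ctrl-C) re-running the exact same command resumes where it left off; see the jobs doc linked above.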

1. Run your spider with the -a option, like:

scrapy crawl myspider -a filename=text.txt

Then read the file in the __init__ method of the spider and define start_urls:

from scrapy.spider import BaseSpider

class MySpider(BaseSpider):

    name = 'myspider'

    def __init__(self, filename=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        if filename:
            # One URL per line; strip newlines so Scrapy gets clean URLs
            with open(filename, 'r') as f:
                self.start_urls = [line.strip() for line in f if line.strip()]
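
(BaseSpider is the old class name this snippet was written against; on current Scrapy the same pattern works unchanged with scrapy.Spider.)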

2. Scrapy can be made not to shut down automatically after the crawl finishes, via settings. How?
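
One answer, as a minimal sketch (not from the original notes): there is no single setting for this; the usual mechanism is the spider_idle signal. Raising DontCloseSpider from a handler keeps the spider alive once its request queue drains. The spider name and URL below are placeholders:

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class KeepAliveSpider(scrapy.Spider):

    name = 'keepalive'                    # placeholder name
    start_urls = ['http://example.com']   # placeholder URL

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(KeepAliveSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Get notified whenever the engine runs out of requests to process
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def spider_idle(self, spider):
        # Raising DontCloseSpider vetoes the automatic shutdown; schedule
        # any new requests here before raising, if there is more work
        raise DontCloseSpider

    def parse(self, response):
        pass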

3. Problems that frequently come up with the Kuaidaili (快代理) SVIP proxy service (a retry-settings sketch follows this list):

- TCP connection timed out: 60: Operation timed out.

- Connection was refused by other side: 61: Connection refused.

- An error occurred while connecting: 65: No route to host.

- 504 Gateway Time-out

- 404 Not Found

- 501 Not Implemented
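
One hedged way to soften these failures in settings.py (a sketch; the retry count is an assumption, and retrying 404/501 only makes sense behind a rotating proxy, where such codes are often proxy-induced rather than real). The connection-level errors in the list are already retried by Scrapy's built-in RetryMiddleware:

RETRY_ENABLED = True
RETRY_TIMES = 5                                       # assumption: tune per proxy quality
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 404, 501]  # defaults plus the codes seen above
DOWNLOAD_TIMEOUT = 30                                 # fail faster than the 60 s TCP timeout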

4. AttributeError: 'Response' object has no attribute 'body_as_unicode'

This happens mainly because the site's response headers lack a Content-Type field, so Scrapy cannot tell what kind of page it fetched and falls back to a plain Response object, which has no body_as_unicode. The fix is simple:

just rewrite the parse method slightly:

from scrapy.http import Request
from scrapy.selector import Selector

def parse(self, response):
    # Build a Selector from the raw body directly, sidestepping Scrapy's
    # Content-Type-based response class detection
    hxs = Selector(text=response.body)
    detail_url_list = hxs.xpath('//li[@class="good-list"]/@href').extract()
    for url in detail_url_list:
        if 'goods' in url:
            yield Request(url, callback=self.parse_detail)

# This snippet comes from: http://www.sharejs.com/codes/python/9049

5. Speed up web scraper

Here's a collection of things to try:

- use the latest Scrapy version (if not already)

- check whether non-standard middlewares are used

- try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs; see the settings sketch after this list)

- turn off logging: LOG_ENABLED = False (docs)

- try yielding items in a loop instead of collecting them into a list and returning it

- use a local caching DNS (see this thread)

- check whether the site uses a download threshold that limits your download speed (see this thread)

- log CPU and memory usage during the spider run and see if there are any problems there

- try running the same spider under the scrapyd service

- see if grequests + lxml performs better (ask if you need any help implementing this)

- try running Scrapy on PyPy; see Running Scrapy on PyPy
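
A hedged sketch pulling the tunable items above into settings.py; the numbers are assumptions to adjust against the target site and your bandwidth:

CONCURRENT_REQUESTS = 64              # default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 32   # default is 8
LOG_ENABLED = False                   # logging has measurable overhead at volume
DNSCACHE_ENABLED = True               # in-process DNS cache (on by default)
REACTOR_THREADPOOL_MAXSIZE = 20       # more threads for DNS resolution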

