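I. Installing Scrapy

Enough said: install Ubuntu.

Setting up the environment on Windows is more trouble than it is worth.

XPath example

1. Create a new project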

scrapy startproject tutorial

2. Run the project

scrapy crawl dmoz

3. Open the interactive shell for testing

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

4. Suppose you want to match the src attribute in the following markup

<div class="float-r">

<img src="/img/moz/obooksm.gif" width="84" height="55" alt="[Book Mozilla]">

</div>

Use the following line:

response.xpath("//div[@class = 'float-r']/img/@src").extract()

The output is

[u'/img/moz/obooksm.gif']
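The path in step 4 can also be tried offline. The sketch below is a stand-in that uses only the standard library, not Scrapy's selector: xml.etree.ElementTree supports just a subset of XPath and cannot select attributes with /@src, so the attribute is read with .get() instead. It also assumes the snippet is rewritten as well-formed XML (the img tag is self-closed and a hypothetical body wrapper is added).

```python
import xml.etree.ElementTree as ET

# The markup from step 4, made well-formed for the XML parser (assumption:
# <img ... /> is self-closed and wrapped in a hypothetical <body> element).
SNIPPET = """
<body>
<div class="float-r">
<img src="/img/moz/obooksm.gif" width="84" height="55" alt="[Book Mozilla]" />
</div>
</body>
""".strip()

root = ET.fromstring(SNIPPET)

# Same path as in the Scrapy shell, minus the trailing /@src step:
# any <div class="float-r">, then its direct <img> children.
images = root.findall(".//div[@class='float-r']/img")

# Read the src attribute from each matched element.
srcs = [img.get("src") for img in images]
print(srcs)
```

In the Scrapy shell the /@src step does this last part for you, which is why the one-line expression is enough there.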

Main code

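# scrapy.contrib.*, SgmlLinkExtractor and HtmlXPathSelector were deprecated and
# later removed from Scrapy; the imports below are the current equivalents.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Assumption: TorrentItem (see the item code) lives in tutorial/items.py.
from tutorial.items import TorrentItem


class MininovaSpider(CrawlSpider):
    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    # Follow every link whose URL matches /tor/<digits> and pass the page to parse_torrent.
    rules = [Rule(LinkExtractor(allow=[r'/tor/\d+']), callback='parse_torrent')]

    def parse_torrent(self, response):
        # Build one item per torrent page; extract() returns a list of matched strings.
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//div[@id='description']").extract()
        torrent['size'] = response.xpath("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent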
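The allow pattern in the rule above is a regular expression that the link extractor searches for within each candidate URL. As a rough illustration using only the re module (no Scrapy involved), the sketch below shows which URLs such a pattern would let through. The sample URLs and the torrent id are hypothetical.

```python
import re

# The allow pattern from the spider's rule, written as a raw string.
ALLOW = re.compile(r"/tor/\d+")

# Hypothetical candidate links, as they might appear on the start page.
candidate_urls = [
    "http://www.mininova.org/tor/2657665",  # torrent page: digits after /tor/
    "http://www.mininova.org/today",        # listing page: no /tor/ segment
    "http://www.mininova.org/tor/",         # /tor/ without any digits
]

# search() matches anywhere in the URL, so the pattern does not need to
# describe the scheme or the domain.
followed = [url for url in candidate_urls if ALLOW.search(url)]
print(followed)
```

Only the first URL is kept, so only pages of that shape would reach parse_torrent.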

Item code

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class TorrentItem(Item):
    # One Field per value the spider fills in parse_torrent.
    url = Field()
    name = Field()
    description = Field()
    size = Field()
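Because the spider fills name, description and size from extract(), which returns a list of strings, those fields end up as JSON arrays rather than plain strings once an item is serialized. The sketch below uses a plain dict as a stand-in for a TorrentItem and the standard json module; every value in it is a hypothetical placeholder, not real scraped data.

```python
import json

# Plain-dict stand-in for one TorrentItem; all values are hypothetical.
record = {
    "url": "http://www.mininova.org/tor/2657665",  # response.url is a string
    "name": ["Example torrent"],                    # extract() gives a list
    "description": ["<div id='description'>Example description</div>"],
    "size": ["1.2 megabyte"],
}

# Serialize to a JSON string and read it back to confirm the shape survives.
as_json = json.dumps(record, sort_keys=True)
restored = json.loads(as_json)

print(as_json)
```

If single strings are preferred in the output, one option is to take the first element of each list in the spider before storing it on the item.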

Running the spider and saving the output as JSON