Scrapy

Author: 迷路 | Published: 2014-08-19 20:11 | Views: 231

    I. Installing Scrapy

    Not much to say here: just install Ubuntu.

    Trying to set up the environment on Windows is a fool's errand.

    XPath example

    1. Create a new project

    scrapy startproject tutorial

    2. Run the project (dmoz is the name of the spider to run)

    scrapy crawl dmoz

    3. Open the interactive shell for testing

    scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

    4. Suppose you want to match the src attribute in the following markup

    <div class="float-r">
        <img src="/img/moz/obooksm.gif" width="84" height="55" alt="[Book Mozilla]">
    </div>

    Use the following line:

    response.xpath("//div[@class = 'float-r']/img/@src").extract()

    The output is:

    [u'/img/moz/obooksm.gif']
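To try the same selection without starting the Scrapy shell, here is a minimal sketch using only the Python standard library. It assumes the fragment is wrapped in a `<body>` root and that `<img>` is self-closed so it parses as XML; both changes are made only for this sketch. ElementTree supports just a subset of XPath and cannot select attributes with `/@src`, so the sketch selects the element and reads the attribute afterwards.

```python
import xml.etree.ElementTree as ET

# The fragment from step 4, wrapped in <body> and with <img> self-closed
# so that it is well-formed XML (assumptions made for this sketch only).
FRAGMENT = """
<body>
  <div class="float-r">
    <img src="/img/moz/obooksm.gif" width="84" height="55" alt="[Book Mozilla]" />
  </div>
</body>
"""


def img_sources(markup):
    """Return the src of every <img> directly under a <div class="float-r">."""
    root = ET.fromstring(markup)
    # ElementTree cannot evaluate ".../img/@src", so pick the elements first
    # and read the attribute from each one.
    images = root.findall(".//div[@class='float-r']/img")
    return [img.get("src") for img in images]


print(img_sources(FRAGMENT))
```

Under Python 3 the strings in the printed list carry no `u` prefix; the prefix in the shell output above comes from Python 2.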

    Main spider code

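    # Imports as they were in Scrapy 0.x; later releases moved or removed these modules.
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule

    # Assumes the item class shown below is saved as tutorial/items.py in the project created earlier.
    from tutorial.items import TorrentItem


    class MininovaSpider(CrawlSpider):
        name = 'mininova.org'
        allowed_domains = ['mininova.org']
        start_urls = ['http://www.mininova.org/today']
        # Follow links that look like /tor/<digits> and pass each page to parse_torrent.
        rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+']), 'parse_torrent')]

        def parse_torrent(self, response):
            x = HtmlXPathSelector(response)
            torrent = TorrentItem()
            torrent['url'] = response.url
            # extract() returns a list of strings for each XPath query.
            torrent['name'] = x.select("//h1/text()").extract()
            torrent['description'] = x.select("//div[@id='description']").extract()
            torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
            return torrent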
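The `allow` pattern in the rule is a regular expression tested against candidate URLs. The sketch below shows which URLs such a pattern would accept, using only the `re` module. It assumes the link extractor searches for the pattern anywhere in the URL rather than requiring a full match, and the URLs in `candidates` are made up for illustration.

```python
import re

# Same pattern as in the spider's rule, written as a raw string.
TORRENT_LINK = re.compile(r"/tor/\d+")


def would_follow(url):
    """Return True if the pattern occurs anywhere in the URL."""
    return TORRENT_LINK.search(url) is not None


# Hypothetical URLs, not taken from the real site.
candidates = [
    "http://www.mininova.org/tor/2676093",
    "http://www.mininova.org/today",
    "http://www.mininova.org/tor/abc",
]

for url in candidates:
    print(url, would_follow(url))
```

Only the first URL contains `/tor/` followed by at least one digit, so it is the only one of the three that a rule like this would send to `parse_torrent`.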

    Item code

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    from scrapy.item import Item, Field


    class TorrentItem(Item):
        url = Field()
        name = Field()
        description = Field()
        size = Field()
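An item behaves much like a dict that only accepts the fields declared on its class, which is why the spider can write `torrent['url']` but a misspelled key would fail. The following is a simplified stand-in written for illustration, not Scrapy's actual implementation; `SketchItem`, `TorrentSketch`, and the sample values are hypothetical names and data.

```python
class SketchItem(dict):
    """Simplified stand-in for an item: a dict limited to declared keys."""

    fields = ()

    def __setitem__(self, key, value):
        # Reject keys that were not declared on the class.
        if key not in self.fields:
            raise KeyError(
                "%s does not support field: %s" % (type(self).__name__, key)
            )
        super().__setitem__(key, value)


class TorrentSketch(SketchItem):
    # Mirrors the four fields declared on TorrentItem.
    fields = ("url", "name", "description", "size")


item = TorrentSketch()
item["url"] = "http://www.mininova.org/tor/1"  # hypothetical URL
item["name"] = ["Example torrent"]             # extract() yields lists

try:
    item["seeders"] = 10  # not a declared field
except KeyError as exc:
    print("rejected:", exc)

print(dict(item))
```

Declaring fields up front catches typos in field names when the spider runs, instead of letting them end up silently in the exported data.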

    Running the spider and saving the output as JSON

    scrapy crawl mininova.org -o scraped_data.json -t json
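Once the crawl has finished, the exported file can be read back with the standard `json` module. The sketch below assumes the file holds a single JSON array with one object per item, and that each scraped field is a list of strings because the spider stores the result of `extract()`. The sample record is made up so that the example runs without a real crawl.

```python
import json
import os
import tempfile


def load_items(path):
    """Load the exported items from a JSON file into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Made-up sample standing in for real crawl output.
SAMPLE = [
    {
        "url": "http://www.mininova.org/tor/1",
        "name": ["Example torrent"],
        "description": ["<div id=\"description\">Example</div>"],
        "size": ["1.0 MB"],
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scraped_data.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(SAMPLE, handle)
    items = load_items(path)

for item in items:
    # Each field is a list, so take the first entry if there is one.
    name = item["name"][0] if item["name"] else ""
    print(name, item["url"])
```

With a real crawl, call `load_items("scraped_data.json")` from the directory where the `scrapy crawl` command was run.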

