Learning Scrapy: A Hands-On Crawler Log - Getting Started (1)


Author: eyeglasses | Published 2017-06-27 10:25

    Today is June 26, 2017, and I am starting to learn web crawling.

    The package I am using is Scrapy.

    I have already installed Anaconda3 in a Linux virtual machine, and Scrapy is installed, version 1.1.

    I am following https://doc.scrapy.org/en/1.1/intro/tutorial.html as the tutorial.

    I have used crawlers before, but only very simple ones. Now I need to scrape weather, earthquake and breaking-event information, so I am trying Scrapy for the job.

    First, create a project. I named it tianqi (pinyin for "weather").

    The command is:

    scrapy startproject tianqi

    A directory named tianqi has now appeared under my home directory.

    Enter that directory and list its contents:

    [root@wangqi tianqi]# ls -l

    total 8

    -rw-rw-r-- 1 eyeglasses root  256 Jun 26 14:21 scrapy.cfg

    drwxrwxr-x 4 eyeglasses root 4096 Jun 26 14:21 tianqi

    There is one directory and one file.

    The file is scrapy.cfg; the name suggests it is a configuration file, so let's see what is inside.

    ------------------------------------------------------------------------------------------

    # Automatically created by: scrapy startproject

    #

    # For more information about the [deploy] section see:

    # https://scrapyd.readthedocs.org/en/latest/deploy.html

    [settings]

    default = tianqi.settings

    [deploy]

    #url = http://localhost:6800/

    project = tianqi

    ----------------------------------------------------------------------------------------------------------

    Not much in there: [settings] points to the project's settings module, and the [deploy] section is only used when deploying the project to scrapyd.

    Now look inside the tianqi directory and see what it holds:

    [root@wangqi tianqi]# ls -l

    total 20

    -rw-rw-r-- 1 eyeglasses root    0 Jul 14  2016 __init__.py

    -rw-r--r-- 1 eyeglasses root  285 Jun 26 14:21 items.py

    -rw-r--r-- 1 eyeglasses root  286 Jun 26 14:21 pipelines.py

    drwxrwxr-x 2 eyeglasses root 4096 Jun 26 15:13 __pycache__

    -rw-r--r-- 1 eyeglasses root 3128 Jun 26 14:21 settings.py

    drwxrwxr-x 3 eyeglasses root 4096 Jun 26 15:49 spiders

    There are four files and two directories. The files are __init__.py, items.py, pipelines.py and settings.py; they all have the .py suffix, so they are Python source files. Let's look at each of them.

    [root@wangqi tianqi]#vi __init__.py 

    Running the command above shows that the file is empty. Every Python package contains an __init__.py file (it can define the package's attributes and methods); even when it is empty, its presence is what allows the directory to be treated as a package and imported as a module.
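
    As a quick illustration (assuming it is run from the outer tianqi directory, the one that holds scrapy.cfg), the empty __init__.py is what makes an import like this work:

    ---------------------------------------------------------------------

    # Hypothetical quick check: because tianqi/__init__.py exists,
    # tianqi/ is a package and its modules can be imported normally.
    from tianqi import settings

    print(settings.BOT_NAME)   # prints: tianqi

    ---------------------------------------------------------------------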

    ---------------------------------------------------------------------

    [root@wangqi tianqi]# vi items.py  # Items are the containers that hold the scraped data

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items

    #

    # See documentation in:

    # http://doc.scrapy.org/en/latest/topics/items.html

    import scrapy

    class TianqiItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        pass

    --------------------------------------------------------------------------------
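
    For now TianqiItem declares no fields. Once there is real weather data to store, fields could be added like this (just a sketch; the field names city and forecast are my own assumptions, not part of the tutorial):

    --------------------------------------------------------------------------------

    import scrapy

    class TianqiItem(scrapy.Item):
        city = scrapy.Field()      # e.g. "chengdu" (hypothetical field)
        forecast = scrapy.Field()  # raw forecast text from the page (hypothetical field)

    --------------------------------------------------------------------------------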

    Next, pipelines.py, which is the item pipeline:

    --------------------------------------------------------------

    [root@wangqi tianqi]# vi pipelines.py

    # -*- coding: utf-8 -*-

    # Define your item pipelines here

    #

    # Don't forget to add your pipeline to the ITEM_PIPELINES setting

    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

    class TianqiPipeline(object):
        def process_item(self, item, spider):
            return item

    --------------------------------------------------------------------
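
    As generated, the pipeline simply passes every item through. A minimal sketch of a pipeline that actually does something might drop items missing data (using the hypothetical forecast field from the item sketch above); note it also has to be registered in ITEM_PIPELINES in settings.py, e.g. ITEM_PIPELINES = {'tianqi.pipelines.TianqiPipeline': 300}, before Scrapy will run it:

    --------------------------------------------------------------------

    from scrapy.exceptions import DropItem

    class TianqiPipeline(object):
        def process_item(self, item, spider):
            # Drop items whose (hypothetical) forecast field is empty,
            # pass everything else on unchanged.
            if not item.get('forecast'):
                raise DropItem('missing forecast in %s' % item)
            return item

    --------------------------------------------------------------------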

    Now look at settings.py. It holds the project-wide configuration, including the robots.txt rules; the robots protocol defines which parts of a site crawlers are allowed to visit.

    [root@wangqi tianqi]# vi settings.py

    # -*- coding: utf-8 -*-

    # Scrapy settings for tianqi project

    #

    # For simplicity, this file contains only settings considered important or

    # commonly used. You can find more settings consulting the documentation:

    #

    #    http://doc.scrapy.org/en/latest/topics/settings.html

    #    http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html

    #    http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'tianqi'

    SPIDER_MODULES = ['tianqi.spiders']

    NEWSPIDER_MODULE = 'tianqi.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent

    #USER_AGENT = 'tianqi (+http://www.yourdomain.com)'

    # Obey robots.txt rules

    ROBOTSTXT_OBEY = True

    # Configure maximum concurrent requests performed by Scrapy (default: 16)

    #CONCURRENT_REQUESTS = 32

    # Configure a delay for requests for the same website (default: 0)

    # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay

    # See also autothrottle settings and docs

    #DOWNLOAD_DELAY = 3

    # The download delay setting will honor only one of:

    #CONCURRENT_REQUESTS_PER_DOMAIN = 16

    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)

    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)

    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:

    #DEFAULT_REQUEST_HEADERS = {

    #  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

    #  'Accept-Language': 'en',

    #}

    # Enable or disable spider middlewares

    # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
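
    Most of the file is commented-out defaults. Here is a sketch of the few settings I might end up changing for this project (the values below are my own assumptions, not what startproject generated):

    -----------------------------------------------------------------

    USER_AGENT = 'tianqi (+http://www.yourdomain.com)'  # identify the crawler politely
    DOWNLOAD_DELAY = 2                                  # pause between requests to the same site
    ITEM_PIPELINES = {
        'tianqi.pipelines.TianqiPipeline': 300,         # enable the pipeline sketched earlier
    }

    -----------------------------------------------------------------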

    Now for the two directories, starting with __pycache__.

    When a Python script is run for the first time, the interpreter compiles the *.py files and saves the bytecode in the __pycache__ directory.

    The next time the script runs, if the interpreter sees that a *.py file has not changed, it skips the compile step and runs the cached *.pyc file from __pycache__ directly.

    -----------------------------------------------------------------

    [root@wangqi __pycache__]# ls -l

    total 8

    -rw-r--r-- 1 eyeglasses root 125 Jun 26 15:13 __init__.cpython-35.pyc

    -rw-r--r-- 1 eyeglasses root 240 Jun 26 15:13 settings.cpython-35.pyc

    ---------------------------------------------------------------------------------

    Disabling __pycache__:

    One-off: add the -B option when running the script.

    Permanent: set the environment variable PYTHONDONTWRITEBYTECODE=1.

    That leaves the last directory, spiders, which is where the crawling code goes. Each spider targets a particular site; I usually name the file after the domain so it is easy to remember.

    [root@wangqi spiders]# ls -l

    total 12

    -rw-rw-r-- 1 eyeglasses root  161 Jul 14  2016 __init__.py

    drwxrwxr-x 2 eyeglasses root 4096 Jun 26 15:21 __pycache__

    -rw-r--r-- 1 eyeglasses root  589 Jun 26 15:10 weather_spider.py

    There are two files and another __pycache__ directory here. I created weather_spider.py myself; it holds the code that crawls the weather site. __init__.py and __pycache__ were explained in detail above.

    ---------------------------------------------------------------------------

    [root@wangqi spiders]# vi weather_spider.py

    #!/bin/env python
    # -*- coding:utf-8 -*-
    import scrapy

    class WeatherSpider(scrapy.Spider):
        name = "weather"

        def start_requests(self):
            urls = [
                'http://sc.weather.com.cn/chengdu/index.shtml',
                'http://sc.weather.com.cn/neijiang/index.shtml',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'weather-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)

    -----------------------------------------------------------------------------------------------
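
    A side note before running it: for a simple case like this, Scrapy also lets you replace start_requests with a start_urls attribute; it then builds the initial requests itself and sends each response to parse() by default. An equivalent sketch (the version I actually run below is the start_requests one above):

    -----------------------------------------------------------------------------------------------

    import scrapy

    class WeatherSpider(scrapy.Spider):
        name = "weather"
        # Scrapy generates the initial Requests from start_urls automatically.
        start_urls = [
            'http://sc.weather.com.cn/chengdu/index.shtml',
            'http://sc.weather.com.cn/neijiang/index.shtml',
        ]

        def parse(self, response):
            # Same logic as above: save the raw HTML of each page to a local file.
            page = response.url.split("/")[-2]
            filename = 'weather-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)

    -----------------------------------------------------------------------------------------------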

    Time to run it. Enter the command:

    [eyeglasses@wangqi spiders]$ scrapy crawl weather

    2017-06-27 10:18:31 [scrapy] INFO: Scrapy 1.1.1 started (bot: tianqi)

    2017-06-27 10:18:31 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tianqi.spiders', 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tianqi', 'SPIDER_MODULES': ['tianqi.spiders']}

    2017-06-27 10:18:31 [scrapy] INFO: Enabled extensions:['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']

    2017-06-27 10:18:31 [scrapy] INFO: Enabled downloader middlewares:['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']

    2017-06-27 10:18:31 [scrapy] INFO: Enabled spider middlewares:['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']

    2017-06-27 10:18:31 [scrapy] INFO: Enabled item pipelines:[]

    2017-06-27 10:18:31 [scrapy] INFO: Spider opened

    2017-06-27 10:18:31 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

    2017-06-27 10:18:31 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023

    2017-06-27 10:18:31 [scrapy] DEBUG: Redirecting (302) to <...> from <...>

    2017-06-27 10:18:31 [scrapy] DEBUG: Crawled (200)(referer: None)

    2017-06-27 10:18:32 [scrapy] DEBUG: Crawled (200)(referer: None)

    2017-06-27 10:18:32 [weather] DEBUG: Saved file weather-neijiang.html

    2017-06-27 10:18:32 [scrapy] DEBUG: Crawled (200)(referer: None)

    2017-06-27 10:18:32 [weather] DEBUG: Saved file weather-chengdu.html

    2017-06-27 10:18:32 [scrapy] INFO: Closing spider (finished)

    2017-06-27 10:18:32 [scrapy] INFO: Dumping Scrapy stats:

    {'downloader/request_bytes': 969,

    'downloader/request_count': 4,

    'downloader/request_method_count/GET': 4,

    'downloader/response_bytes': 34374,

    'downloader/response_count': 4,

    'downloader/response_status_count/200': 3,

    'downloader/response_status_count/302': 1,

    'finish_reason': 'finished',

    'finish_time': datetime.datetime(2017, 6, 27, 2, 18, 32, 553229),

    'log_count/DEBUG': 7,

    'log_count/INFO': 7,

    'response_received_count': 3,

    'scheduler/dequeued': 2,

    'scheduler/dequeued/memory': 2,

    'scheduler/enqueued': 2,

    'scheduler/enqueued/memory': 2,

    'start_time': datetime.datetime(2017, 6, 27, 2, 18, 31, 471145)}

    2017-06-27 10:18:32 [scrapy] INFO: Spider closed (finished)

    Check the directory: the HTML files have been generated. Success.
