Scrapy Crawler in Practice: Scraping the Zimuku Subtitle Site

Author: CPP后台服务器开发 | Published 2018-11-10 09:53


    1. First, create the Scrapy project

    Create the project:
    scrapy startproject zimuku
    
    Create the spider:
    cd zimuku
    scrapy genspider zimu zimuku.cn
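    For reference, these two commands generate Scrapy's standard project skeleton; with Scrapy 1.x defaults the layout looks roughly like this (your version may differ slightly):

```
zimuku/
├── scrapy.cfg            # deploy configuration
└── zimuku/               # the project's Python package
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        ├── __init__.py
        └── zimu.py       # the spider created by genspider
```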
    

    (Screenshots: console output of the two commands and the generated project tree.)

    All of the scaffolding and template files have been created for us.
    Let's look at them in turn:

    zimu.py
    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class ZimuSpider(scrapy.Spider):
        name = 'zimu'
        allowed_domains = ['zimuku.cn']
        start_urls = ['http://zimuku.cn/']
    
        def parse(self, response):
            pass
    
    items.py
    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class ZimukuItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        pass
    
    pipelines.py
    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    class ZimukuPipeline(object):
        def process_item(self, item, spider):
            return item
    
    

    Those are the three most important files; I won't list the others one by one.
    Next, let's analyze the page.

    2. Page analysis

    (Screenshot: the zimuku.cn listing page, with the target text marked by a red box.)

    Our goal is to save this content. Since Scrapy ships with its own selector tools, we'll use XPath for the matching.
    The XPath that extracts the text in the red box is: /html/body/div[2]/div/div/div[2]/table/tbody/tr[1]/td[1]/a/b/text()
    (Note: browser devtools insert <tbody> into copied paths; if the raw HTML the server sends has no <tbody>, this selector will match nothing and the tbody step should be dropped.)
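    Before wiring this selector into the spider, it helps to see what an indexed XPath step like tr[1]/td[1]/a/b/text() selects. Here is a minimal sketch using the standard library's ElementTree, which supports only a small XPath subset (unlike Scrapy's full selectors); the HTML below is a made-up stand-in for the zimuku result table:

```python
import xml.etree.ElementTree as ET

# A miniature stand-in for the zimuku listing table (the real page is larger).
html = """
<table>
  <tbody>
    <tr><td><a href="/detail/1.html"><b>Subtitle One</b></a></td></tr>
    <tr><td><a href="/detail/2.html"><b>Subtitle Two</b></a></td></tr>
  </tbody>
</table>
"""

root = ET.fromstring(html)
# tr[1]/td[1]/a/b picks the <b> inside the first cell of the first row,
# mirroring the indexed steps of the selector used in this article.
first = root.find("./tbody/tr[1]/td[1]/a/b")
print(first.text)  # -> Subtitle One
```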

    3. Writing the code
    (1) First, edit items.py

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class ZimukuItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # the field we want to scrape
        text = scrapy.Field()
    

    (2) Edit zimu.py

    # -*- coding: utf-8 -*-
    import scrapy
    #import the item class defined in items.py
    from zimuku.items import ZimukuItem
    
    class ZimuSpider(scrapy.Spider):
        name = 'zimu'
        allowed_domains = ['zimuku.cn']
        start_urls = ['http://zimuku.cn/']
    
        def parse(self, response):
            '''
            :param response: the downloaded page to parse
            :return: yields an item containing the matched text
            '''
            name = response.xpath("/html/body/div[2]/div/div/div[2]/table/tbody/tr[1]/td[1]/a/b/text()").extract()
            item = ZimukuItem()
            item['text'] = name
            yield item
    
    

    (3) Edit pipelines.py (this is what processes the scraped items)

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    class ZimukuPipeline(object):
        def process_item(self, item, spider):
            # adjust the path to your own machine
            with open("F:\\python\\1.txt", 'a') as fp:
                fp.write(str(item['text']))
            print(item['text'])
            return item  # return the item so any later pipelines receive it
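    Reopening the file for every item works, but it is wasteful on large crawls. Scrapy also calls open_spider/close_spider hooks on pipelines, so a common pattern is to keep one file handle open for the whole crawl. A sketch (ZimukuFilePipeline and the temp-file path are illustrative names, not from this article; the hook names themselves are real Scrapy pipeline methods):

```python
import os
import tempfile

class ZimukuFilePipeline(object):
    """Sketch: keep the output file open across the whole crawl
    using Scrapy's open_spider/close_spider pipeline hooks."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # called once, when the spider starts
        self.fp = open(self.path, "a", encoding="utf-8")

    def process_item(self, item, spider):
        self.fp.write(str(item["text"]) + "\n")
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        # called once, when the spider finishes
        self.fp.close()

# Standalone demonstration (Scrapy itself is not needed to exercise the hooks):
path = os.path.join(tempfile.gettempdir(), "zimuku_demo.txt")
p = ZimukuFilePipeline(path)
p.open_spider(None)
p.process_item({"text": "Subtitle One"}, None)
p.close_spider(None)
```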
    
    
    

    (4) settings.py

    #tell Scrapy which pipeline handles the scraped items
    #(300 is a priority between 0 and 1000; lower values run first)
    ITEM_PIPELINES = {'zimuku.pipelines.ZimukuPipeline': 300,}
    

    (5) Run it

    #without logging
    scrapy crawl zimu --nolog
    #with logging
    scrapy crawl zimu
    
    It's best to run with logging on; otherwise some errors go unnoticed, and when a problem does occur you won't know where it came from.
    
    (Screenshot: console output of the successful crawl.)

    The run succeeded; now let's check whether the file was saved.


    (Screenshot: the saved 1.txt output file.)

    OK, it worked!
    This Scrapy walkthrough was built step by step, so it should be a useful reference whether you are just starting out or brushing up.
