7. Python Scrapy Framework: A Quick Tutorial

Author: 波罗的海de夏天 | Published 2019-03-11 15:32

    Crawl target: 美剧天堂 (meijutt.com)

    Project setup:
    1. cmd: cd PyCharmProject (the directory that will hold the project)
    2. cmd: scrapy startproject movie
    3. cmd: cd movie
    4. cmd: scrapy genspider meiju meijutt.com
    5. Open the project in an IDE (PyCharm):
    items.py -- defines the storage template used to structure the scraped data

    import scrapy

    class MovieItem(scrapy.Item):
        # define the fields for your item here:
        name = scrapy.Field()
    
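    An Item behaves like a dict that only accepts the fields declared on the class. The sketch below is an illustration only, a small stdlib mimic of that behavior, not Scrapy's real implementation; the `MiniItem` and `Field` classes here are made up for the demonstration.

```python
# Illustration only: a stdlib mimic of how scrapy.Item behaves.
# Scrapy's Item is dict-like but rejects keys that were never declared.
class Field(dict):
    """Marker for a declared field (Scrapy's Field is also a dict subclass)."""

class MiniItem(dict):
    fields = {}
    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

class MovieItem(MiniItem):
    fields = {'name': Field()}

item = MovieItem()
item['name'] = 'Some Show'      # fine: 'name' was declared
try:
    item['rating'] = 9.0        # rejected: 'rating' was never declared
except KeyError as e:
    print(e)
```

    This is why a typo like `item['nmae']` fails loudly at assignment time instead of silently producing bad output.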

    meiju.py -- contains the actual spider code

    import scrapy
    from movie.items import MovieItem

    class MeijuSpider(scrapy.Spider):
        name = 'meiju'
        allowed_domains = ['meijutt.com']
        start_urls = ['http://www.meijutt.com/new100.html']

        # Equivalent to start_urls above:
        # def start_requests(self):
        #     urls = ['http://www.meijutt.com/new100.html']
        #     for url in urls:
        #         yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            movies = response.xpath('//ul[@class="top-list  fn-clear"]/li')
            for each_movie in movies:
                item = MovieItem()
                # .get() returns None instead of raising IndexError
                # when the XPath matches nothing
                item['name'] = each_movie.xpath('./h5/a/@title').get()
                yield item
    
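    The extraction logic in parse() can be tried offline. The sketch below uses the stdlib's xml.etree as a stand-in for Scrapy selectors, applied to a made-up HTML snippet; the real page's markup may differ, so treat this only as a demonstration of the XPath shape.

```python
# Offline sketch of the parse() logic, with xml.etree standing in for
# Scrapy selectors. The snippet is invented; the live page may differ.
import xml.etree.ElementTree as ET

snippet = """<html><body>
<ul class="top-list  fn-clear">
  <li><h5><a title="Show A" href="/a.html">Show A</a></h5></li>
  <li><h5><a title="Show B" href="/b.html">Show B</a></h5></li>
</ul>
</body></html>"""

root = ET.fromstring(snippet)
# Same path as the spider: each <li> under the target <ul>, then h5/a/@title
names = [li.find('./h5/a').get('title')
         for li in root.findall(".//ul[@class='top-list  fn-clear']/li")]
print(names)  # ['Show A', 'Show B']
```

    Scrapy's own `scrapy shell <url>` command gives the same kind of interactive XPath testing against the live page.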

    pipelines.py -- defines how scraped items are stored: to a file, a database, or elsewhere

    class MoviePipeline(object):
        def process_item(self, item, spider):
            with open("my_meiju.txt", 'a', encoding="utf-8") as fp:
                fp.write(item['name'])
                fp.write('\n------------\n')
            # returning the item lets any later pipelines process it too
            return item
    
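    What the Scrapy engine does with this class can be sketched by hand: instantiate the pipeline once, then call process_item() for every item the spider yields. The driver loop below is a simplification of the engine (Scrapy also calls open_spider/close_spider hooks, omitted here); a temp directory keeps the demo file out of the project.

```python
# Hand-driven sketch of how the engine uses the pipeline: one instance,
# one process_item() call per yielded item. Simplified: real Scrapy also
# invokes open_spider/close_spider hooks.
import os
import tempfile

class MoviePipeline(object):
    def process_item(self, item, spider):
        with open("my_meiju.txt", 'a', encoding="utf-8") as fp:
            fp.write(item['name'])
            fp.write('\n------------\n')
        return item

os.chdir(tempfile.mkdtemp())          # keep the demo file out of the project
pipeline = MoviePipeline()
for item in [{'name': 'Show A'}, {'name': 'Show B'}]:
    pipeline.process_item(item, spider=None)

with open("my_meiju.txt", encoding="utf-8") as fp:
    content = fp.read()
print(content.count('------------'))  # 2
```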

    settings.py -- the configuration file; sets the user agent, download delay, pipelines, etc.

    ITEM_PIPELINES = {'movie.pipelines.MoviePipeline': 100}
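    The number after the pipeline path is its order value (0-1000): when several pipelines are registered, lower values run first. The second entry below is hypothetical, added only to show the ordering.

```python
# settings.py fragment: order values (0-1000) decide pipeline order,
# lowest first. DedupePipeline is hypothetical, shown for illustration.
ITEM_PIPELINES = {
    'movie.pipelines.MoviePipeline': 100,    # runs first
    # 'movie.pipelines.DedupePipeline': 300, # would run after (hypothetical)
}
```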
    

    6. cmd: cd movie
    7. cmd: scrapy crawl meiju (or scrapy crawl meiju --nolog to suppress log output)


Source link: https://www.haomeiwen.com/subject/yfqvpqtx.html