美文网首页
用Scrapy采集公管学院新闻

用Scrapy采集公管学院新闻

作者: 安小宇 | 来源:发表于2017-05-21 16:54 被阅读0次

    采集对象:四川大学公共管理学院新闻动态及内容
    爬取规则:用css选择器的方法来进行元素定位

    采集过程

    激活,进入虚拟环境


    1.png

    创建项目


    2.png

    修改items.py文件

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    class GgnewsItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        time = scrapy.Field()
        content = scrapy.Field()
        img = scrapy.Field()
    

    编写爬虫

    import scrapy
    
    from ggnews.items import GgnewsItem
    
    class GgnewsSpider(scrapy.Spider):
        name = "spidernews"
        start_urls = [
            'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1',
        ]
    
        def parse(self, response):
            for href in response.css('div.pb30.mb30 div.right_info.p20.bgf9 ul.index_news_ul.dn li a.fl::attr(href)'):
                url = response.urljoin(href.extract())
                yield scrapy.Request(url, callback=self.parse2)
    
                next_page = response.css('div.w100p div.px_box.w1000.auto.ovh.cf div.pb30.mb30 div.mobile_pager.dn li.c::text').extract_first()
                if next_page is not None:
                    next_url = int(next_page) + 1
                    next_urls = '?c=special&sid=1&page=%s' % next_url
                    print next_urls
                    next_urls = response.urljoin(next_urls)
                    yield scrapy.Request(next_urls,callback = self.parse)
    
        def parse2(self, response):
            items = []
            for new in response.css('div.w1000.auto.cf div.w780.pb30.mb30.fr div.right_info.p20'):
                    item = GgnewsItem()
                    item['title'] = new.css('div.detail_zy_title h1::text').extract_first(),
                    item['time'] = new.css('div.detail_zy_title p::text').extract_first(),
                    item['content'] = new.css('div.detail_zy_c.pb30.mb30 p span::text').extract(),
                    item['img'] = new.css('div.detail_zy_c.pb30.mb30 p.MsoNormal img::attr(src)').extract(),
                    items.append(item)
    
            return items
    

    将爬虫文件拖进spiders文件夹下

    3.png
    4.png

    执行爬虫

    scrapy crawl spidernews -o spidernews.xml
    

    (开始几次一直出现 ImportError: No module named items的错误,查百度发现时spiders 目录中的.py文件不能和项目名同名的问题,对其文件名进行修改)


    5.png
    scrapy crawl spidernews -o spidernews.json
    
    7.png

    得到数据


    6.png 8.png 9.png 10.png

    相关文章

      网友评论

          本文标题:用Scrapy采集公管学院新闻

          本文链接:https://www.haomeiwen.com/subject/vrtvxxtx.html