Python Crawler (16): Scraping Bank Wealth Management Product Information with Scrapy (120,000+ records in total)


Author: 山阴少年 | Published 2018-03-15 16:39

The goal of this Scrapy crawler is to scrape the information of all bank wealth management products listed on the 融360 (rong360) website and store it in MongoDB. A screenshot of the page is shown below; in total there are more than 120,000 records.

(Screenshot: bank wealth management product listing)

We will not go over how to create and run a Scrapy project again here and only give the relevant code. Readers interested in those basics can refer to: Scrapy Crawler (4): Scraping the Douban Top 250 Movie Posters.
Modify items.py as follows; it stores the information of each wealth management product, such as the product name and the issuing bank.

    import scrapy
    
    class BankItem(scrapy.Item):
        # define the fields for your item here like:
        name = scrapy.Field()
        bank = scrapy.Field()
        currency = scrapy.Field()
        startDate = scrapy.Field()
        endDate = scrapy.Field()
        period = scrapy.Field()
        proType = scrapy.Field()
        profit = scrapy.Field()
        amount = scrapy.Field()
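
As a quick aside (this snippet is illustrative and not part of the project files), a BankItem behaves much like a dict: fields are assigned and read with key access, and the MongoDB pipeline further below relies on exactly this when it converts the item with dict(item).

    from bank.items import BankItem

    item = BankItem()
    item['name'] = 'sample product'   # fields are accessed like dict keys
    item['bank'] = 'sample bank'
    print(dict(item))                 # {'name': 'sample product', 'bank': 'sample bank'}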
    

Create the spider file bankSpider.py with the following code, which extracts the details of each wealth management product from the listing pages.

    import scrapy
    from bank.items import BankItem
    
    class bankSpider(scrapy.Spider):
        name = 'bank'
        start_urls = ['https://www.rong360.com/licai-bank/list/p1']
    
        def parse(self, response):
    
            # skip the header row of the product table
            trs = response.css('tr')[1:]

            for tr in trs:
                # create a fresh item for each table row
                item = BankItem()
                item['name'] = tr.xpath('td[1]/a/text()').extract_first()
                item['bank'] = tr.xpath('td[2]/p/text()').extract_first()
                item['currency'] = tr.xpath('td[3]/text()').extract_first()
                item['startDate'] = tr.xpath('td[4]/text()').extract_first()
                item['endDate'] = tr.xpath('td[5]/text()').extract_first()
                item['period'] = tr.xpath('td[6]/text()').extract_first()
                item['proType'] = tr.xpath('td[7]/text()').extract_first()
                item['profit'] = tr.xpath('td[8]/text()').extract_first()
                item['amount'] = tr.xpath('td[9]/text()').extract_first()
    
                yield item
    
            # follow the "next page" link; the listing may contain one or two
            # such links, and none at all on the last page
            next_pages = response.css('a.next-page')

            if len(next_pages) == 1:
                next_page_link = next_pages.xpath('@href').extract_first()
            elif len(next_pages) > 1:
                next_page_link = next_pages[1].xpath('@href').extract_first()
            else:
                next_page_link = None
           
            if next_page_link:
                next_page = "https://www.rong360.com" + next_page_link
                yield scrapy.Request(next_page, callback=self.parse)
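
Before launching the full crawl, it can be worth sanity-checking the row selectors in isolation. Below is a minimal sketch that runs the same XPath expressions against a hand-written HTML fragment via Scrapy's Selector; the fragment is made up for illustration and only mimics the column layout bankSpider expects, not the real rong360 markup.

    from scrapy.selector import Selector

    # a made-up fragment with one header row and one data row
    html = """
    <table>
      <tr><th>header</th></tr>
      <tr>
        <td><a>Sample product</a></td>
        <td><p>Sample bank</p></td>
        <td>CNY</td>
      </tr>
    </table>
    """

    rows = Selector(text=html).css('tr')[1:]                 # skip the header row
    for tr in rows:
        print(tr.xpath('td[1]/a/text()').extract_first())    # Sample product
        print(tr.xpath('td[2]/p/text()').extract_first())    # Sample bank
        print(tr.xpath('td[3]/text()').extract_first())      # CNY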
    

To store the scraped data in MongoDB, we need to modify pipelines.py as follows:

    # pipeline that inserts the scraped items into MongoDB
    import pymongo
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    
    class BankPipeline(object):
        def __init__(self):
            # connect database
            self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])
    
            # use the username and password to authenticate with MongoDB
            # self.client.admin.authenticate(settings['MONGO_USER'], settings['MONGO_PSW'])
            
            # handles to the MongoDB database and collection
            self.db = self.client[settings['MONGO_DB']]
            self.coll = self.db[settings['MONGO_COLL']] 
    
        def process_item(self, item, spider):
            postItem = dict(item)
            self.coll.insert_one(postItem)
            return item
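
Reading the project settings at import time (as above) works, but current Scrapy documentation recommends wiring settings in through the from_crawler hook and opening the connection in open_spider. A rough equivalent of the same pipeline written in that style is sketched below; the class name MongoBankPipeline is arbitrary.

    import pymongo

    class MongoBankPipeline(object):
        """Same behaviour as BankPipeline, written in the from_crawler style."""

        def __init__(self, host, port, db, coll):
            self.host, self.port, self.db_name, self.coll_name = host, port, db, coll

        @classmethod
        def from_crawler(cls, crawler):
            # read the connection parameters from the project settings
            s = crawler.settings
            return cls(s.get('MONGO_HOST', 'localhost'),
                       s.getint('MONGO_PORT', 27017),
                       s.get('MONGO_DB'),
                       s.get('MONGO_COLL'))

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(host=self.host, port=self.port)
            self.coll = self.client[self.db_name][self.coll_name]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.coll.insert_one(dict(item))
            return item

If you use this variant instead, point ITEM_PIPELINES at bank.pipelines.MongoBankPipeline.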
    

The MongoDB-related parameters, such as MONGO_HOST and MONGO_PORT, are configured in settings.py. Modify settings.py as follows:

    1. ROBOTSTXT_OBEY = False
    2. ITEM_PIPELINES = {'bank.pipelines.BankPipeline': 300}
    3. Add the MongoDB connection parameters:
    MONGO_HOST = "localhost"  # host IP
    MONGO_PORT = 27017  # port
    MONGO_DB = "Spider"  # database name
    MONGO_COLL = "bank"  # collection name
    # MONGO_USER = ""
    # MONGO_PSW = ""
    

The username and password can be added as needed.
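
If your MongoDB instance requires authentication, one option is to pass the credentials straight to MongoClient in the pipeline. This is only a sketch: it assumes a reasonably recent pymongo, and the authSource value depends on the database in which the user was created.

    import pymongo

    client = pymongo.MongoClient(
        host="localhost",             # settings['MONGO_HOST']
        port=27017,                   # settings['MONGO_PORT']
        username="your_user",         # e.g. settings['MONGO_USER']
        password="your_password",     # e.g. settings['MONGO_PSW']
        authSource="admin",           # database where the user was created
    )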

Next, we can run the crawler.
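
One convenient way to launch it is a small helper script in the project root (the file name run.py is arbitrary); running scrapy crawl bank from the project directory does exactly the same thing.

    # run.py, placed next to scrapy.cfg in the project root
    from scrapy import cmdline

    cmdline.execute("scrapy crawl bank".split())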

The results of the run are shown below:

(Screenshot: crawler run output)

The whole crawl took about 3 hours and collected more than 120,000 records, which is impressively efficient!
Finally, let's take one more look at the data in MongoDB:

(Screenshot: the scraped data in MongoDB)
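
For a quick programmatic spot-check of the collection, a sketch along these lines works, assuming the Spider database and bank collection configured above and a pymongo version that provides count_documents:

    import pymongo

    client = pymongo.MongoClient(host="localhost", port=27017)
    coll = client["Spider"]["bank"]

    print(coll.count_documents({}))   # total number of scraped products
    print(coll.find_one())            # inspect one sample document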

Perfect! That's all for this post; everyone is welcome to discuss and share.
