Python Crawler (16): Scraping Bank Wealth Management Product Information with Scrapy (120,000+ records in total)


Author: 山阴少年 | Published 2018-03-15 16:39

The goal of this Scrapy crawler is to scrape the information of all bank wealth management products listed on the 融360 (rong360) website and store it in MongoDB. A screenshot of the page is shown below; in total there are more than 120,000 records.

(Screenshot: bank wealth management product listing)

We will not go over how to create and run a Scrapy project again here and only give the relevant code. Readers interested in those basics can refer to: Scrapy Crawler (4): Scraping the Douban Top 250 Movie Posters.
Modify items.py as follows; it stores the information of each wealth management product, such as the product name and the issuing bank.

    import scrapy
    
    class BankItem(scrapy.Item):
        # define the fields for your item here like:
        name = scrapy.Field()
        bank = scrapy.Field()
        currency = scrapy.Field()
        startDate = scrapy.Field()
        endDate = scrapy.Field()
        period = scrapy.Field()
        proType = scrapy.Field()
        profit = scrapy.Field()
        amount = scrapy.Field()
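
As a quick aside (this snippet is illustrative and not part of the project files), a BankItem behaves much like a dict: fields are assigned and read with key access, and the MongoDB pipeline further below relies on exactly this when it converts the item with dict(item).

    from bank.items import BankItem

    item = BankItem()
    item['name'] = 'sample product'   # fields are accessed like dict keys
    item['bank'] = 'sample bank'
    print(dict(item))                 # {'name': 'sample product', 'bank': 'sample bank'}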
    

Create the spider file bankSpider.py with the following code, which extracts the details of each wealth management product from the listing pages.

    import scrapy
    from bank.items import BankItem
    
    class bankSpider(scrapy.Spider):
        name = 'bank'
        start_urls = ['https://www.rong360.com/licai-bank/list/p1']
    
        def parse(self, response):
    
            # skip the header row of the product table
            trs = response.css('tr')[1:]

            for tr in trs:
                # create a fresh item for each table row
                item = BankItem()
                item['name'] = tr.xpath('td[1]/a/text()').extract_first()
                item['bank'] = tr.xpath('td[2]/p/text()').extract_first()
                item['currency'] = tr.xpath('td[3]/text()').extract_first()
                item['startDate'] = tr.xpath('td[4]/text()').extract_first()
                item['endDate'] = tr.xpath('td[5]/text()').extract_first()
                item['period'] = tr.xpath('td[6]/text()').extract_first()
                item['proType'] = tr.xpath('td[7]/text()').extract_first()
                item['profit'] = tr.xpath('td[8]/text()').extract_first()
                item['amount'] = tr.xpath('td[9]/text()').extract_first()
    
                yield item
    
            # follow the "next page" link; the listing may contain one or two
            # such links, and none at all on the last page
            next_pages = response.css('a.next-page')

            if len(next_pages) == 1:
                next_page_link = next_pages.xpath('@href').extract_first()
            elif len(next_pages) > 1:
                next_page_link = next_pages[1].xpath('@href').extract_first()
            else:
                next_page_link = None
           
            if next_page_link:
                next_page = "https://www.rong360.com" + next_page_link
                yield scrapy.Request(next_page, callback=self.parse)
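
Before launching the full crawl, it can be worth sanity-checking the row selectors in isolation. Below is a minimal sketch that runs the same XPath expressions against a hand-written HTML fragment via Scrapy's Selector; the fragment is made up for illustration and only mimics the column layout bankSpider expects, not the real rong360 markup.

    from scrapy.selector import Selector

    # a made-up fragment with one header row and one data row
    html = """
    <table>
      <tr><th>header</th></tr>
      <tr>
        <td><a>Sample product</a></td>
        <td><p>Sample bank</p></td>
        <td>CNY</td>
      </tr>
    </table>
    """

    rows = Selector(text=html).css('tr')[1:]                 # skip the header row
    for tr in rows:
        print(tr.xpath('td[1]/a/text()').extract_first())    # Sample product
        print(tr.xpath('td[2]/p/text()').extract_first())    # Sample bank
        print(tr.xpath('td[3]/text()').extract_first())      # CNY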
    

To store the scraped data in MongoDB, we need to modify pipelines.py as follows:

    # pipeline that inserts the scraped items into MongoDB
    import pymongo
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    
    class BankPipeline(object):
        def __init__(self):
            # connect database
            self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])
    
            # use the username and password to authenticate with MongoDB
            # self.client.admin.authenticate(settings['MONGO_USER'], settings['MONGO_PSW'])
            
            # handles to the MongoDB database and collection
            self.db = self.client[settings['MONGO_DB']]
            self.coll = self.db[settings['MONGO_COLL']] 
    
        def process_item(self, item, spider):
            postItem = dict(item)
            self.coll.insert_one(postItem)
            return item
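
Reading the project settings at import time (as above) works, but current Scrapy documentation recommends wiring settings in through the from_crawler hook and opening the connection in open_spider. A rough equivalent of the same pipeline written in that style is sketched below; the class name MongoBankPipeline is arbitrary.

    import pymongo

    class MongoBankPipeline(object):
        """Same behaviour as BankPipeline, written in the from_crawler style."""

        def __init__(self, host, port, db, coll):
            self.host, self.port, self.db_name, self.coll_name = host, port, db, coll

        @classmethod
        def from_crawler(cls, crawler):
            # read the connection parameters from the project settings
            s = crawler.settings
            return cls(s.get('MONGO_HOST', 'localhost'),
                       s.getint('MONGO_PORT', 27017),
                       s.get('MONGO_DB'),
                       s.get('MONGO_COLL'))

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(host=self.host, port=self.port)
            self.coll = self.client[self.db_name][self.coll_name]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.coll.insert_one(dict(item))
            return item

If you use this variant instead, point ITEM_PIPELINES at bank.pipelines.MongoBankPipeline.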
    

The MongoDB-related parameters, such as MONGO_HOST and MONGO_PORT, are configured in settings.py. Modify settings.py as follows:

    1. ROBOTSTXT_OBEY = False
    2. ITEM_PIPELINES = {'bank.pipelines.BankPipeline': 300}
    3. Add the MongoDB connection parameters:
    MONGO_HOST = "localhost"  # host IP
    MONGO_PORT = 27017  # port
    MONGO_DB = "Spider"  # database name
    MONGO_COLL = "bank"  # collection name
    # MONGO_USER = ""
    # MONGO_PSW = ""
    

The username and password can be added as needed.
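
If your MongoDB instance requires authentication, one option is to pass the credentials straight to MongoClient in the pipeline. This is only a sketch: it assumes a reasonably recent pymongo, and the authSource value depends on the database in which the user was created.

    import pymongo

    client = pymongo.MongoClient(
        host="localhost",             # settings['MONGO_HOST']
        port=27017,                   # settings['MONGO_PORT']
        username="your_user",         # e.g. settings['MONGO_USER']
        password="your_password",     # e.g. settings['MONGO_PSW']
        authSource="admin",           # database where the user was created
    )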

Next, we can run the crawler.
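
One convenient way to launch it is a small helper script in the project root (the file name run.py is arbitrary); running scrapy crawl bank from the project directory does exactly the same thing.

    # run.py, placed next to scrapy.cfg in the project root
    from scrapy import cmdline

    cmdline.execute("scrapy crawl bank".split())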

The results of the run are shown below:

(Screenshot: crawler run output)

The whole crawl took about 3 hours and collected more than 120,000 records, which is impressively efficient!
Finally, let's take one more look at the data in MongoDB:

(Screenshot: the scraped data in MongoDB)
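
For a quick programmatic spot-check of the collection, a sketch along these lines works, assuming the Spider database and bank collection configured above and a pymongo version that provides count_documents:

    import pymongo

    client = pymongo.MongoClient(host="localhost", port=27017)
    coll = client["Spider"]["bank"]

    print(coll.count_documents({}))   # total number of scraped products
    print(coll.find_one())            # inspect one sample document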

Perfect! That's all for this post; everyone is welcome to discuss and share.
