Python Scrapy: Scraping Name Compendium (姓名大全) Data

Author: Fizz翊 | Published 2018-11-25 12:50

    Welcome to my personal blog: fizzyi

    Project overview

    Target URL: http://www.resgain.net/xmdq.html

    What is scraped: every surname listed at that URL, and all the given names under each surname.

    Scraping steps:

    • First, scrape all the surnames: the pinyin surname, its Chinese characters, and each surname's URL.
    • Then follow each surname's URL and scrape the names under it. Every surname has ten list pages, though it turns out not every page actually contains names.
    • Finally, open each name's detail page and scrape how many people share the name, plus its Five Elements (五行) and Three Talents (三才).

    Environment and framework: Python 3, Scrapy

    Volume scraped: 435 surnames, about 1.94 million names

    Code

    1 Preparation

    Create a Scrapy project, then generate the two spiders:

    scrapy startproject baijiaxing1
    
    scrapy genspider baijiaxing2 resgain.net/xmdq.html
    
    scrapy genspider spider_xingming resgain.net/xmdq.html
    

    A note on why there are two spiders:

    Scrapy crawls concurrently. I originally wrote everything in one spider, so surnames and names were scraped at the same time; but each name record carries its surname's id, so a name could reach the pipeline before its surname had been written to the database, which caused errors. There must be better ways to solve this (one is sketched below), but I had only just started with Scrapy, so I settled for this blunt approach.

    baijiaxing2 scrapes the surnames; spider_xingming scrapes the names.
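
    For reference, a single-spider setup could resolve the surname id inside the pipeline, inserting the surname row the first time it is seen. A minimal sketch, assuming the baijiaxing table from section 7 below; get_or_create_xingshi_id is a hypothetical helper, not part of this project:

    def get_or_create_xingshi_id(cursor, db, xingshi, href='', zhongwen=''):
        # Look up the surname's id; insert the row first if it is missing, so
        # a name item can always resolve its surname id even when the surname
        # item has not been stored yet.
        cursor.execute('SELECT id FROM baijiaxing WHERE xingshi = %s', (xingshi,))
        row = cursor.fetchone()
        if row:
            return row[0]
        cursor.execute(
            'INSERT INTO baijiaxing(xingshi, href, xingshi_zhongwen) VALUES (%s, %s, %s)',
            (xingshi, href, zhongwen),
        )
        db.commit()
        return cursor.lastrowid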

    2 Define the items

    items.py

    # -*- coding: utf-8 -*-

    import scrapy


    class Xingshi_Item(scrapy.Item):
        xingshi = scrapy.Field()           # surname in pinyin, e.g. 'zhao'
        href = scrapy.Field()              # URL of the surname's name-list pages
        xingshi_zhongwen = scrapy.Field()  # surname in Chinese characters


    class Xingming_Item(scrapy.Item):
        name = scrapy.Field()                    # the full name
        the_same_people_number = scrapy.Field()  # how many people share the name
        boy_ratio = scrapy.Field()               # share of male holders
        girl_ratio = scrapy.Field()              # share of female holders
        five_elements = scrapy.Field()           # 五行 (Five Elements)
        three_talents = scrapy.Field()           # 三才 (Three Talents)
        xingshi = scrapy.Field()                 # surname, used to look up its id

    

    3 Scrape all the surnames

    # -*- coding: utf-8 -*-
    import scrapy

    from baijiaxing1.items import Xingshi_Item


    class Baijiaxing2Spider(scrapy.Spider):
        name = 'baijiaxing2'
        start_urls = ('http://www.resgain.net/xmdq.html',)

        def parse(self, response):
            content = response.xpath('//div[@class="col-xs-12"]/a')

            for i in content:
                # e.g. an href of '//zhao.resgain.net/name_list.html'
                # yields the pinyin surname 'zhao'
                xingshi = i.xpath('./@href').extract()[0].split('.')[0].split('/')[-1]

                href = 'http:' + i.xpath('./@href').extract()[0]
                item = Xingshi_Item()
                item['xingshi'] = xingshi
                item['href'] = href
                # The link text starts with the Chinese surname followed by
                # '姓名...', so splitting on '姓名' keeps just the surname
                item['xingshi_zhongwen'] = i.xpath('./text()').extract()[0].split('姓名')[0]

                yield item
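
    To sanity-check the surname spider before involving MySQL, its items can be dumped to a file with Scrapy's built-in feed export (the output filename here is arbitrary):

    scrapy crawl baijiaxing2 -o xingshi.json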
    
    

    4 Scrape all the names

    # -*- coding: utf-8 -*-
    import scrapy

    from baijiaxing1.items import Xingming_Item


    class SpiderXingmingSpider(scrapy.Spider):
        name = 'spider_xingming'
        start_urls = ('http://www.resgain.net/xmdq.html',)

        def parse(self, response):
            content = response.xpath('//div[@class="col-xs-12"]/a/@href').extract()

            for i in content:
                href = 'http:' + i
                # Each surname has ten list pages: name_list_0.html ... name_list_9.html
                base = href.split('/name')[0] + '/name_list_'
                for page in range(10):
                    url = base + str(page) + '.html'
                    yield scrapy.Request(url, callback=self.parse_in_html)

        # Parse one list page and fire a request per name's detail page
        def parse_in_html(self, response):
            person_info = response.xpath('//div[@class="col-xs-12"]/a')
            base_url = 'http://' + response.url.split('/')[2]
            # The subdomain is the pinyin surname, e.g. 'zhao' in zhao.resgain.net
            xingshi = response.url.split('/')[2].split('.')[0]
            for every_one in person_info:
                name = every_one.xpath('./text()').extract()[0]
                href = every_one.xpath('./@href').extract()[0]
                the_person_info_url = base_url + href
                the_item = Xingming_Item()
                the_item['name'] = name
                the_item['xingshi'] = xingshi
                yield scrapy.Request(the_person_info_url, meta={'the_item': the_item},
                                     callback=self.parse_every_html)

        # Parse one detail page: how many people share the name, the gender
        # split, and the name's Five Elements (五行) and Three Talents (三才)
        def parse_every_html(self, response):
            the_item = response.meta['the_item']
            the_same_people_number = \
                response.xpath('//div[@class="navbar-brand"]/text()').extract_first().split('人')[0].split('有')[1]
            boy_ratio = \
                response.xpath('//div[@class="progress"]/div[contains(@class,"progress-bar")]/text()').extract()[0].split('情况')[0]
            girl_ratio = \
                response.xpath('//div[@class="progress"]/div[contains(@class,"progress-bar")]/text()').extract()[1].split('情况')[0]
            five_elements = \
                response.xpath('//div[@class="panel-body"]/div[@class="col-xs-6"]/blockquote/text()').extract()[0]
            three_talents = \
                response.xpath('//div[@class="panel-body"]/div[@class="col-xs-6"]/blockquote/text()').extract()[1]
            the_item['the_same_people_number'] = the_same_people_number
            the_item['boy_ratio'] = boy_ratio
            the_item['girl_ratio'] = girl_ratio
            the_item['five_elements'] = five_elements
            the_item['three_talents'] = three_talents

            yield the_item
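
    To make the paging logic concrete, here is the URL handling from parse in isolation, assuming a surname link of the form //zhao.resgain.net/name_list.html (inferred from the splits above, not checked against the live site):

    # Hypothetical link as it appears on xmdq.html
    href = 'http:' + '//zhao.resgain.net/name_list.html'

    # Everything before '/name' is the surname's own host
    base = href.split('/name')[0] + '/name_list_'
    # base == 'http://zhao.resgain.net/name_list_'

    # The spider then requests name_list_0.html through name_list_9.html
    urls = [base + str(page) + '.html' for page in range(10)]

    List pages that contain no names are harmless: parse_in_html finds no links on them and simply yields nothing.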
    
    

    5 pipelines.py

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import pymysql

    from baijiaxing1.items import Xingshi_Item, Xingming_Item


    class XingShiPipeline(object):
        def __init__(self, host, database, user, password, port):
            self.host = host
            self.database = database
            self.user = user
            self.password = password
            self.port = port

        def process_item(self, item, spider):
            if isinstance(item, Xingshi_Item):
                sql = 'INSERT INTO baijiaxing(xingshi,href,xingshi_zhongwen) VALUES (%s,%s,%s);'
                self.cursor.execute(sql, (item['xingshi'], str(item['href']), item['xingshi_zhongwen']))
                self.db.commit()
            elif isinstance(item, Xingming_Item):
                # Resolve the surname's id. execute() only returns the row
                # count, so the id itself has to be fetched from the cursor.
                self.cursor.execute('SELECT id FROM baijiaxing WHERE xingshi = %s', (item['xingshi'],))
                xingshi_id = self.cursor.fetchone()[0]
                sql = ('INSERT INTO xingming(name,the_same_people_number,boy_ratio,girl_ratio,'
                       'five_elements,three_talents,xingshi_id) VALUES (%s,%s,%s,%s,%s,%s,%s);')
                self.cursor.execute(sql, (item['name'], item['the_same_people_number'],
                                          item['boy_ratio'], item['girl_ratio'],
                                          item['five_elements'], item['three_talents'],
                                          xingshi_id))
                self.db.commit()
            return item

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                host=crawler.settings.get('PYMYSQL_HOST'),
                database=crawler.settings.get('PYMYSQL_DATABASE'),
                user=crawler.settings.get('PYMYSQL_USER'),
                password=crawler.settings.get('PYMYSQL_PASSWORD'),
                port=crawler.settings.get('PYMYSQL_PORT'),
            )

        def open_spider(self, spider):
            self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                      database=self.database, port=self.port)
            self.cursor = self.db.cursor()

        def close_spider(self, spider):
            self.db.close()
    
    
    

    Because there are two item types, process_item has to check which one it received and write to the matching table.

    6 settings.py

    Finally, configure the database connection and the request headers in settings.py. The pipeline also has to be registered in ITEM_PIPELINES, or process_item is never called:

    USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)'
    ROBOTSTXT_OBEY = False

    # Register the MySQL pipeline
    ITEM_PIPELINES = {
        'baijiaxing1.pipelines.XingShiPipeline': 300,
    }

    # MySQL settings
    PYMYSQL_HOST = '127.0.0.1'
    PYMYSQL_DATABASE = 'test1'
    PYMYSQL_USER = 'root'
    PYMYSQL_PASSWORD = '123456'
    PYMYSQL_PORT = 3306
    

    7 Database tables

    baijiaxing table: id, xingshi, href, xingshi_zhongwen

    xingming table: id, name, the_same_people_number, boy_ratio, girl_ratio, five_elements, three_talents, xingshi_id
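
    For completeness, here is one possible schema matching those columns. The post does not give the actual DDL, so the column types below are guesses:

    -- Assumed DDL; only the column names come from the post
    CREATE TABLE baijiaxing (
        id INT AUTO_INCREMENT PRIMARY KEY,
        xingshi VARCHAR(32),           -- surname in pinyin
        href VARCHAR(255),             -- URL of the surname's list pages
        xingshi_zhongwen VARCHAR(16)   -- surname in Chinese characters
    );

    CREATE TABLE xingming (
        id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(32),
        the_same_people_number VARCHAR(32),
        boy_ratio VARCHAR(16),
        girl_ratio VARCHAR(16),
        five_elements VARCHAR(64),
        three_talents VARCHAR(64),
        xingshi_id INT
    );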

    8 Run

    There are two spiders, and baijiaxing2 has to finish before spider_xingming starts, so I wrote a run.py:

    run.py

    import os

    # os.system blocks until each crawl finishes, so the surname spider is
    # guaranteed to complete before the name spider starts
    os.system("scrapy crawl baijiaxing2")
    os.system("scrapy crawl spider_xingming")
    

    GitHub: https://github.com/Fizzyi/baijiaxing/tree/master
