Beike (贝壳网) Wuhan Second-Hand Housing Data Analysis: Data Collection

Author: 一半芒果 | Published 2019-11-17 14:49
    Approach:

    1. The Beike Wuhan second-hand housing listings start at https://wh.ke.com/ershoufang/
    2. Use the Scrapy framework and loop over 100 listing pages, each with about 30 listings;
    3. Collect the title/description, house info, tags, total price, unit price, floor, year built, layout, orientation, publish time, follower count, and related fields;
    4. Parse the pages with XPath (a quick standalone check of the page structure is sketched after this list);
    5. Save the result as a CSV file.
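
    Before wiring up the full spider, it can help to confirm the page structure by hand. The sketch below is not part of the original project: it fetches the first listing page with requests and counts the entries using the same XPath the spider uses later; the requests/parsel imports and the hard-coded User-Agent are assumptions for this quick check only.

    import requests
    from parsel import Selector

    # Fetch one listing page with a browser-like User-Agent (the site tends to
    # reject the default python-requests User-Agent).
    url = 'https://wh.ke.com/ershoufang/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 '
                             '(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'}
    resp = requests.get(url, headers=headers, timeout=10)

    # Count the listing entries using the same XPath as the spider below.
    sel = Selector(text=resp.text)
    rooms = sel.xpath('//*[@id="beike"]//ul[@class="sellListContent"]//div[@class="info clear"]')
    print(len(rooms))   # roughly 30 entries per page is expected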

    I. Preparation
    • Create a Scrapy project:
     scrapy startproject BKZF
    
    • Create the spider file:
     scrapy genspider beike ke.com
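
    Note that the project is created here as BKZF, while the spider import and the settings below reference a package named ITEM; whichever name is used, it must be consistent throughout. Assuming the package is named ITEM, the two commands above leave roughly this standard Scrapy layout (shown for orientation, not taken verbatim from the original post):

    ITEM/
        scrapy.cfg
        ITEM/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                beike.py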
    
    II. Building the framework

    (1) items.py: define the Item fields

    import scrapy
    
    class ItemItem(scrapy.Item):
        detailinfo = scrapy.Field()   # listing title / description
        info = scrapy.Field()         # layout, area, orientation, year built, etc.
        location = scrapy.Field()     # community / position
        followinfo = scrapy.Field()   # follower count and publish time
        tag = scrapy.Field()          # house tags
        totalprice = scrapy.Field()   # total price (万元)
        unitprice = scrapy.Field()    # unit price (per square metre)
    
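
    As a side note, a scrapy.Item behaves like a dict, which is what lets the CsvItemExporter in the pipeline further below write it out column by column. A tiny sketch with made-up values:

    item = ItemItem()
    item['detailinfo'] = 'sample listing title'   # hypothetical value
    item['totalprice'] = '120'                    # hypothetical value, in 万元
    print(dict(item))   # {'detailinfo': 'sample listing title', 'totalprice': '120'}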

    (2) spiders/beike.py (the spider)

    import scrapy
    from ITEM.items import ItemItem
    
    class BeikeSpider(scrapy.Spider):
        name = 'beike'
        allowed_domains = ['ke.com']
        baseurl = 'https://wh.ke.com/ershoufang/PG{}/'
        # build the 100 listing-page URLs up front
        start_urls = []
        for i in range(1, 101):
            url = baseurl.format(i)
            start_urls.append(url)

        def parse(self, response):
            # each listing sits in a div.info.clear inside the sellListContent list
            room_list = response.xpath('//*[@id="beike"]//ul[@class="sellListContent"]//div[@class="info clear"]')
            # print('count:', len(room_list))
            for i in room_list:
                item = ItemItem()
                # listing title / description
                item['detailinfo'] = i.xpath('.//div[@class="title"]/a/text()').extract()[0]
                # layout / area / orientation etc., with whitespace stripped out
                item['info'] = i.xpath('.//div[@class="houseInfo"]/text()').extract()[1].strip().replace(' ', '').replace('\n', '')
                # community / position
                item['location'] = i.xpath('.//div[@class="flood"]/div[@class="positionInfo"]/a/text()').extract_first().strip()
                # follower count and publish time
                item['followinfo'] = i.xpath('.//div[@class="followInfo"]/text()').extract()[1].strip().replace(' ', '').replace('\n', '').replace('/', '|')
                # house tags
                item['tag'] = i.xpath('.//div[@class="tag"]//text()').extract()[1].strip().replace(' ', '').replace('\n', '').replace('/', '|')
                # total price (万元) and unit price (taken from the data-price attribute)
                item['totalprice'] = i.xpath('.//div[@class="totalPrice"]/span/text()').extract_first()
                item['unitprice'] = i.xpath('.//div[@class="unitPrice"]//@data-price').extract_first()
                yield item
    
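
    The XPath expressions above are easiest to verify interactively with scrapy shell before running the full crawl. A rough session might look like the following (note that scrapy shell sends Scrapy's default User-Agent, so the site may respond differently than it does with the middleware below):

    scrapy shell 'https://wh.ke.com/ershoufang/'
    >>> rooms = response.xpath('//*[@id="beike"]//ul[@class="sellListContent"]//div[@class="info clear"]')
    >>> len(rooms)
    >>> rooms[0].xpath('.//div[@class="title"]/a/text()').extract_first()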

    (3) middlewares.py

    import random
    # Rotate the User-Agent header for each request
    class ItemDownloaderMiddleware(object):
        def __init__(self):
            self.user_agent_list = ["Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"]
    
        def process_request(self, request, spider):
            # Pick a random User-Agent for every outgoing request
            ua = random.choice(self.user_agent_list)
            request.headers['User-Agent'] = ua
            return None
    
        def process_response(self, request, response, spider):
            # Log which User-Agent was actually sent (useful while debugging)
            print(request.headers['User-Agent'])
            return response
    
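
    As an optional variation (not from the original post), the User-Agent list could live in settings.py under a custom key such as USER_AGENT_LIST (name assumed here) and be read through Scrapy's from_crawler hook, which keeps the middleware itself short:

    import random

    class ItemDownloaderMiddleware(object):
        def __init__(self, user_agent_list):
            self.user_agent_list = user_agent_list

        @classmethod
        def from_crawler(cls, crawler):
            # USER_AGENT_LIST is a custom setting assumed to be defined in settings.py;
            # getlist() returns an empty list if it is missing.
            return cls(crawler.settings.getlist('USER_AGENT_LIST'))

        def process_request(self, request, spider):
            if self.user_agent_list:
                request.headers['User-Agent'] = random.choice(self.user_agent_list)
            return None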

    (4)pipelines.py

    from scrapy.exporters import CsvItemExporter
    # Persist the items to a CSV file
    class ItemPipeline(object):
        def open_spider(self, spider):
            # CsvItemExporter writes bytes, so open the output file in binary mode
            self.file = open('beike.csv', 'wb')
            self.exporter = CsvItemExporter(self.file)
            self.exporter.start_exporting()
    
        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item
    
        def close_spider(self, spider):
            self.exporter.finish_exporting()
            self.file.close()
    
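
    For a quick export, Scrapy's built-in feed exports can also write the CSV without a custom pipeline, by passing an output option to the crawl command:

    scrapy crawl beike -o beike.csv

    The custom pipeline above does the same job while keeping the export logic inside the project, so either approach works.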

    (5) With everything in place, don't forget settings.py
    It is a good habit to enable the matching settings as soon as each piece of code is written, so nothing gets forgotten.

    LOG_FILE = 'beike.log'
    LOG_LEVEL = 'INFO'
    ROBOTSTXT_OBEY = False          # do not check robots.txt
    DOWNLOAD_DELAY = 3              # wait 3 seconds between requests
    DOWNLOADER_MIDDLEWARES = {
       'ITEM.middlewares.ItemDownloaderMiddleware': 543,
    }
    ITEM_PIPELINES = {
       'ITEM.pipelines.ItemPipeline': 300,
    }
    AUTOTHROTTLE_ENABLED = True     # AUTOTHROTTLE_MAX_DELAY only takes effect when this is on
    AUTOTHROTTLE_MAX_DELAY = 60
    

    III. Run the spider (from the project root directory, where scrapy.cfg is located)

    scrapy crawl beike
    

    IV. Inspect the data


    [Screenshot: preview of the exported beike.csv data]
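
    To take a first look at the result in code rather than in a spreadsheet, the exported file can be loaded with pandas (assumed to be installed); this is also the natural starting point for the analysis part of this series:

    import pandas as pd

    # Load the CSV produced by the pipeline and check its size and first rows.
    df = pd.read_csv('beike.csv')
    print(df.shape)    # roughly 100 pages x 30 listings
    print(df.head())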
