Scraping Lianjia Hangzhou with Scrapy

Author: ISeeMoon | Published 2017-10-29 23:04

    After cramming on the concept of classes, I've become a bit more fluent with the Scrapy crawler framework, so I updated the Lianjia scraper I wrote a while back with BeautifulSoup.

    The target is still Lianjia Hangzhou second-hand housing, except that the for-sale section scraped last time is swapped for the sold-transactions (chengjiao) section.

    For learning Scrapy itself, the reference below can be consulted alongside this walkthrough.

    Reference material for the Scrapy crawler framework

    OK, back to business.

    The first step is to analyze the page structure. Open any Lianjia second-hand listings page and a quick count shows 30 listings per page (the screenshot only captured 4 of them), with 100 pages in total. (Figure: page structure)

    So the crawling strategy is straightforward:

    1. Collect the links of the 30 listings on each page.
    2. Follow each listing link and scrape its title, price and other fields.

    Note: breaking the process down further, step 1 means first getting the URL of every results page and then the 30 listing URLs on each of those pages; step 2 means requesting each listing URL from step 1 and extracting the title, price and other details.
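    Before diving into the real spider, here is a minimal, self-contained sketch of that two-level pattern in Scrapy; the spider name, the callback name parse_detail and the simplified selectors here are placeholders, not the code used later:

    import scrapy

    class SketchSpider(scrapy.Spider):
        name = 'sketch'
        start_urls = ['https://hz.lianjia.com/chengjiao/pg1/']

        def parse(self, response):
            # step 1: follow every listing link found on the results page
            for href in response.xpath("//div[@class='title']/a/@href").extract():
                yield scrapy.Request(href, callback=self.parse_detail)

        def parse_detail(self, response):
            # step 2: pull the detail fields out of the listing page
            yield {'title': response.xpath('//h1/text()').extract_first()}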

    However, Lianjia does not expose all the listings directly: no matter how many second-hand homes a section contains, it only ever shows 100 pages, 30 listings per page, i.e. at most 3,000 listings.


    (Figure: filter conditions)

    The only workaround is to partition the listings with different filter conditions, keep each filtered subset below 3,000 listings, and then merge the results from all the filters to obtain the full data set.

    Here I filter by total price.

    # Under the 0-500k (0-50万) total-price filter, the url is
    # url = "https://hz.lianjia.com/chengjiao/pg1/ea10000bp0ep50/"
    # where pg1 is page 1 under the current filter, bp0 is the lower bound
    # and ep50 the upper bound of the total price (in units of 10,000 yuan)

    # 1. The filter segments are set to
    # page_group_list = ['ea10000bp0ep50/',
    #                    'ea10000bp50ep100/',
    #                    'ea10000bp100ep120/',
    #                    'ea10000bp120ep140/',
    #                    'ea10000bp140ep160/',
    #                    'ea10000bp160ep180/',
    #                    'ea10000bp180ep200/',
    #                    'ea10000bp200ep250/',
    #                    'ea10000bp250ep300/',
    #                    'ea10000bp300ep10000/']

    # 2. Within each filter, pages are iterated via the number after pg
    #    pg(1,2,3,4,5....)

    # 3. The maximum page count of each filter also has to be fetched,
    #    because not every filter goes up to 100 pages
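    To make this concrete, here is a quick sketch that composes full request URLs the same way the spider below does, by concatenating the base URL, the page number and a filter segment (the truncated list is just for illustration):

    baseURL = 'https://hz.lianjia.com/chengjiao/pg'
    page_group_list = ['ea10000bp0ep50/', 'ea10000bp50ep100/']   # truncated for brevity

    for group in page_group_list:
        for page in range(1, 3):                                 # first two pages of each filter
            print(baseURL + str(page) + group)
    # -> https://hz.lianjia.com/chengjiao/pg1ea10000bp0ep50/  etc.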
    

    With the URLs analyzed, it's time to write the actual code. The Scrapy project consists of items, pipelines, settings and spiders: items defines the fields to scrape, pipelines handles the output of the scraped items, settings tunes the crawler's parameters, and spiders is the core where the actual crawling logic lives.
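    For orientation, this is roughly the layout that scrapy startproject Lianjia generates (the spider module itself is added afterwards, for example with scrapy genspider):

    Lianjia/
        scrapy.cfg            # project configuration
        Lianjia/
            __init__.py
            items.py          # a. item definitions
            middlewares.py
            pipelines.py      # d. item pipeline
            settings.py       # c. settings
            spiders/
                __init__.py
                chengjiao.py  # b. the spider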

    a. Define the items

    import scrapy

    class LianjiaItem(scrapy.Item):
        # listing title
        housename = scrapy.Field()
        # property tenure
        propertylimit = scrapy.Field()
        # listing link
        houselink = scrapy.Field()
        # asking total price
        totalprice = scrapy.Field()
        # asking unit price
        unitprice = scrapy.Field()
        # floor plan
        housetype = scrapy.Field()
        # gross floor area
        constructarea = scrapy.Field()
        # inner floor area
        housearea = scrapy.Field()
        # floor
        housefloor = scrapy.Field()
        # usage of the property
        house_use = scrapy.Field()
        # transaction attributes
        tradeproperty = scrapy.Field()
        # number of followers
        guanzhu = scrapy.Field()
        # number of viewings
        daikan = scrapy.Field()
        # administrative district
        district = scrapy.Field()
        # deal total price
        selltotalprice = scrapy.Field()
        # deal unit price
        sellunitprice = scrapy.Field()
        # deal date
        selltime = scrapy.Field()
        # days on market before the deal
        sellperiod = scrapy.Field()
        # average unit price of the residential complex
        villageunitprice = scrapy.Field()
        # year the complex was built
        villagetime = scrapy.Field()
    
    b. Define the spider

    # -*- coding: utf-8 -*-
    import scrapy
    import requests
    from lxml import etree
    import json
    from Lianjia.items import LianjiaItem
    import re


    class ChengjiaoSpider(scrapy.Spider):
        name = 'chengjiao'
        # allowed_domains = ['lianjia.com']
        baseURL = 'https://hz.lianjia.com/chengjiao/pg'
        offset_page = 1
        offset_list = 0
        page_group_list = ['ea10000bp0ep50/',
                           'ea10000bp50ep100/',
                           'ea10000bp100ep120/',
                           'ea10000bp120ep140/',
                           'ea10000bp140ep160/',
                           'ea10000bp160ep180/',
                           'ea10000bp180ep200/',
                           'ea10000bp200ep250/',
                           'ea10000bp250ep300/',
                           'ea10000bp300ep10000/']

        url = baseURL + str(offset_page) + page_group_list[offset_list]

        start_urls = [url]

        # get the maximum page count under the current filter
        def getmax(self, url):
            requ = requests.get(url, allow_redirects=False)
            if requ.status_code == 200:
                resp = requ.text
                tree = etree.HTML(resp)
                str_max = tree.xpath("//div[@class='page-box house-lst-page-box']/@page-data")[0]
                dic_max = json.loads(str_max)
                maxnum = dic_max['totalPage']
                return maxnum
            else:
                print('Open Page Error')
    
        # collect the listing urls on the current page
        # callback hands the response to the given method; meta passes the item along with the request
        def parse(self, response):
            node_list = response.xpath("//div[@class='info']/div[@class='title']/a")
            for node in node_list:
                item = LianjiaItem()
                item['houselink'] = node.xpath("./@href").extract()[0]
                yield scrapy.Request(item['houselink'], callback=self.parse_content, meta={'key': item})
            # if the current page number is below the filter's maximum, increment it and crawl the next page;
            # once the maximum is reached, every page under this filter has been crawled,
            # so reset the page number to 1 and move on to the next filter condition
            if self.offset_page < self.getmax(response.url):
                self.offset_page += 1
                nexturl = self.baseURL + str(self.offset_page) + self.page_group_list[self.offset_list]
                yield scrapy.Request(nexturl, callback=self.parse)
            else:
                if self.offset_list < len(self.page_group_list) - 1:
                    self.offset_page = 1
                    self.offset_list += 1
                    nexturl = self.baseURL + str(self.offset_page) + self.page_group_list[self.offset_list]
                    yield scrapy.Request(nexturl, callback=self.parse)
    
        # scrape the detail fields
        # meta carries the item handed over by the previous callback
        def parse_content(self, response):
            item = response.meta['key']
            # listing title
            try:
                item['housename'] = response.xpath("//div[@class='house-title']/div[@class='wrapper']/h1/text()").extract()[0].strip()
            except:
                item['housename'] = 'None'
            # property tenure
            try:
                item['propertylimit'] = response.xpath("//div[@class='content']/ul/li[13]/text()").extract()[0].strip()
            except:
                item['propertylimit'] = 'None'
            # asking total price
            try:
                item['totalprice'] = response.xpath("//div[@class='msg']/span[1]/label/text()").extract()[0].strip()
            except:
                item['totalprice'] = 'None'
            # floor plan
            try:
                item['housetype'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[1]/text()").extract()[0].strip()
            except:
                item['housetype'] = 'None'
            # gross floor area
            try:
                item['constructarea'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[3]/text()").extract()[0].strip()
            except:
                item['constructarea'] = 'None'
            # inner floor area
            try:
                item['housearea'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[5]/text()").extract()[0].strip()
            except:
                item['housearea'] = 'None'
            # usage of the property
            try:
                item['house_use'] = response.xpath("//div[@class='introContent']/div[@class='transaction']/div[@class='content']/ul/li[4]/text()").extract()[0].strip()
            except:
                item['house_use'] = 'None'
            # transaction attributes
            try:
                item['tradeproperty'] = response.xpath("//div[@class='introContent']/div[@class='transaction']/div[@class='content']/ul/li[2]/text()").extract()[0].strip()
            except:
                item['tradeproperty'] = 'None'
            # number of followers
            try:
                item['guanzhu'] = response.xpath("//div[@class='msg']/span[5]/label/text()").extract()[0].strip()
            except:
                item['guanzhu'] = 'None'
            # number of viewings
            try:
                item['daikan'] = response.xpath("//div[@class='msg']/span[4]/label/text()").extract()[0].strip()
            except:
                item['daikan'] = 'None'
            # administrative district
            try:
                pre_district = response.xpath("//section[@class='wrapper']/div[@class='deal-bread']/a[3]/text()").extract()[0].strip()
                pattern = u'(.*?)二手房成交价格'
                item['district'] = re.search(pattern, pre_district).group(1)
            except:
                item['district'] = 'None'
            # deal total price
            try:
                item['selltotalprice'] = response.xpath("//span[@class='dealTotalPrice']/i/text()").extract()[0].strip()
            except:
                item['selltotalprice'] = 'None'
            # deal unit price
            try:
                item['sellunitprice'] = response.xpath("//div[@class='price']/b/text()").extract()[0].strip()
            except:
                item['sellunitprice'] = 'None'
            # deal date
            try:
                item['selltime'] = response.xpath("//div[@id='chengjiao_record']/ul[@class='record_list']/li/p[@class='record_detail']/text()").extract()[0].split(u',')[-1]
            except:
                item['selltime'] = 'None'

            yield item
    
    c. Define the settings

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for Lianjia project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     http://doc.scrapy.org/en/latest/topics/settings.html
    #     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'Lianjia'
    
    SPIDER_MODULES = ['Lianjia.spiders']
    NEWSPIDER_MODULE = 'Lianjia.spiders'
    
    #LOG_FILE = r"C:\test\CHENGJ_pro.doc"
    #LOG_LEVEL = 'INFO'
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'Lianjia (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 0.5
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
       'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
       'Accept-Language': 'zh-CN,zh;q=0.9',
       'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'
    }
    
    # Enable or disable spider middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'Lianjia.middlewares.LianjiaSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'Lianjia.middlewares.MyCustomDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'Lianjia.pipelines.LianjiaPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    HTTPCACHE_ENABLED = True
    HTTPCACHE_EXPIRATION_SECS = 0
    HTTPCACHE_DIR = 'httpcache'
    HTTPCACHE_IGNORE_HTTP_CODES = []
    HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    d. Define the pipeline

    import json

    class LianjiaPipeline(object):
        def __init__(self):
            # output file for the scraped items, one JSON object per line
            self.f = open('c:\\test\\ceshi.json','w')

        def process_item(self, item, spider):
            # serialize the item as a JSON line and write it out
            content = json.dumps(dict(item),ensure_ascii=False)+'\n'
            self.f.write(content.encode('utf-8'))
            return item

        def close_spider(self,spider):
            self.f.close()
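    With the four pieces in place, the crawl is started from the project root (the directory containing scrapy.cfg); the pipeline above then writes one JSON object per line while the spider runs:

        scrapy crawl chengjiao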
    
    Addendum: converting the JSON output to an Excel-friendly table
    import json
    import pandas as pd

    path = r"C:\test\ceshi.json"
    f = open(path)

    # one JSON object per line -> list of dicts -> DataFrame
    records = [json.loads(line) for line in f.readlines()]
    df = pd.DataFrame(records)

    # gb18030 encoding so Excel on a Chinese locale opens the CSV correctly
    df.to_csv(r"C:\test\chengjiao.csv", encoding='gb18030')
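    As an aside, Scrapy's built-in feed exports can also dump the items straight to a file without the custom pipeline, in which case the conversion above would simply read that file instead:

        # output format is inferred from the file extension
        scrapy crawl chengjiao -o chengjiao.csv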
    

    After reading 静觅's tutorial, I rewrote the spider (the other parts are unchanged). The code is now cleaner overall, with far fewer conditionals and much less manual bookkeeping.

    import scrapy
    import requests
    from lxml import etree
    import json
    from Lianjia.items import LianjiaItem
    import re
    from scrapy.http import Request
    
    
    class ChengjiaoSpider(scrapy.Spider):
        name = 'chengjiao_pro'
        baseURL = 'https://hz.lianjia.com/chengjiao/pg'
        offset_page = 1
        page_group_list = ['ea10000bp0ep50/',
                          'ea10000bp50ep100/',
                           'ea10000bp100ep120/',
                           'ea10000bp120ep140/',
                           'ea10000bp140ep160/',
                           'ea10000bp160ep180/',
                           'ea10000bp180ep200/',
                           'ea10000bp200ep250/',
                           'ea10000bp250ep300/',
                           'ea10000bp300ep10000/']    
    
        
        # yield page 1 of every filter condition
        def start_requests(self):
            for i in self.page_group_list:
                url = self.baseURL + str(self.offset_page) + i
                yield Request(url, callback=self.parse)

        # read the filter's total page count from the pager's page-data attribute,
        # then yield a request for every page under that filter
        def parse(self, response):
            maxnum_dict = json.loads(response.xpath("//div[@class='page-box house-lst-page-box']/@page-data").extract()[0])
            maxnum = int(maxnum_dict['totalPage'])
            for num in range(1, maxnum + 1):
    #            item = LianjiaItem()
                split_str = self.baseURL + str(num)
                url = split_str + response.url.split(self.baseURL + str(self.offset_page))[1]
                yield Request(url, self.get_link, dont_filter=True)
    #            item['iurl'] = url
    #            item['resurl'] = response.url
    #            yield item
                
                
        # collect the listing links on one results page and follow each of them
        def get_link(self,response):
            node_list = response.xpath("//div[@class='info']/div[@class='title']/a")
            for node in node_list:
                item = LianjiaItem()
                item['houselink'] = node.xpath("./@href").extract()[0]
                yield scrapy.Request(item['houselink'],callback=self.parse_content,meta={'key':item})
    
        def parse_content(self,response):
            item = response.meta['key']
    #        listing title
            try:
                item['housename'] = response.xpath("//div[@class='house-title']/div[@class='wrapper']/h1/text()").extract()[0].strip()
            except:
                item['housename'] = 'None'
    #        property tenure
            try:
                item['propertylimit'] = response.xpath("//div[@class='content']/ul/li[13]/text()").extract()[0].strip()
            except:
                item['propertylimit'] = 'None'
    #        asking total price
            try:
                item['totalprice'] = response.xpath("//div[@class='msg']/span[1]/label/text()").extract()[0].strip()
            except:
                item['totalprice'] = 'None'
    #        floor plan
            try:
                item['housetype'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[1]/text()").extract()[0].strip()
            except:
                item['housetype'] = 'None'
    #        gross floor area
            try:
                item['constructarea'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[3]/text()").extract()[0].strip()
            except:
                item['constructarea'] = 'None'
    #        inner floor area
            try:
                item['housearea'] = response.xpath("//div[@class='introContent']/div[@class='base']/div[@class='content']/ul/li[5]/text()").extract()[0].strip()
            except:
                item['housearea'] = 'None'
    #        usage of the property
            try:
                item['house_use'] = response.xpath("//div[@class='introContent']/div[@class='transaction']/div[@class='content']/ul/li[4]/text()").extract()[0].strip()
            except:
                item['house_use'] = 'None'
    #        transaction attributes
            try:
                item['tradeproperty'] = response.xpath("//div[@class='introContent']/div[@class='transaction']/div[@class='content']/ul/li[2]/text()").extract()[0].strip()
            except:
                item['tradeproperty'] = 'None'
    #        number of followers
            try:
                item['guanzhu'] = response.xpath("//div[@class='msg']/span[5]/label/text()").extract()[0].strip()
            except:
                item['guanzhu'] = 'None'
    #        number of viewings
            try:            
                item['daikan'] = response.xpath("//div[@class='msg']/span[4]/label/text()").extract()[0].strip()
            except:
                item['daikan'] = 'None'
    #        administrative district
            try:
                pre_district = response.xpath("//section[@class='wrapper']/div[@class='deal-bread']/a[3]/text()").extract()[0].strip()
                pattern = u'(.*?)二手房成交价格'
                item['district'] = re.search(pattern,pre_district).group(1)
            except:
                item['district'] = 'None'
    #        deal total price
            try:
                item['selltotalprice'] = response.xpath("//span[@class='dealTotalPrice']/i/text()").extract()[0].strip()
            except:
                item['selltotalprice'] = 'None'
    #        deal unit price
            try:
                item['sellunitprice'] = response.xpath("//div[@class='price']/b/text()").extract()[0].strip()
            except:
                item['sellunitprice'] = 'None'
    #        deal date
            try:
                item['selltime'] = response.xpath("//div[@id='chengjiao_record']/ul[@class='record_list']/li/p[@class='record_detail']/text()").extract()[0].split(u',')[-1]
            except:
                item['selltime'] = 'None'
            yield item
    
