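[Web Scraping in Practice] Where Does Evergrande's Confidence Come From? Scraping the Number of Evergrande Property Projects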

Author: StataPython数据分析 | Published 2020-10-14 21:10


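Author: Ren Zhe (任哲), School of Economics, Zhongnan University of Economics and Law
Copy editor: Wang Ziyi (王子一)
Technical editor-in-chief: Zhang Xinyue (张馨月)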

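On September 24, a document titled "Report of Evergrande Group Co., Ltd. Requesting Support for Its Major Asset Restructuring Project" (《恒大集团有限公司关于恳请支持重大资产重组项目的情况报告》) circulated widely online, and Evergrande suddenly found itself in the spotlight. Evergrande Group quickly denied the rumor, and on September 30 it reached an agreement with its strategic investors, defusing the rumored 130 billion yuan debt crisis that was about to fall due. We cannot know exactly what Evergrande brought to the table in those negotiations, but whatever it relied on is surely tied to its real estate projects.
So, as one of China's leading real estate developers with a nationwide footprint, how many property projects does Evergrande Real Estate actually have? Today we will scrape Evergrande's official website together and see how its property portfolio is laid out.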

Scraping approach

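On Evergrande's official website, find the "Regional Companies" (地区公司) option on the home page. Clicking it lists all of Evergrande's branch companies nationwide.

Take the East China company (华东公司) as an example. Open Chrome's developer tools and inspect its entry in the list: the href attribute of the link holds the URL of that branch's own website.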

image

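Open the East China company's website and click "Featured Projects" (精品项目) to bring up the list of property projects in that region. From there, we can scrape the project lists of every region in the same way.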

image

Implementation

Before we begin, import the packages we need. Because the page to be scraped keeps changing, this example uses selenium to simulate mouse clicks:

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n46" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">from selenium import webdriver  
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from lxml import etree
import re
import pandas as pd
import time
import os </pre>

First, configure selenium and open Evergrande's official website:

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n48" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># Configure selenium
CHROME_OPTIONS = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}   # 1 = load images, 2 = do not load images
CHROME_OPTIONS.add_experimental_option("prefs", prefs)
# Open Evergrande's official website
url = "https://www.evergrande.com/Home"
CHROME_DRIVER = './Driver/chromedriver.exe'
# executable_path is the Selenium 3 style; Selenium 4 expects a Service object instead
driver = webdriver.Chrome(executable_path=CHROME_DRIVER, options=CHROME_OPTIONS)   # start the browser
driver.set_window_position(0, 0)
driver.maximize_window()   # maximize the browser window
driver.get(url)   # open the target page</pre>

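On the home page, we locate the "Regional Companies" link by XPath and click it, then read each branch's URL from the href attribute of its entry in the branch list so that we can open the branch site. To make later analysis easier, we also keep the branch's region in region. Using the East China branch as the example, the code is: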

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n50" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">driver.find_element(By.XPATH, "//a[@class='ad_a1']").click()   # click "Regional Companies"
time.sleep(2)

html = driver.page_source   # get the page source
tree = etree.HTML(html)

# Each branch's URL is stored in the href attribute of its link
filiale_xpath = "//div[@class='innerWrapper']/ul/li[1]/a/@href"
filiale_web = tree.xpath(filiale_xpath)
filiale_web = str(filiale_web[0])   # take the first match and convert it to a plain string

# Keep the branch's region in region for later analysis
region_xpath = "//div[@class='innerWrapper']/ul/li[1]/a/text()"
region = tree.xpath(region_xpath)
region = str(region[0])
region = region[0:-2]   # drop the last two characters, e.g. "华东公司" becomes "华东"</pre>
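
As a complement to the single-branch example, the sketch below pulls every branch's name and URL in one pass instead of indexing li[1]. It is only an illustration under stated assumptions: the markup in SAMPLE_HTML and its example.com URLs are made up, the helper names strip_suffix and extract_branches are hypothetical, and it uses the standard library's xml.etree.ElementTree, which needs well-formed markup (real pages should still go through lxml's etree.HTML as above).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the "Regional Companies" markup.
SAMPLE_HTML = """
<html><body>
  <div class="innerWrapper">
    <ul>
      <li><a href="http://hd.example.com/">华东公司</a></li>
      <li><a href="http://sz.example.com/">深圳公司</a></li>
    </ul>
  </div>
</body></html>
""".strip()


def strip_suffix(name, suffix="公司"):
    """Remove the trailing suffix only if it is actually present."""
    name = name.strip()
    return name[:-len(suffix)] if name.endswith(suffix) else name


def extract_branches(html):
    """Return (region, url) pairs for every branch link in the list."""
    root = ET.fromstring(html)
    links = root.findall(".//div[@class='innerWrapper']/ul/li/a")
    return [(strip_suffix(a.text or ""), a.get("href")) for a in links]


branches = extract_branches(SAMPLE_HTML)
print(branches)
```

Checking for the suffix before trimming is a little safer than region[0:-2], which would also cut two characters off a link text that does not end in "公司".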

Once the branch's website is open, we use selenium to click "Featured Projects", which brings up the project list for scraping:

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n52" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># Scrape the property projects of the branch
driver.get(filiale_web)
driver.find_element(By.XPATH, "//ul/li[@id='nav4']/a[@class='navA']").click()   # click "Featured Projects"
time.sleep(2)

estate_xpath = "//ul[@id='cl']/li/a/p[@class='title']"
estate_list = driver.find_elements(By.XPATH, estate_xpath)   # one element per project title
region_list = [region] * len(estate_list)</pre>

After these steps, the East China branch's project entries are stored in estate_list. The result looks like this:

image

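Next we need the data for the remaining branches. The XPaths of the branch links follow a regular pattern (only the li index changes), so we can loop over them to scrape every branch. Inside the loop we add some code to make it more robust: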

<pre spellcheck="false" class="md-fences md-end-block ty-contain-cm modeLoaded" lang="python" cid="n56" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: var(--monospace); font-size: 0.9em; display: block; break-inside: avoid; text-align: left; white-space: normal; background-image: inherit; background-position: inherit; background-size: inherit; background-repeat: inherit; background-attachment: inherit; background-origin: inherit; background-clip: inherit; background-color: rgb(248, 248, 248); position: relative !important; border: 1px solid rgb(231, 234, 237); border-radius: 3px; padding: 8px 4px 6px; margin-bottom: 15px; margin-top: 15px; width: inherit; color: rgb(51, 51, 51); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># Lists that accumulate the data from every branch
all_estate = []
all_region = []

for num in range(1, 27):
    # Open "Regional Companies" and read this branch's URL
    driver.find_element(By.XPATH, "//a[@class='ad_a1']").click()
    time.sleep(2)

    html = driver.page_source   # get the page source
    tree = etree.HTML(html)

    # Each branch's URL is stored in the href attribute of its link
    filiale_xpath = "//div[@class='innerWrapper']/ul/li[%d]/a/@href" % num
    filiale_web = str(tree.xpath(filiale_xpath)[0])

    # Keep the branch's region in region for later analysis
    region_xpath = "//div[@class='innerWrapper']/ul/li[%d]/a/text()" % num
    region = str(tree.xpath(region_xpath)[0])[0:-2]

    # Check whether the branch site opens normally
    try:
        driver.get(filiale_web)
        print(region, "branch site opened normally")
    except Exception:
        print(region, "branch site could not be opened")
        print(filiale_web)
        driver.get(url)   # go back to Evergrande's home page
        time.sleep(2)
        continue

    # Scrape this branch's property projects
    driver.find_element(By.XPATH, "//ul/li[@id='nav4']/a[@class='navA']").click()
    time.sleep(2)
    estate_xpath = "//ul[@id='cl']/li/a/p[@class='title']"
    estate_list = driver.find_elements(By.XPATH, estate_xpath)

    # Record each project title together with the branch's region
    for estate in estate_list:
        all_estate.append(estate.text)
        all_region.append(region)

    # Return to Evergrande's home page so the next iteration can proceed
    driver.get(url)
    time.sleep(2)</pre>
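
The loop pauses with a fixed time.sleep(2) after every click, which can be too short for a slow page and wasteful for a fast one. The usual alternative is to poll for a condition until a timeout, which is what selenium's WebDriverWait (already imported above) does. The sketch below illustrates that polling idea with the standard library only, so it runs without a browser; wait_until, its parameters, and fake_list_loaded are hypothetical names for illustration, not selenium API.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for "the project list has loaded": becomes non-empty on the third poll.
polls = {"count": 0}


def fake_list_loaded():
    polls["count"] += 1
    return ["项目A", "项目B"] if polls["count"] >= 3 else []


titles = wait_until(fake_list_loaded, timeout=2.0, interval=0.01)
print(titles, polls["count"])
```

In the scraper itself, the equivalent would likely be something along the lines of `WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, estate_xpath)))`, with TimeoutException caught where a branch page never shows its project list.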

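In the end, we find that the Henan branch's site fails to open. Reaching the Henan branch through a Baidu search shows that the URL in the href attribute of its link (http://hnzz.evergrande.com/) is wrong (points off for that!); the correct URL is https://hdzy.evergrande.com/. The Henan branch's projects therefore have to be scraped separately.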

image
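
One way to fold the Henan fix into the loop rather than scraping it by hand is a small table of known URL corrections that is consulted before each driver.get. This is only a sketch: URL_OVERRIDES and resolve_branch_url are hypothetical names, and the single entry comes from the two URLs quoted above.

```python
# Known corrections: URL published on the home page -> URL that actually works.
URL_OVERRIDES = {
    "http://hnzz.evergrande.com/": "https://hdzy.evergrande.com/",
}


def resolve_branch_url(href, overrides=URL_OVERRIDES):
    """Return the corrected URL when one is known, else the original href."""
    return overrides.get(href, href)


print(resolve_branch_url("http://hnzz.evergrande.com/"))
print(resolve_branch_url("https://www.evergrande.com/Home"))
```

Inside the loop this would replace driver.get(filiale_web) with driver.get(resolve_branch_url(filiale_web)), assuming the Henan site uses the same page layout as the other branches.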

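With the Henan branch scraped, we have the data for every branch. After tidying it up, we count 766 property projects for Evergrande Group in mainland China, short of the 870-plus projects stated on the official website. The gap may arise because the published project listings are incomplete or because not all of those projects are real estate developments. Either way, at least 766 projects nationwide gave Evergrande leverage in its negotiations with strategic investors. The figure below shows how these 766 projects are distributed across the country: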

image

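Geographically, Evergrande's footprint is broad, with branch companies in almost every region of mainland China. Guangdong Province, Evergrande Group's home base, has two branches of its own, the Pearl River Delta company (珠三角公司) and the Shenzhen company (深圳公司), with 120 projects between them, making it the clear leader!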
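
The tallying step is not shown in the post, so here is a minimal sketch of how the counts per branch and per province could be derived from the two lists built in the loop. The sample data is made up purely for illustration, and BRANCH_TO_PROVINCE is an assumed mapping that covers only the two Guangdong branches named above.

```python
from collections import Counter

# Hypothetical sample of the two parallel lists built by the loop.
all_region = ["珠三角", "深圳", "华东", "珠三角", "深圳"]
all_estate = ["项目A", "项目B", "项目C", "项目D", "项目E"]

# Assumed mapping from branch name to province; unlisted branches map to themselves.
BRANCH_TO_PROVINCE = {"珠三角": "广东", "深圳": "广东"}

per_branch = Counter(all_region)
per_province = Counter(BRANCH_TO_PROVINCE.get(r, r) for r in all_region)

print(len(all_estate))               # total number of projects
print(per_branch.most_common())      # projects per branch, largest first
print(per_province.most_common(1))   # province with the most projects
```

With pandas, which the script already imports, the same per-branch tally could be written as `pd.DataFrame({"region": all_region, "estate": all_estate}).groupby("region").size()`.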

That's all for this post. If you found it useful, don't forget to give it a like~

(P.S. To get the complete program and data, reply "恒大楼盘" to our official account~)
