Python Web Scraping for Beginners: Lianjia Housing Prices

Author: DY人士 | Source: published 2019-02-20 09:04


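Background

Housing prices have been soaring lately, so I wanted to write a script to analyze price data for the cities I follow. Several languages would work for a crawler: Python, JavaScript, Java, and others. Python has been very popular recently and looks set to overtake Java, so I picked up the basics and learned the syntax as I went (Python's syntax turns out to be remarkably free-flowing). If you'd rather skip the long explanation, the complete code can be downloaded here.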

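Using XPath

XPath is a language for locating information in an XML document. It uses path expressions to select a single node or a set of nodes.

[Figure: node selection syntax]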

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
 <title lang="eng">你好</title>
 <price>1111</price>
</book>

<book>
 <title lang="eng">世界</title>
 <price>2222</price>
</book>
</bookstore>

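[Figure: path selection syntax]

Given the XML document above, the text "世界" ("world") can be obtained with /bookstore/book[2]/title/text(), or with /bookstore/book[2]/title[@lang='eng']/text(). Note that title is a child element, so it is written as title; the @ prefix is reserved for attributes such as lang.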
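As a quick check of the path expressions above, here is a minimal sketch using Python 3's standard xml.etree.ElementTree module. It supports only a subset of XPath (there is no text() step, so the text is read from the .text attribute instead). The XML declaration is omitted and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Same bookstore document as above, without the XML declaration
XML_DOC = """
<bookstore>
  <book>
    <title lang="eng">你好</title>
    <price>1111</price>
  </book>
  <book>
    <title lang="eng">世界</title>
    <price>2222</price>
  </book>
</bookstore>
"""

root = ET.fromstring(XML_DOC.strip())  # root is the <bookstore> element

# Positional predicate: the <title> of the second <book>
second_title_text = root.find("./book[2]/title").text

# Attribute predicate: every <title> whose lang attribute equals 'eng'
eng_title_texts = [t.text for t in root.findall("./book/title[@lang='eng']")]

print(second_title_text)
print(eng_title_texts)
```

Because both titles carry lang="eng", the attribute predicate alone does not single out one book; the positional predicate book[2] is what narrows the match.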

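Writing these paths by hand is tedious. Fortunately, Chrome can copy an element's XPath for you: open the developer tools, right-click the element in the Elements panel, and choose Copy > Copy XPath.

[Figure: Chrome developer tools]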

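Starting the crawl

Without further ado: build request headers that imitate a browser, fetch the response with urllib2.urlopen (Python 2; in Python 3 the same function lives in urllib.request), and then parse the response into a form that XPath can query.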

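# Python 2 (urllib2); requires the third-party lxml package
import urllib2
from lxml import html

url = 'https://sh.lianjia.com/ershoufang/pg1'
# Build client headers that imitate a browser
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'}
req = urllib2.Request(url=url, headers=headers)
response = urllib2.urlopen(req)
rst = response.read().decode("utf-8")
# Parse the HTML text into an lxml element tree that supports XPath
rst = html.fromstring(rst)
# Each <li> matched by this path is one listing
path = '/html/body/div[4]/div[1]/ul/li'
parseHtml(path, rst)  # parseHtml is defined in the next section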
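For readers on Python 3, the request-building step can be sketched with urllib.request as below. The helper names build_request and fetch are hypothetical, and the pgN pagination pattern is an assumption inferred from the pg1 in the URL above. Only the Request object is constructed here; fetch is defined but not called, since it needs network access.

```python
import urllib.request

# pgN pagination pattern is assumed from the 'pg1' URL used in the article
BASE_URL = "https://sh.lianjia.com/ershoufang/pg%d"
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36"
)


def build_request(page):
    """Return a Request for the given result page with a browser-like User-Agent."""
    return urllib.request.Request(
        url=BASE_URL % page,
        headers={"User-Agent": USER_AGENT},
    )


def fetch(req, timeout=10):
    """Send the request and return the decoded HTML text (needs network access)."""
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response.read().decode("utf-8")


req = build_request(1)
print(req.full_url)
# Request stores header names capitalized, hence "User-agent"
print(req.get_header("User-agent"))
```

Keeping the request construction separate from the network call makes it easy to loop over several pages and to inspect the headers before anything is sent.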

Parsing the required fields

def parseHtml(path, rst):
    # One element per listing <li>
    datas = rst.xpath(path)
    for data in datas:
        # Paths start with './' so they are resolved relative to the current <li>;
        # an absolute '/html/...' path would return the first listing every time
        title=data.xpath('./div[1]/div[1]/a/text()')
        area=data.xpath('./div[1]/div[3]/div/a/text()')
        houseInfo=data.xpath('./div[1]/div[2]/div/text()')
        flood=data.xpath('./div[1]/div[3]/div/text()')  # presumably the floor information
        followInfo=data.xpath('./div[1]/div[4]/text()')
        totalPrice=data.xpath('./div[1]/div[6]/div[1]/span/text()')
        unitPrice=data.xpath('./div[1]/div[6]/div[2]/span/text()')
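When looping over listing nodes, each field lookup should be relative to the current node. In lxml, an expression that begins with "/" is evaluated from the document root even when it is called on a sub-element, so it returns the same result on every iteration. The sketch below illustrates the per-node pattern with Python 3's standard xml.etree.ElementTree; the sample markup and the parse_listings helper are made up for illustration and do not reflect Lianjia's actual page structure.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup: one <li> per listing
SAMPLE = """
<ul>
  <li><div><a>Listing A</a><span>300</span></div></li>
  <li><div><a>Listing B</a><span>450</span></div></li>
</ul>
"""


def parse_listings(root):
    """Collect one dict per <li>, using paths relative to that <li>."""
    rows = []
    for li in root.findall("./li"):
        rows.append({
            "title": li.findtext("./div/a"),         # text of the link inside this <li>
            "totalPrice": li.findtext("./div/span"),  # text of the span inside this <li>
        })
    return rows


rows = parse_listings(ET.fromstring(SAMPLE.strip()))
for row in rows:
    print(row)
```

Returning a list of dicts, rather than assigning to local variables inside the loop, keeps the parsed fields available for later storage or charting.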

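Plotting the data with pyecharts

# pyecharts 0.x API; in pyecharts 1.x the chart classes moved to pyecharts.charts
from pyecharts import Pie

# Draw the district distribution chart.
# addressDict and CITY are globals defined elsewhere in the full script;
# each value in addressDict is a list whose length is plotted.
def drawAddressSheet():
    labels = []
    sizes = []
    for key in addressDict:
        labels.append(key)
        vList = addressDict[key]
        sizes.append(len(vList))
    drawBarGraph(sizes,labels,'%s district distribution'%CITY,'Listing count','district_distribution')

# Despite its name, this function draws a pie (ring) chart
def drawBarGraph(sizes,labels,title,category,imgName):
    pie = Pie(title,title_pos='center')
    pie.add(category, labels, sizes, radius=[40, 75],
        label_text_color=None,  # label font color
        is_label_show=True,
        legend_orient='vertical',
        legend_pos='left')
    # Writing a .png relies on the pyecharts-snapshot extension;
    # without it, render to an .html path instead
    pie.render(path='./%s.png' % imgName)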
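The article does not show how addressDict is filled. One plausible way, sketched below in plain Python 3, is to group the parsed listings by district and then derive the labels and sizes that the pie chart consumes. The sample records, the district names, and the helpers group_by_district and to_labels_and_sizes are all assumptions made for illustration.

```python
from collections import defaultdict

# Made-up sample records: (district, listing title)
listings = [
    ("Pudong", "Listing A"),
    ("Minhang", "Listing B"),
    ("Pudong", "Listing C"),
]


def group_by_district(records):
    """Map each district to the list of listing titles found in it."""
    grouped = defaultdict(list)
    for district, title in records:
        grouped[district].append(title)
    return dict(grouped)


def to_labels_and_sizes(address_dict):
    """Return parallel lists: district names (sorted) and listing counts."""
    labels = sorted(address_dict)
    sizes = [len(address_dict[name]) for name in labels]
    return labels, sizes


address_dict = group_by_district(listings)
labels, sizes = to_labels_and_sizes(address_dict)
print(labels, sizes)
```

Sorting the labels keeps the chart's slice order stable between runs, which makes charts from different crawls easier to compare.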

Attached are two of the resulting pie charts.