Defeating Background-Image Anti-Scraping: Scraping Ziroom Rental Prices

Author: 成长之路丶 | Published 2019-10-24 17:15

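The site we practice on this time is Ziroom (自如), a rental platform. Ziroom protects its rental prices with a CSS background-image anti-scraping technique: it uses an image containing the digits 0-9 in random order as a background image and then selects each digit through the background-position style. We need to work out the rule behind the offsets and use OCR to read the digits in the background image.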

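Target Analysis

Figure: rendered page
The same background-position offset always maps to the same price digit. The inline style also contains the URL of the background image, and background-size is 20px. We download the background image and then analyze how its digits relate to the offsets: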
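To make the inspection concrete, here is a minimal sketch of pulling the image URL and the offset out of one inline style. The sample style string, the example.com image URL, and the helper name parse_style are illustrative assumptions, not values copied from the live page.

```python
import re

# Hypothetical inline style, modelled on the description above.
SAMPLE_STYLE = (
    "background-image: url(//static.example.com/price/abc123.png);"
    "background-position: -171.2px"
)


def parse_style(style):
    """Return (image_url, offset_px) extracted from an inline style string."""
    url_match = re.search(r"background-image:\s*url\((.*?)\)", style)
    pos_match = re.search(r"background-position:\s*(-?[\d.]+)px", style)
    if url_match is None or pos_match is None:
        raise ValueError("style does not contain the expected properties")
    image_url = url_match.group(1).strip("'\"")
    # Protocol-relative URLs need a scheme before they can be downloaded.
    if image_url.startswith("//"):
        image_url = "http:" + image_url
    # The offset is returned as a non-negative distance in pixels.
    return image_url, abs(float(pos_match.group(1)))


print(parse_style(SAMPLE_STYLE))
```

The regular expression for the offset accepts an optional minus sign, so a style with an offset of 0px (the first digit in the image) parses as well.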

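Figure: background image dimensions

Figure: background image
The image is 300x28 and contains the digits 8670415923. It first has to be scaled proportionally so that its height is 20px, which gives 214.286x20 (300 × 20 / 28 ≈ 214.286). Each digit is therefore 214.286 / 10 ≈ 21.4286px wide, or 21.4px to one decimal place.

Now look at the offsets in the page source. The digit 2 has an offset of -171.2px, and 171.2 / 21.4 = 8. Because the first digit has an offset of 0, the digit 2 sits at position 8 + 1 = 9, and the 9th digit in the image is indeed 2. The digits 5 and 0 have offsets of -128.4px and -64.2px; 128.4 / 21.4 + 1 = 7 and 64.2 / 21.4 + 1 = 4, and the 7th and 4th digits in the image are indeed 5 and 0.

With the mapping rule established, the remaining step is to read the digits in the background image. Tesseract works for this: the image has no interference lines or noise, so recognition is essentially 100% accurate.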
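The arithmetic above can be checked with a short sketch. The dimensions, the digit string, and the three offsets come from the analysis above; the constant and function names are made up for illustration, and the index is 0-based, so it is one less than the position counted in the text.

```python
IMAGE_WIDTH = 300       # original image width in px
IMAGE_HEIGHT = 28       # original image height in px
RENDERED_HEIGHT = 20    # height after scaling
DIGITS = "8670415923"   # digits in the image, left to right


def digit_width():
    """Width of one digit after scaling, rounded to one decimal place."""
    scaled_width = IMAGE_WIDTH * RENDERED_HEIGHT / IMAGE_HEIGHT
    return round(scaled_width / len(DIGITS), 1)


def digit_at(offset_px):
    """Map a background-position offset to the digit it reveals."""
    # round() instead of int(): float division may land just below
    # the whole number, and truncation would then pick the wrong digit.
    index = round(abs(offset_px) / digit_width())
    return DIGITS[index]


for offset in (-171.2, -128.4, -64.2):
    print(offset, "->", digit_at(offset))
```

Running it maps the three offsets to 2, 5 and 0, matching the digits observed on the page.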

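Approach

First download the background image and use Tesseract to recognize the digits in it. Then extract the background-position offsets from the page source. Finally, apply the rule derived above to compute the real digits.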
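Raw OCR output usually carries whitespace such as a trailing newline, so it helps to normalize it before indexing into it. The sketch below is one possible way to do that, not the method used in this article: the helper names are hypothetical, and the --psm 7 (single text line) and tessedit_char_whitelist settings are optional Tesseract options whose effect depends on the installed Tesseract version.

```python
import re


def clean_digits(ocr_text, expected_length=10):
    """Strip everything except digits and check that all ten were read."""
    digits = re.sub(r"\D", "", ocr_text)
    if len(digits) != expected_length:
        raise ValueError(
            "expected %d digits, got %r" % (expected_length, digits)
        )
    return digits


def ocr_digits(image_path):
    """Run Tesseract on the background image and return the digit string."""
    # Imported lazily so clean_digits can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    # Assumed settings: treat the image as one line of text, digits only.
    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    raw = pytesseract.image_to_string(Image.open(image_path), config=config)
    return clean_digits(raw)


print(clean_digits("8670415923\n\x0c"))
```

Failing loudly when fewer than ten digits come back is a deliberate choice here: a silently shortened digit string would shift every index and produce plausible but wrong prices.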

# For brevity, only the first page is scraped.
import re
from urllib.request import urlretrieve

import pytesseract
import requests
from lxml import etree
from PIL import Image

# URL of the listing page
url = "http://sz.ziroom.com/z/"

headers = {
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
}

# Rendered width of one digit: 300 * 20 / 28 / 10, rounded to one decimal place
DIGIT_WIDTH = 21.4

response = requests.get(url, headers=headers)
html = etree.HTML(response.text)

# Take the background image URL from the first price span
# (the page is assumed to use one image for all prices)
styles = html.xpath('//div[@class="price"]/span[@class="num"]/@style')
bg_img_urls = re.findall(r'background-image: url\((.*?)\)', styles[0]) if styles else []

if bg_img_urls:
    # Download and OCR the background image once, outside the loop
    urlretrieve("http:" + bg_img_urls[0], 'background_img.png')
    text = pytesseract.image_to_string(Image.open('background_img.png'))
    # Keep digits only; OCR output also contains whitespace such as a trailing newline
    digits = [ch for ch in text if ch.isdigit()]

    for item in html.xpath('//div[@class="Z_list-box"]/div[@class="item"]'):
        title = item.xpath('.//h5/a/text()')[0]
        area = item.xpath('.//div[@class="desc"]/div[1]/text()')[0]
        # Remove newlines, tabs and spaces from the location text
        location = "".join(item.xpath('.//div[@class="desc"]/div[@class="location"]/text()')[0].split())
        tag = ", ".join(item.xpath('.//div[@class="tag"]/span/text()'))
        # Offsets are negative, except for the first digit, whose offset is 0
        offsets = [re.findall(r'background-position:\s*-?([\d.]+)px', style)[0]
                   for style in item.xpath('.//div[@class="price"]/span[@class="num"]/@style')]
        # 0-based index into the digit list; round() rather than int() because
        # float division can land just below the whole number
        indexes = [round(float(offset) / DIGIT_WIDTH) for offset in offsets]
        price = "￥" + "".join(digits[k] for k in indexes) + " per month and up"
        print(title, area, location, tag, price)
else:
    print("Failed to extract the background image URL!")