Web Scraping: Crawling the Xiaozhu Short-Term Rental Site (小猪短租)

Author: 泠泠七弦客 | Published: 2016-07-28 10:52


Result:

(Partial screenshot)

Code:

```python
import json
import time

import requests
from bs4 import BeautifulSoup


def sex_dis(content):
    """Map the host's icon class to 'boy' or 'girl'; return None if neither matches."""
    content = str(content)
    if 'boy' in content:
        return 'boy'
    elif 'girl' in content:
        return 'girl'
    else:
        return None


def get_link(url):
    """Collect the detail-page links found on one list (preview) page."""
    f = requests.get(url)
    soup = BeautifulSoup(f.text, 'lxml')
    links = soup.select('a.resule_img_a')
    link_list = []
    for link in links:
        link_content = link.get('href')
        link_list.append(link_content)
    return link_list


def get_details(url):
    """Scrape one detail page and return its fields as a dict (None if nothing matched)."""
    f = requests.get(url)
    soup = BeautifulSoup(f.text, 'lxml')
    time.sleep(1)  # pause between detail-page requests
    titles = soup.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > h4 > em')
    addresses = soup.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > p > span.pr5')
    rates = soup.select('#pricePart > div.day_l > span')
    imges = soup.select('#imgMouseCusor')
    host_imges = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')
    sexes = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > span')
    names = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')

    # A detail page describes a single listing, so the first match is returned.
    for title, address, rate, img, host_img, name, sex in zip(titles, addresses, rates, imges, host_imges, names, sexes):
        data = {
            'title': title.get_text(),
            'address': address.get_text().split('\n')[0],
            'rate': rate.get_text(),
            # the image URL sits between single quotes inside the style attribute
            'img': img.get('style').split('\'')[1],
            'host_img': host_img.get('src'),
            'name': name.get('title'),
            'sex': sex_dis(sex.get('class')),
        }
        return data
    return None


def save_to_text(content):
    """Append one line to the output file '租房信息' ("rental info")."""
    with open('租房信息', 'a', encoding='utf-8') as f:
        f.write(content)
        # '\n' is enough: text mode already converts it to the platform's line ending
        f.write('\n')


def main():
    # List pages 1 through 12; the page number is part of the URL.
    urls = ['http://zz.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 13)]
    for url in urls:
        link_list = get_link(url)
        for link in link_list:
            content = get_details(link)
            if content is None:
                continue  # nothing matched on this page, so skip it
            # I originally wrote str(content), but that is sloppy, so the record
            # is serialized as proper JSON instead.
            # ensure_ascii defaults to True, which writes non-ASCII characters
            # as \uXXXX escapes, so it has to be set to False.
            save_to_text(json.dumps(content, ensure_ascii=False))


if __name__ == '__main__':
    main()
```
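
The note about `ensure_ascii` can be checked on its own. The sketch below is only an illustration: the sample record and the `rentals_demo.txt` file name are made up. It writes one JSON object per line, the same layout `save_to_text` produces, and then reads the lines back.

```python
import json
import os
import tempfile

record = {'title': '小猪短租', 'rate': '298'}  # made-up sample record

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)
# ensure_ascii=False keeps the characters readable in the output.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)

# Write one JSON object per line, then load every line back.
path = os.path.join(tempfile.mkdtemp(), 'rentals_demo.txt')  # hypothetical file name
with open(path, 'a', encoding='utf-8') as f:
    f.write(readable)
    f.write('\n')

with open(path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f if line.strip()]
```

Both forms decode to the same data; the only difference is how readable the file is when opened in a text editor.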

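Pasting the code went smoothly this time.
Problems I ran into:
1. This site differs from the earlier exercises: the details can only be scraped after opening each listing's own page. Previously all the information sat on a single page and we just filtered it.
So there are two steps: first scrape the href of every detail page from the list (preview) page, since each of those URLs leads to a detail page, then call the detail-scraping function on each link.
The list pages follow a pattern, with the page number appearing in the URL. So we can build a list of list-page URLs, each list page gives a list of 12 detail pages, and the scraper is simply called in a loop.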
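
The two-step flow described above can be sketched independently of the site. This is a minimal outline, not the post's exact code: `build_list_urls`, `crawl`, and the two `fake_*` functions are hypothetical names introduced here, and the fake functions stand in for real network calls so the flow can be tried offline.

```python
def build_list_urls(template, first_page, last_page):
    """Fill the page number into the URL template for every list page."""
    return [template.format(page) for page in range(first_page, last_page + 1)]


def crawl(list_urls, get_links, get_details):
    """Step 1: collect detail links per list page. Step 2: scrape each detail page."""
    records = []
    for list_url in list_urls:
        for detail_url in get_links(list_url):
            record = get_details(detail_url)
            if record is not None:
                records.append(record)
    return records


# Offline stand-ins (hypothetical) so nothing is fetched from the network.
def fake_get_links(list_url):
    return [list_url + 'room-a', list_url + 'room-b']


def fake_get_details(detail_url):
    return {'url': detail_url}


urls = build_list_urls('http://zz.xiaozhu.com/search-duanzufang-p{}-0/', 1, 12)
records = crawl(urls[:2], fake_get_links, fake_get_details)
print(len(urls), len(records))
```

To run it against the real site, the post's own `get_link` and `get_details` would be passed in place of the two stand-ins.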

No major problems; it is the same routine as before.

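This exercise did not need user authentication, so here is a summary of how to handle it:
* Some information is shown only after logging in, so we have to send requests that look like they come from a logged-in user.
1. Log in first, then right-click and choose "Inspect".

![Detailed steps: User-Agent is listed below Cookie](https://img.haomeiwen.com/i2582504/662fd5699da1fef4.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
2. Build a dictionary: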

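```python
headers = {
    'User-Agent': 'the User-Agent value you copied',
    'Cookie': 'the Cookie value you copied',
}
```

Then send it along with the request:

```python
f = requests.get(url, headers=headers)
```

That is all it takes for the request to pass as coming from a real, logged-in person.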
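
Copying header values by hand is easy to get wrong, so one option is to paste the raw header lines and let a small helper build the dictionary. This is a convenience sketch, not part of the original workflow: `headers_from_text` is a hypothetical helper, and the header values are placeholders to be replaced with the ones copied from your own browser.

```python
def headers_from_text(raw):
    """Turn 'Name: value' lines, as copied from the browser, into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, lines without a colon, and ':authority'-style pseudo-headers.
        if not line or ':' not in line or line.startswith(':'):
            continue
        name, value = line.split(':', 1)  # split on the first colon only
        headers[name.strip()] = value.strip()
    return headers


# Placeholder values; paste your own from the browser.
copied = """
User-Agent: Mozilla/5.0 (placeholder)
Cookie: sessionid=placeholder; other=value
"""
headers = headers_from_text(copied)
print(headers)
```

The resulting dictionary can be passed to `requests.get(url, headers=headers)` in the same way as a hand-written one.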

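* Some sites have anti-scraping measures that make scraping harder, and what you get back may differ from one day to the next.
A common workaround is to have the browser emulate a mobile user and scrape the mobile pages instead,
because the pages served to mobile users are usually not locked down as tightly. The steps:

![Emulating a mobile user](https://img.haomeiwen.com/i2582504/60cd80595a6c0c01.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
OK. From there, the same routine as before applies.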
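
In code, emulating a mobile user usually comes down to sending a mobile User-Agent. The sketch below assumes that the site chooses its mobile pages based on that header, which is not guaranteed. `as_mobile` is a hypothetical helper, and the User-Agent string is only an example value; copy the one your browser's device-emulation mode actually sends.

```python
# Example iPhone User-Agent string (an assumption for illustration).
MOBILE_USER_AGENT = (
    'Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) '
    'AppleWebKit/601.1.46 (KHTML, like Gecko) '
    'Version/9.0 Mobile/13B143 Safari/601.1'
)


def as_mobile(headers=None):
    """Return a copy of the headers with the User-Agent swapped for a mobile one."""
    mobile = dict(headers or {})
    mobile['User-Agent'] = MOBILE_USER_AGENT
    return mobile


desktop = {'User-Agent': 'desktop placeholder', 'Cookie': 'placeholder'}
mobile = as_mobile(desktop)
print(mobile['User-Agent'])
```

The result would be passed as `requests.get(url, headers=mobile)`. The mobile page is likely to have different HTML, so the CSS selectors would need to be taken from the mobile version of the page.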