I scraped the details of 300 listings. Determining the host's gender was solved quickly, but getting the link to the listing photo took some effort; for now I used string slicing on the inline style. When grabbing the 300 detail-page links I couldn't locate them at first, but later found the right selector, appended them in a loop, and stopped once the count passed 300.
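Because the photo URL is buried inside an inline style attribute, the hard-coded slice in the code below depends on the exact prefix length and can silently break. A regex is a more tolerant way to pull out whatever sits inside url(...); this is only a minimal sketch, and the style string here is an invented example of the assumed format, not copied from the live page:

import re

# Invented example of the inline style on the detail page (assumed format).
style = "background-image:url('http://image.xiaozhustatic1.com/12/3,0,90,demo.jpg')"

# Capture the content of url(...), tolerating optional quotes around the URL.
match = re.search(r"url\(['\"]?(.*?)['\"]?\)", style)
if match:
    housepic = match.group(1)
    print(housepic)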
My results
[Screenshot of the scraped output: 屏幕快照 2016-08-30 下午10.46.41.jpg]
My code
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36'}

def get_links():
    link_list = []
    urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 20)]
    for url in urls:
        # stop turning pages once more than 300 detail links have been collected
        if len(link_list) > 300:
            break
        wb_data = requests.get(url, headers=headers)
        soup = BeautifulSoup(wb_data.text, 'lxml')
        # each listing block carries its detail-page URL in the detailurl attribute
        links = soup.select('#page_list > ul > li > div.result_btm_con.lodgeunitname')
        for link in links:
            link_list.append(link.get('detailurl'))
    return link_list

def get_info(url):
    # url = 'http://bj.xiaozhu.com/fangzi/4131080529.html'
    # url = 'http://bj.xiaozhu.com/fangzi/3828318529.html'
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > h4 > em')
    areas = soup.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > p > span')
    prices = soup.select('#pricePart > div.day_l > span')
    housepics = soup.select('#imgMouseCusor')
    hostimgs = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')
    hostnames = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')
    hostsexes = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > div')
    for title, area, price, housepic, hostimg, hostname, hostsex in zip(titles, areas, prices, housepics, hostimgs, hostnames, hostsexes):
        # .get('class') returns a list, so test membership instead of comparing to a string
        sex = 'male' if 'member_ico1' in hostsex.get('class') else 'female'
        style = housepic.get('style')
        data = {
            'title': title.get_text(),
            'area': area.get_text(),
            'price': price.get_text(),
            # slice the inline background-image:url(...) style down to the bare image URL
            'housepic': 'http://bj.xiaozhu.com' + style[16:len(style) - 2],
            'hostimg': hostimg.get('src'),
            'hostname': hostname.get_text(),
            'hostsex': sex
        }
        print(data)

link_list = get_links()
for i in link_list[:300]:
    get_info(i)
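On the gender judgment: the detail page marks the host with a small icon div, and the script above reads its class attribute. One quirk is that BeautifulSoup hands back the class attribute as a list rather than a string, so a membership test is needed. A minimal standalone sketch, with the HTML snippet invented for illustration and the same class-to-gender mapping as the script above (the real page's classes may differ):

from bs4 import BeautifulSoup

# Invented snippet mimicking the host-icon markup.
html = '<div class="member_pic"><div class="member_ico1"></div></div>'
soup = BeautifulSoup(html, 'lxml')

icon = soup.select_one('div.member_pic > div')
# .get('class') gives a list such as ['member_ico1'], not a plain string
sex = 'male' if 'member_ico1' in icon.get('class') else 'female'
print(sex)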
Summary
- I still need to check the reference answer for how to grab the listing photos from the detail page.
- For level 2, build one function that collects the links, stores them in a list, and returns it, and a second function that scrapes the details one by one.
- Grabbing the 300 links takes two nested loops: one to turn the search pages and one to store each page's links (see the sketch after this list).
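The two-layer structure can be isolated as below: the outer loop turns the search-result pages, the inner loop stores every link found on the current page, and paging stops once 300 links are in hand. This is only a sketch; the function name collect_links is made up, and it assumes the same selector and detailurl attribute as the script above:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

def collect_links(max_links=300):
    link_list = []
    page = 1
    # outer loop: keep turning search-result pages until enough links are collected
    while len(link_list) < max_links:
        url = 'http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(page)
        soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')
        items = soup.select('#page_list > ul > li > div.result_btm_con.lodgeunitname')
        if not items:  # no more results, stop early
            break
        # inner loop: store every detail-page link found on this page
        for item in items:
            link_list.append(item.get('detailurl'))
        page += 1
    return link_list[:max_links]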