Python Study Notes (3): Scraping Rental Listings

Author: 8907a9c3d98f | Published: 2016-08-09 21:25 | Read 0 times

    My code

    from bs4 import BeautifulSoup
    import requests
    
    # The search-result URLs differ only in the page number after "p"; range(1, 10) covers pages 1 through 9
    urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1,10)]
    
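    # Map the class of the host's gender icon to a readable label.
    # get('class') returns a list, so the comparison is against a one-element list.
    def get_lorder_sex(class_name):
        if class_name == ['member_girl_ico']:
            return 'female'
        elif class_name == ['member_boy_ico']:
            return 'male'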
    
    # Collect the listing links on one search-results page and scrape each listing
    def get_links(url):
        wq_data = requests.get(url)
        soup = BeautifulSoup(wq_data.text,'lxml')
        links = soup.select('#page_list > ul > li > a')
        for link in links:
            href = link.get('href')
            get_attraction(href)
    
    def get_attraction(url,data=None):
        wb_data = requests.get(url)
    
        # Parse the fetched listing page with the lxml parser
        soup = BeautifulSoup(wb_data.text,'lxml')
    
        # In Chrome, open the page, right-click the target element, choose Inspect, use Copy CSS path, then strip the :nth-child() parts
        titles = soup.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > h4 > em')
        adds = soup.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > p > span')
        rends = soup.select('div.day_l > span')
        imgs = soup.select('div.pho_show_l > div.pho_show_big > div > img')
        img_householders =  soup.select('div.js_box.clearfix > div.member_pic > a > img')
        names = soup.select('div.js_box.clearfix > div.w_240 > h6 > a')
        genders = soup.select('div.js_box.clearfix > div.w_240 > h6 > span')
    
        for title,add,rend,img,img_householder,name,gender in zip(titles,adds,rends,imgs,img_householders,names,genders):
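            # Extract content from the tags: get_text() returns the text, get() returns an attribute value.
            # get('class') returns a list because class is multi-valued; get('src') returns a string.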
            data = {
                'title':title.get_text(),
                'add':add.get_text(),
                'rend':rend.get_text(),
                'img':img.get('src'),
                'img_householder':img_householder.get('src'),
                'name':name.get_text(),
                'gender':get_lorder_sex(gender.get('class'))
            }
            print(data)
    
    for single_url in urls:
        get_links(single_url)
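
The loop above requests one search page after another with no pause. A minimal sketch of spacing the calls out with time.sleep() follows. The helper name crawl_politely and its parameters (fetch, min_delay, max_delay, sleep) are hypothetical and not part of the original script, and the delay values are arbitrary assumptions rather than limits published by the site.

```python
import random
import time


def crawl_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    fetch is any callable that takes a URL (hypothetical parameter);
    sleep is injectable so the helper can be exercised without waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With the script above, crawl_politely(urls, get_links) could stand in for the final for loop. It only spaces out the search-page requests; the listing-page requests made inside get_links would need their own pause.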
    

    Summary

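    • Use time.sleep() to pause between requests so the crawl is less likely to trigger the site's anti-scraping measures
    • BeautifulSoup's get() returns a list for multi-valued attributes such as class, and a string for ordinary attributes such as src or href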
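
The second point can be checked on a small fragment. This is a sketch only: the HTML below is a made-up stand-in for a listing page, the helper gender_from_classes is hypothetical, and the stdlib 'html.parser' is used in place of lxml so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the host block of a listing page.
html = '''
<h6>
  <a href="/fangdong/1/">Host A</a>
  <span class="member_girl_ico"></span>
</h6>
<img src="http://example.com/a.jpg">
'''

soup = BeautifulSoup(html, 'html.parser')

span = soup.select('h6 > span')[0]
img = soup.select('img')[0]

class_value = span.get('class')  # multi-valued attribute: a list of strings
src_value = img.get('src')       # ordinary attribute: a plain string
missing = img.get('alt')         # attribute not present: None


def gender_from_classes(classes):
    """Hypothetical helper: a membership test still works if extra classes appear."""
    classes = classes or []
    if 'member_girl_ico' in classes:
        return 'female'
    if 'member_boy_ico' in classes:
        return 'male'
    return None


print(class_value, src_value, missing, gender_from_classes(class_value))
```

Testing membership instead of comparing against a one-element list keeps the lookup working if the site ever adds a second class to the icon.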
