Python Action Plan Study Notes (4): Web Page Parsing Assignment


Author: 如恒河沙 | Published 2016-08-26 00:39

    Week 1, Lesson 3 practice project

    Summary

    • The BeautifulSoup library makes it easy to process web pages.
    • The basic pattern: use select() to locate elements, then extract data with get() and get_text().
    • Sending a User-Agent and Cookie can "fool" the server into treating the crawler as a normal browser.
    • When crawling, add delays between requests to avoid revealing that you are a bot.
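    The select()/get()/get_text() pattern from the summary can be sketched on a static snippet, with no network access. This is a minimal sketch: the HTML and the example.com URL are made up, and the built-in 'html.parser' is used so lxml is not required.

    ```python
    from bs4 import BeautifulSoup

    # Hypothetical HTML standing in for a downloaded page
    html = '''
    <div class="pho_info"><h4><em>Cozy room near the subway</em></h4></div>
    <img class="room_pic" src="http://example.com/room.jpg">
    '''

    soup = BeautifulSoup(html, 'html.parser')
    title = soup.select('div.pho_info > h4 > em')[0]  # select() locates elements by CSS selector
    photo = soup.select('img.room_pic')[0]

    print(title.get_text())  # text content -> Cozy room near the subway
    print(photo.get('src'))  # attribute value -> http://example.com/room.jpg
    ```

    get_text() returns the element's visible text, while get() reads an attribute such as src or href, which is exactly how the tasks below pull titles and photo URLs.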

    Tasks

    (1) Scrape one listing from the Xiaozhu short-term rental site (bj.xiaozhu.com).
    (2) Scrape 300 listings from the Xiaozhu short-term rental site.
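    The summary mentions User-Agent and Cookie headers, but the code below never actually sets them. A hedged sketch of how they would be passed to requests (the header values here are placeholders, not real credentials, and the request itself is left commented out so nothing hits the network):

    ```python
    # Example browser-style headers; the values are placeholders, not real credentials.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Cookie': 'session=example',
    }

    # They would be passed to requests like this:
    #   web_data = requests.get(url, headers=headers)
    print(sorted(headers))  # the two header names, sorted
    ```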

    Task 1 code

    from bs4 import BeautifulSoup
    import requests
    
    url = 'http://bj.xiaozhu.com/fangzi/3686435130.html'
    
    web_data = requests.get(url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    
    # Locate each field with a CSS selector
    titles = soup.select('div.pho_info > h4 > em')
    addresses = soup.select('span.pr5')
    prices = soup.select('div.day_l')
    room_photos = soup.select('div.pho_show_l > div > div:nth-of-type(2) > img')
    lord_photos = soup.select('div.member_pic > a > img')
    lord_names = soup.select('a.lorder_name')  # 'lorder' is the site's own class name
    lord_genders = soup.select('div.js_box.clearfix > div.w_240 > h6 > span')
    
    for title, address, price, room_photo, lord_photo, lord_name, lord_gender in zip(
            titles, addresses, prices, room_photos, lord_photos, lord_names, lord_genders):
        # The host's gender is encoded in the CSS class of the badge icon
        if lord_gender.get('class')[0] == 'member_girl_ico':
            lord_gender = '美女'  # female host
        else:
            lord_gender = '帅哥'  # male host
        data = {
            'title': title.get_text(),
            'address': address.get_text().split('\n')[0],
            'price': price.get_text()[1:],  # drop the leading currency symbol
            'room_photo': room_photo.get('src'),
            'lord_photo': lord_photo.get('src'),
            'lord_name': lord_name.get_text(),
            'lord_gender': lord_gender
        }
        print(data)
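    One caveat with the zip() pattern used above: zip() stops at the shortest input, so if one selector matches nothing, the whole row is silently dropped rather than raising an error. A quick illustration with made-up data:

    ```python
    titles = ['Room A', 'Room B']
    prices = ['398']  # imagine the price selector only matched once

    rows = list(zip(titles, prices))
    print(rows)       # [('Room A', '398')] -- 'Room B' vanishes silently
    print(len(rows))  # 1
    ```

    That behavior is convenient here, since a detail page has exactly one listing, but it can hide scraping failures when selectors go stale.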
    

    Task 1 output

    (screenshot: 1.jpg)

    Task 2 code

    from bs4 import BeautifulSoup
    import requests
    import time
    
    # 13 search-result pages, roughly 24 listings each -- about 300 listings in total
    urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 14)]
    
    def get_page(url):
        """Collect the detail-page URL of every listing on one search-result page."""
        link_list = []
        web_data = requests.get(url)
        soup = BeautifulSoup(web_data.text, 'lxml')
        links = soup.select('ul > li > div.result_btm_con.lodgeunitname')
        for link in links:
            link_list.append(link.get('detailurl'))
        return link_list
    
    def get_info_from_page(url):
        """Scrape one listing's detail page and return its fields as a dict."""
        web_data = requests.get(url)
        soup = BeautifulSoup(web_data.text, 'lxml')
        titles = soup.select('div.pho_info > h4 > em')
        addresses = soup.select('span.pr5')
        prices = soup.select('div.day_l')
        room_photos = soup.select('div.pho_show_l > div > div:nth-of-type(2) > img')
        lord_photos = soup.select('div.member_pic > a > img')
        lord_names = soup.select('a.lorder_name')
        lord_genders = soup.select('div.js_box.clearfix > div.w_240 > h6 > span')
        data = {}  # stays empty if the selectors match nothing, instead of raising NameError
        for title, address, price, room_photo, lord_photo, lord_name, lord_gender in zip(
                titles, addresses, prices, room_photos, lord_photos, lord_names, lord_genders):
            if lord_gender.get('class')[0] == 'member_girl_ico':
                lord_gender = '美女'  # female host
            else:
                lord_gender = '帅哥'  # male host
            data = {
                'title': title.get_text(),
                'address': address.get_text().split('\n')[0],
                'price': price.get_text()[1:],
                'room_photo': room_photo.get('src'),
                'lord_photo': lord_photo.get('src'),
                'lord_name': lord_name.get_text(),
                'lord_gender': lord_gender
            }
        return data
    
    
    links_to_read = []
    room_info = []
    for single_url in urls:
        links_to_read = links_to_read + get_page(single_url)
        time.sleep(4)  # pause between requests to avoid being detected and blocked
    
    print('Collected', len(links_to_read), 'listing links')
    
    for crawl_url in links_to_read:
        room_info.append(get_info_from_page(crawl_url))
        time.sleep(4)
    
    print(room_info)
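    The Task 2 crawl starts from a list of search-result page URLs built with str.format() and range(); with roughly 24 listings per page, 13 pages cover the ~300-listing target. The URL-generation pattern in isolation:

    ```python
    # Build the 13 search-result page URLs for bj.xiaozhu.com
    urls = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i)
            for i in range(1, 14)]  # range(1, 14) yields 1 through 13

    print(len(urls))  # 13
    print(urls[0])    # http://bj.xiaozhu.com/search-duanzufang-p1-0/
    ```

    Note that range(1, 14) excludes the upper bound, a common off-by-one trap when translating "pages 1 to 13" into code.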
    

    Task 2 output

    (screenshot: 1.jpg)


          Article title: Python Action Plan Study Notes (4): Web Page Parsing Assignment

          Article link: https://www.haomeiwen.com/subject/rgbasttx.html