Python Four-Week Practice 1.03: Scraping Web Pages

Author: 白狼小将 | Source: published 2016-08-12 16:16, read 23 times


Scraping this site took far longer than I expected. Looking back, the main problems were:

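I lost a lot of time on fetching the list pages, and in the end I realized I had been heading in the wrong direction. All I needed at that point was to fetch the list pages, not to do everything in one go by walking the list pages and pulling out the detail-page links at the same time. The two steps should have been kept separate.
Writing a program is a highly structured thinking process. It helps to have a rough plan before starting. The plan can change along the way, but with a flowchart the thinking becomes much clearer.
The other problem was scraping the image: I kept getting None. I assumed the site was blocking scrapers and switched to the mobile version, only to find that its URLs were different, and I was stumped. It turned out that when I used "Inspect" in the browser, the element I had selected was the left/right arrow overlaid on the image, not the image itself, so its src attribute was naturally empty. I need to be more careful and check that the selected element is really the one I want: not everything that carries a link is the image.
My gender check was poorly written. Comparing it with the reference answer, I saw that the element's class can be read directly with the get method. I had thought about using the class but did not think of doing it this way. Also, my if...else structure is not sound: some hosts have not set a gender, and my scraper labels them as male. Using elif, as the answer does, handles this much better. The answer's version: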

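def get_lorder_sex(class_name):
    # class_name is the list returned by tag.get("class")
    if class_name == ['member_boy_ico']:
        return 'male'
    elif class_name == ['member_girl_ico']:
        return 'female'
    # any other value (no gender set) falls through and returns None

# used as one entry of the result dict:
"sex": get_lorder_sex(sex.get("class"))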

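The same idea can also be written as a lookup table with an explicit fallback, so that a host with no gender set is reported as such instead of silently turning into None or male. This is only a minimal sketch: the names SEX_BY_CLASS and classify_sex and the 'unknown' label are hypothetical, while the two class names are the ones used in the snippet above.

```python
# Hypothetical helper: map the icon's class name to a label.
SEX_BY_CLASS = {
    'member_boy_ico': 'male',
    'member_girl_ico': 'female',
}

def classify_sex(class_list):
    """class_list is what tag.get("class") returns: a list of names, or None."""
    for name in class_list or []:
        if name in SEX_BY_CLASS:
            return SEX_BY_CLASS[name]
    # Assumed label for hosts who have not set a gender.
    return 'unknown'

print(classify_sex(['member_boy_ico']))
print(classify_sex(['member_girl_ico']))
print(classify_sex(None))
```

Handling None as well as an unmatched class means the helper does not fail when the icon element is missing from the page.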
Code:

from bs4 import BeautifulSoup
import requests

# List pages 1 to 12 of the search results
urls = ("http://bj.xiaozhu.com/search-duanzufang-p{}-0/".format(i) for i in range(1, 13))

def get_link(url):
    # Fetch one list page and scrape every detail page it links to
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('#page_list > ul > li > a')
    for link in links:
        href = link.get('href')
        get_detail_info(href)

def get_detail_info(url, data=None):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')

    titles = soup.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > h4')
    adds   = soup.select('body > div.wrap.clearfix.con_bg > div.con_l > div.pho_info > p > span')
    prices = soup.select('#pricePart > div.day_l > span')
    imgs   = soup.select('#curBigImage')
    owners = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')
    males  = soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > div.member_ico1')
    names  = soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')

    # Known flaw (see the notes at the top): a div.member_ico1 match is treated
    # as a female host, so a host with no gender set is labeled male.
    if males:
        member = 'female'
    else:
        member = 'male'

    # member is a plain string, not a list of tags, so it stays out of zip()
    for title, add, price, img, owner, name in zip(titles, adds, prices, imgs, owners, names):
        data = {
            'title': title.get_text(),
            'add': add.get_text(),
            'price': price.get_text(),
            'img': img.get('src'),
            'owner': owner.get('src'),
            'member': member,
            'name': name.get_text(),
        }
        print(data)

for single_url in urls:
    get_link(single_url)
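
One way to keep the list-page step and the detail-page step apart, as the notes at the top suggest, is to collect all detail links first and only then visit them. The sketch below shows just that structure and is not the course's reference solution. The names collect_links, scrape_details, fetch, extract_links and parse_detail are hypothetical, and fetching and parsing are passed in as functions so the example runs without network access. In the script above, fetch would wrap requests.get(url).text and extract_links would wrap the soup.select call on '#page_list > ul > li > a'.

```python
def collect_links(list_urls, fetch, extract_links):
    """Stage 1: visit only the list pages and gather detail-page links."""
    links = []
    for url in list_urls:
        links.extend(extract_links(fetch(url)))
    return links

def scrape_details(detail_urls, fetch, parse_detail):
    """Stage 2: visit each detail page and parse it into a dict."""
    return [parse_detail(fetch(url)) for url in detail_urls]

# Stand-in pages and functions so the sketch runs offline (all made up).
FAKE_PAGES = {
    'list-1': 'detail-a detail-b',
    'list-2': 'detail-c',
    'detail-a': 'Room A',
    'detail-b': 'Room B',
    'detail-c': 'Room C',
}

def fake_fetch(url):
    return FAKE_PAGES[url]

def fake_extract_links(page_text):
    return page_text.split()

def fake_parse_detail(page_text):
    return {'title': page_text}

links = collect_links(['list-1', 'list-2'], fake_fetch, fake_extract_links)
rooms = scrape_details(links, fake_fetch, fake_parse_detail)
print(links)
print(rooms)
```

Because stage 1 returns a plain list, the links can be printed and checked before any detail page is requested, which makes it easier to see which step is failing.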