Web Scraping in Practice: Day 3
Task
Scrape the rental listings (first three pages) from Xiaozhu's Beijing site (http://bj.xiaozhu.com/).
Result
Wrote the scraped listings into MongoDB, then queried for listings priced at 500/night or above.
Source code
from bs4 import BeautifulSoup
from pymongo import MongoClient
import requests

# List-page URLs for the first three pages of Beijing listings.
pages = ['http://bj.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 4)]

client = MongoClient('localhost', 27017)
xiao_zhu = client['xiao_zhu']
xiao_zhu_sheet = xiao_zhu['xiao_zhu_sheet']

def get_info(url):
    # Parse one listing page and insert its fields into MongoDB.
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    data = {
        'title': soup.select('div.pho_info > h4 > em')[0].get_text(),
        'address': soup.select('div.pho_info > p > span')[0].get_text().strip(),
        # Store the price as an int so range queries like $gte work later.
        'price': int(soup.select('#pricePart > div.day_l > span')[0].get_text()),
        # In Chrome this image link downloads instead of opening; IE opens it directly.
        'house_image': soup.select('#curBigImage')[0]['src'],
        'master_name': soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > a')[0]['title'],
        # The host's sex is encoded in the span's CSS class name.
        'master_sex': soup.select('#floatRightBox > div.js_box.clearfix > div.w_240 > h6 > span')[0]['class'][0].split('_')[1],
        'master_image': soup.select('#floatRightBox > div.js_box.clearfix > div.member_pic > a > img')[0]['src']
    }
    xiao_zhu_sheet.insert_one(data)

def get_url(start_url):
    # Collect the listing links from one list page.
    wb_data = requests.get(start_url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    urls = soup.select('#page_list > ul > li > a')
    return urls

for page in pages:
    urls = get_url(page)
    for url in urls:
        try:
            get_info(url['href'])
        except Exception:
            # Skip listings whose page structure doesn't match the selectors above.
            pass

# 'price' must be stored as a numeric type for these comparisons to work.
# Comparison operators: $lt/$lte/$gt/$gte/$ne (l = less, g = greater, e = equal, n = not).
# print(type(xiao_zhu_sheet.find({'price': {'$gte': 500}})[0])) shows each item is actually a dict.
for item in xiao_zhu_sheet.find({'price': {'$gte': 500}}):
    print(item)
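
The other comparison operators noted in the comment work the same way. A minimal sketch against the same xiao_zhu_sheet collection (the 300 bound and the count query are illustrative examples, not from the original run):

# Illustrative only: arbitrary example bounds on the same collection.
for item in xiao_zhu_sheet.find({'price': {'$lt': 300}}):       # price < 300
    print(item)
print(xiao_zhu_sheet.count_documents({'price': {'$ne': 500}}))  # how many are not exactly 500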
Takeaways
- To operate MongoDB from PyMongo, first establish a client connection (it feels a bit like conn in MySQL?), then use that connection from Python to create the specific db and collection; see the sketch after this list.
- For PyMongo syntax details, see: http://api.mongodb.com/python/current/
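
A minimal sketch of that connect → db → collection flow, assuming a local mongod on the default port; the demo_db/demo_sheet names are placeholders, not from this post:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)      # the client connection, loosely like conn in MySQL
demo_db = client['demo_db']                   # placeholder database name; created lazily on first write
demo_sheet = demo_db['demo_sheet']            # placeholder collection name
demo_sheet.insert_one({'price': 650})         # write one document
print(demo_sheet.find_one({'price': {'$gte': 500}}))  # read it back with a range query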