Scraping Product Information
Since 58's second-hand goods platform Zhuanzhuan went live, the scraping approach differs somewhat from the instructor's walkthrough:
- 58's new second-hand platform, Zhuanzhuan, now lists only Zhuanzhuan items
- It no longer distinguishes personal listings from business listings
- The view count is loaded together with the page itself, no longer fetched by a separate request
- The new detail page has no posting-time field, so it is not scraped
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# python3.5 vs python2.7
# 58zhuanzhuan
from bs4 import BeautifulSoup
import requests
import time


def geturls(urls):
    # Walk each listing page, collect item links, and scrape each detail page.
    for url in urls:
        webdata = requests.get(url)
        soup = BeautifulSoup(webdata.text, 'lxml')
        itemlist = soup.select('tr.zzinfo > td.t > a.t')
        # Category comes from the last breadcrumb link on the listing page.
        nav = getemtext(soup.select('div.nav a')[-1])
        for item in itemlist:
            itemurl = item.get('href')
            title = getemtext(item)
            get_target_info(itemurl, title, nav)
            time.sleep(1)  # be polite: pause between requests


def getemtext(element):
    return element.get_text().strip()


def get_target_info(url, title='', nav=''):
    wbdata = requests.get(url)
    soup = BeautifulSoup(wbdata.text, 'lxml')
    # title = soup.select('div.box_left > div > div > h1')
    looktime = soup.select('span.look_time')[0]
    price = soup.select('span.price_now i')[0]
    place = soup.select('div.palce_li i')[0]
    data = {
        'title': title,
        'nav': nav,
        'looktime': getemtext(looktime).strip(u'次浏览'),
        'price': getemtext(price),
        'place': getemtext(place)
    }
    # print(data)
    print(data['title'])
    print('price: ' + data['price'] + ', view: ' + data['looktime'] +
          ' times' + ', area: ' + data['place'])


if __name__ == "__main__":
    urls = ["http://bj.58.com/pbdn/0/pn{}/".format(pageid)
            for pageid in range(1, 14)]
    geturls(urls)
    # http://bj.58.com/tushu/pn2
```
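One subtlety in the extraction above: `str.strip(u'次浏览')` treats its argument as a *set of characters* to trim from both ends, not as a literal suffix, so it only works here because the view count itself contains none of those characters. A minimal sketch of the behavior, with `str.replace` as a safer alternative:

```python
# str.strip treats its argument as a set of characters to trim,
# not as an exact suffix string.
raw = '2560次浏览'

stripped = raw.strip('次浏览')   # trims any of 次/浏/览 from both ends
print(stripped)                  # -> 2560

# Safer: remove the exact suffix text instead.
replaced = raw.replace('次浏览', '')
print(replaced)                  # -> 2560
```

Both forms give the same result for this site's view-count text; `replace` just avoids surprises if the digits ever border one of the stripped characters.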
Sample Output
微软平板SURFACE RT
price: 1500, view: 2560 times, area: 北京-丰台
三星超薄平板,
price: 1200, view: 801 times, area: 北京-通州
iPad1代
price: 680, view: 1333 times, area: 北京-朝阳
转让iPadmini2带发票和包装盒子16G配件齐全体大
price: 1512, view: 355 times, area: 北京-海淀
95成新16G IPAD4(the new ipad) 第一代高清屏的ipad,现使用无卡顿...
price: 1299, view: 1998 times, area: 北京-通州
全新ipad 没有注册的 零磨损 看图吧
price: 1599, view: 1400 times, area: 北京-大兴
苹果iPad4代贱卖
price: 1200, view: 114 times, area: 北京-顺义
Summary
- The category and title are extracted from the listing page and passed as arguments to get_target_info(), saving extraction time on the detail page
- When printing the scraped result, a bare print(data) outputs the Chinese text as Unicode escape sequences (under Python 2), whereas print(data['title']) displays the Chinese characters correctly
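If you do want to print the whole dict with readable Chinese, one option (not part of the original script) is to serialize it with the standard library's `json` module; a minimal sketch with a hypothetical record in the same shape the scraper builds:

```python
import json

# Hypothetical record, same shape as the scraper's data dict.
data = {'title': '三星超薄平板', 'price': '1200', 'place': '北京-通州'}

# ensure_ascii=False keeps Chinese characters readable
# instead of escaping them to \uXXXX sequences.
print(json.dumps(data, ensure_ascii=False))
# -> {"title": "三星超薄平板", "price": "1200", "place": "北京-通州"}
```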