Results:
Links to the detail pages, as found on the list page:
Some of the information from a detail page:
My code:
from bs4 import BeautifulSoup
import requests, time
import pymongo

# Connect to the local MongoDB server and pick the database and collections.
client = pymongo.MongoClient("localhost", 27017)
phone_number = client["phone_number"]
sheet1 = phone_number["sheet1"]  # titles and detail-page links from the list pages
item_phone_number = phone_number["item_phone_number"]  # fields from each detail page

def get_links_from(channel, pages):
    # Example list-page URL: http://bj.58.com/shoujihao/pn2/
    list_view = "{}pn{}/".format(channel, str(pages))
    wb_data = requests.get(list_view)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    titles = soup.select("strong.number")
    links = soup.select("li > a.t")
    for title, link in zip(titles, links):
        # Drop the query string so that only the bare detail-page URL is stored.
        data = {"title":title.text, "link":link.get("href").split('?')[0]}
        sheet1.insert_one(data)
        print(data)

def get_info_from(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, "lxml")
    # The title contains spaces, tabs and newlines: split it, then join it back.
    title1 = soup.select("h1")[0].text.split()
    title = " ".join(title1)
    date = soup.select("li.time")[0].text.strip()
    price = soup.select("div.su_con > span")[0].text.strip()
    sellor = soup.select("ul > ul > li > a")[0].text
    data = {
        'title' : title,
        'date' : date,
        'price' : price,
        'sellor': sellor,
    }
    item_phone_number.insert_one(data)
    print(data)

# Collect the links from list page 2, then scrape every stored detail page.
get_links_from("http://bj.58.com/shoujihao/", 2)
for item in sheet1.find():
    get_info_from(item["link"])
'''
To get more information:
numbers = soup.select("div.hm_1 > span > a")[3:]
types = soup.select("ul > li > div.hm_2 > span")[3:]
prices = soup.select("div.hm_3 > span")[3:]
for number, type_, price in zip(numbers, types, prices):
    print(number.text, type_.text, price.text+'元')
'''
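Summary:
-1 When the text inside a tag contains spaces, tabs or newlines, first break it into pieces with split(), then put the pieces back together with join().
-2 Keep the code as simple as possible; it is best to avoid loops nested inside loops and dictionaries nested inside dictionaries.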
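The first tip can be sketched in isolation as follows. This is a minimal illustration, not part of the scraper itself: the helper name `normalize_whitespace` and the sample string are made up for the example. Calling `split()` with no arguments splits on any run of whitespace and discards empty pieces, so joining the pieces with a single space collapses spaces, tabs and newlines in one step.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


# Made-up sample resembling the text of an <h1> tag with messy whitespace
raw = "  138 0013\t8000 \n  Beijing   Mobile \n"
print(normalize_whitespace(raw))  # prints: 138 0013 8000 Beijing Mobile
```

This is the same pattern that `get_info_from` applies to the `h1` title; `strip()` alone would only remove whitespace at the two ends of the string.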
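Note: only the upper part of the detail page is scraped here. To scrape the lower part as well, you can create a separate collection (table) and store those results in it.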
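One way the note above might be put into practice is sketched below. It assumes the three text lists have already been pulled out of the page with the selectors shown in the commented-out block, and that `collection` is a pymongo collection. The helper names `build_rows` and `store_rows` and the collection name `more_numbers` are hypothetical, chosen only for this example. A single `zip` loop builds flat dictionaries, in line with the second tip in the summary.

```python
def build_rows(numbers, kinds, prices):
    """Pair up three parallel lists of strings into one flat dict per listing."""
    rows = []
    for number, kind, price in zip(numbers, kinds, prices):
        rows.append({
            "number": " ".join(number.split()),  # collapse stray whitespace
            "type": " ".join(kind.split()),
            "price": " ".join(price.split()) + "元",
        })
    return rows


def store_rows(collection, rows):
    """Insert the rows into the given collection and return how many were stored."""
    if rows:  # pymongo's insert_many rejects an empty list, so skip the call
        collection.insert_many(rows)
    return len(rows)


# Hypothetical usage inside get_info_from, once `soup` has been built:
# more_numbers = phone_number["more_numbers"]  # assumed collection name
# numbers = [a.text for a in soup.select("div.hm_1 > span > a")[3:]]
# kinds = [s.text for s in soup.select("ul > li > div.hm_2 > span")[3:]]
# prices = [s.text for s in soup.select("div.hm_3 > span")[3:]]
# store_rows(more_numbers, build_rows(numbers, kinds, prices))
```

Keeping the row building separate from the database call means the parsing step can be checked without a running MongoDB server.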