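This is a small crawler project written for practice. The target page's layout is simple, so I won't describe it in detail; the fields I want to scrape are just the ones shown in the figure.
Since this is an exercise, I wrote it two ways: one version locates elements with CSS selectors, the other with XPath. Personally I prefer CSS selectors, which I find more concise than XPath for class-based lookups like these.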
# coding:utf-8
# CSS selector exercise
# import requests
# from scrapy.selector import Selector
# url = "https://www.meijutt.com/new100.html"
# res = requests.get(url).content.decode("gb2312")
# data = Selector(text=res)
# datalist = data.css(".top-list li")
# for dataline in datalist:
#     title = dataline.css("h5 a::attr(title)").extract_first("")
#     status = dataline.css(".new100state1 font::text").extract_first("")
#     subtitle = dataline.css(".new100state1 .sub .subsheng::text").extract_first("")
#     category = dataline.css(".mjjq::text").extract_first("")
#     tvstation = dataline.css(".mjtv::text").extract_first("")
#     print(title+"--"+status+"--"+subtitle+"--"+category+"--"+tvstation)
# XPath exercise
import requests
import lxml.etree

url = "https://www.meijutt.com/new100.html"
res = requests.get(url).content.decode("gb2312")
mytree = lxml.etree.HTML(res)
# The second argument of contains() must be a quoted string. Unquoted, top-list is
# read as a child-element name, which evaluates to "" and makes the test always true.
mydata = mytree.xpath("//div[@class='top_warp']/div/ul[contains(@class, 'top-list')]/li")
for dataline in mydata:
    title = dataline.xpath("./h5/a/text()")[0]
    status = dataline.xpath("./span/font/text()")[0]
    # subtitle is optional, so keep the list and check its length below
    subtitle = dataline.xpath("./span/span[@class='sub']/em/text()")
    category = dataline.xpath("./span[@class='mjjq']/text()")[0]
    tvstation = dataline.xpath("./span[@class='mjtv']/text()")[0]
    # The update time is either direct text of the div or wrapped in a <font> tag
    update = dataline.xpath("./div[contains(@class, 'lasted-time')]/text()")
    if len(update) == 0:
        updatetime = dataline.xpath("./div[contains(@class, 'lasted-time')]/font/text()")[0]
    else:
        updatetime = update[0]
    if len(subtitle) > 0:
        print(title+"--"+status+"--"+subtitle[0]+"--"+category+"--"+tvstation+"--"+updatetime)
    else:
        print(title+"--"+status+"--"+category+"--"+tvstation+"--"+updatetime)
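The `[0]` indexing in the XPath version raises `IndexError` whenever a row lacks one of the elements, which is why `subtitle` needs its own branch. Here is a minimal sketch of one way around that, assuming a hypothetical helper named `first_or_default` (it is not part of lxml) that mimics what `extract_first("")` does in the CSS version; `join_fields` is likewise only illustrative, and plain lists stand in for real `xpath()` results:

```python
def first_or_default(results, default=""):
    """Return the first item of an xpath() result list, or default if it is empty."""
    return results[0] if results else default


def join_fields(fields, sep="--"):
    """Join the non-empty fields with sep, skipping missing values."""
    return sep.join(f for f in fields if f)


# Plain lists stand in for what dataline.xpath(...) would return
row = join_fields([
    first_or_default(["Some Show"]),
    first_or_default(["Episode 3"]),
    first_or_default([]),  # no subtitle element on this row
    first_or_default(["Drama"]),
])
print(row)
```

With helpers like these, the two nearly identical `print` calls could collapse into one, since an empty subtitle is simply skipped when joining.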
The exercise above retrieves the content I wanted. The code below goes one step further and saves the scraped data to a remote MongoDB instance.
# coding:utf-8
import requests
from scrapy.selector import Selector
import pymongo

url = "https://www.meijutt.com/new100.html"
res = requests.get(url).content.decode("gb2312")
data = Selector(text=res)
datalist = data.css(".top-list li")
meijulist = []
# Connect to MongoDB once, before the loop
myclient = pymongo.MongoClient(host="192.168.1.4", port=27017)
db = myclient['meiju']
for dataline in datalist:
    title = dataline.css("h5 a::attr(title)").extract_first("")
    status = dataline.css(".new100state1 font::text").extract_first("")
    subtitle = dataline.css(".new100state1 .sub em::text").extract_first("")
    category = dataline.css(".mjjq::text").extract_first("")
    tvstation = dataline.css(".mjtv::text").extract_first("")
    # The update time is either direct text of the element or wrapped in a <font> tag
    update = dataline.css(".lasted-time::text")
    if len(update) == 0:
        updatetime = dataline.css(".lasted-time font::text").extract_first("")
    else:
        updatetime = update.extract_first("")
    meijulist.append(title+"--"+status+"--"+subtitle+"--"+category+"--"+tvstation+"--"+updatetime)
    # insert_one() replaces Collection.insert(), which was removed in PyMongo 4
    db['meiju'].insert_one({"title": title, "status": status, "subtitle": subtitle, "category": category,
                            "tvstation": tvstation, "updatetime": updatetime})
print("Boss, the data has been inserted: {0} records in total".format(len(meijulist)))
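Inserting one document per loop iteration costs one round trip to the server per row. As a rough sketch of an alternative, assume the rows are first collected as dictionaries and then written with a single `insert_many()` call. The helper name `save_all` and the sample rows are made up for illustration, and `collection` is expected to be something like `myclient['meiju']['meiju']`:

```python
def save_all(collection, docs):
    """Insert all docs with one insert_many() call and return how many were written.

    collection is assumed to be a pymongo Collection (or any object with a
    compatible insert_many method). insert_many() rejects an empty list, so
    that case is handled before calling it.
    """
    if not docs:
        return 0
    result = collection.insert_many(docs)
    return len(result.inserted_ids)


# Hypothetical rows in the same shape as the documents built in the loop above
sample_docs = [
    {"title": "Show A", "status": "Episode 1", "subtitle": "", "category": "Drama",
     "tvstation": "Station X", "updatetime": "2019-01-01"},
    {"title": "Show B", "status": "Episode 2", "subtitle": "", "category": "Comedy",
     "tvstation": "Station Y", "updatetime": "2019-01-02"},
]
```

With a live connection, `save_all(myclient['meiju']['meiju'], sample_docs)` should return 2, so the reported count comes from the server's acknowledgement rather than from a separately maintained list.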
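The code above uses string formatting. As a side note, the following forms all produce the same result.
abc = [1, 2, 3]
print("Boss, the data has been inserted: {0} records in total".format(len(abc)))
print("Boss, the data has been inserted: {} records in total".format(len(abc)))
print("Boss, the data has been inserted: %s records in total" % len(abc))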
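On Python 3.6 and later, an f-string is a fourth option. A small sketch comparing it with the `format()` and `%` styles (the variable names and the shortened message are only illustrative):

```python
abc = [1, 2, 3]

by_index = "inserted {0} records".format(len(abc))
by_position = "inserted {} records".format(len(abc))
by_percent = "inserted %s records" % len(abc)
by_fstring = f"inserted {len(abc)} records"

# All four styles build the same string
print(by_index == by_position == by_percent == by_fstring)
```

The f-string keeps the expression next to the text it fills in, which tends to be easier to read once a message has more than one placeholder.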