21. Scraping 美剧天堂 (meijutt.com) and Saving to a Remote MongoDB

By starrymusic | Published 2019-04-02 08:26

This is a small crawler project for practice. The target page layout is shown below; it is quite simple, so I won't describe it at length. The fields to scrape are just the ones visible in the figure.

[Figure: the new100 list page, one row per show with title, status, subtitle, category, TV station, and update time]
Since this is practice, I'll write it in two ways: one version locates elements with CSS selectors, the other with XPath. Personally I prefer CSS selectors; I find them more flexible than XPath.

# coding:utf-8
# Version 1: locating elements with CSS selectors
import requests
from scrapy.selector import Selector

url = "https://www.meijutt.com/new100.html"
# the site serves gb2312-encoded pages, so decode explicitly
res = requests.get(url).content.decode("gb2312")
data = Selector(text=res)
# each <li> under .top-list is one show
datalist = data.css(".top-list li")
for dataline in datalist:
    title = dataline.css("h5 a::attr(title)").extract_first("")
    status = dataline.css(".new100state1 font::text").extract_first("")
    subtitle = dataline.css(".new100state1 .sub .subsheng::text").extract_first("")
    category = dataline.css(".mjjq::text").extract_first("")
    tvstation = dataline.css(".mjtv::text").extract_first("")
    print(title + "--" + status + "--" + subtitle + "--" + category + "--" + tvstation)
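
As an aside, newer versions of Scrapy (via its underlying parsel library) prefer get() and getall() as aliases for extract_first() and extract(). A minimal sketch of the equivalence, on a toy snippet:

from scrapy.selector import Selector

# get(default=...) behaves like extract_first(...): first match, or the default
sel = Selector(text="<li><h5><a title='demo'>demo</a></h5></li>")
print(sel.css("h5 a::attr(title)").get(default=""))    # demo
print(sel.css("h5 a::attr(title)").extract_first(""))  # demo, same result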

# Version 2: the same scrape with XPath
import requests
import lxml.etree

url = "https://www.meijutt.com/new100.html"
res = requests.get(url).content.decode("gb2312")
mytree = lxml.etree.HTML(res)
# note the quotes around 'top-list': without them contains() compares @class
# against an (empty) node-set and silently matches every <ul>
mydata = mytree.xpath("//div[@class='top_warp']/div/ul[contains(@class, 'top-list')]/li")
for dataline in mydata:
    title = dataline.xpath("./h5/a/text()")[0]
    status = dataline.xpath("./span/font/text()")[0]
    subtitle = dataline.xpath("./span/span[@class='sub']/em/text()")
    category = dataline.xpath("./span[@class='mjjq']/text()")[0]
    tvstation = dataline.xpath("./span[@class='mjtv']/text()")[0]
    # the update time is either a direct text node or wrapped in a <font> tag
    update = dataline.xpath("./div[contains(@class, 'lasted-time')]/text()")
    if len(update) == 0:
        updatetime = dataline.xpath("./div[contains(@class, 'lasted-time')]/font/text()")[0]
    else:
        updatetime = update[0]
    # not every show has a subtitle, so branch on it
    if len(subtitle) > 0:
        print(title + "--" + status + "--" + subtitle[0] + "--" + category + "--" + tvstation + "--" + updatetime)
    else:
        print(title + "--" + status + "--" + category + "--" + tvstation + "--" + updatetime)
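
One caveat with the XPath version: indexing with [0] raises IndexError the moment a field is missing from a row. A small defensive sketch (the helper name first is my own, not from the original code):

import lxml.etree

def first(node, path, default=""):
    """Return the first XPath match, or `default` when there is none."""
    result = node.xpath(path)
    return result[0] if result else default

demo = lxml.etree.HTML("<li><h5><a>demo</a></h5></li>")
print(first(demo, "//h5/a/text()"))       # demo
print(first(demo, "//span/font/text()"))  # "" instead of an IndexError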

The exercise above pulls down everything we want; the code below goes a step further and saves the scraped records to a remote MongoDB instance.

# coding:utf-8
import requests
from scrapy.selector import Selector
import pymongo

url = "https://www.meijutt.com/new100.html"
res = requests.get(url).content.decode("gb2312")
data = Selector(text=res)
datalist = data.css(".top-list li")

# connect to the remote MongoDB once, outside the loop,
# rather than reconnecting for every record
myclient = pymongo.MongoClient(host="192.168.1.4", port=27017)
db = myclient['meiju']

meijulist = []
for dataline in datalist:
    title = dataline.css("h5 a::attr(title)").extract_first("")
    status = dataline.css(".new100state1 font::text").extract_first("")
    subtitle = dataline.css(".new100state1 .sub em::text").extract_first("")
    category = dataline.css(".mjjq::text").extract_first("")
    tvstation = dataline.css(".mjtv::text").extract_first("")
    update = dataline.css(".lasted-time::text")
    if len(update) == 0:
        updatetime = dataline.css(".lasted-time font::text").extract_first("")
    else:
        updatetime = update.extract_first("")
    meijulist.append(title + "--" + status + "--" + subtitle + "--" + category + "--" + tvstation + "--" + updatetime)

    # insert_one() replaces the deprecated insert()
    db['meiju'].insert_one({"title": title, "status": status, "subtitle": subtitle, "category": category,
                            "tvstation": tvstation, "updatetime": updatetime})

print("Boss, data inserted: {0} records in total".format(len(meijulist)))

The insertion script uses string formatting for the final report line. Any of the forms below produces the same output:

abc = [1, 2, 3]
print("Boss, data inserted: {0} records in total".format(len(abc)))
print("Boss, data inserted: {} records in total".format(len(abc)))
print("Boss, data inserted: %s records in total" % len(abc))
