How do you scrape all of a blogger's Weibo posts in one go and collect them into a single HTML page?
Open Weibo in the browser and search for the page of a well-known relationship-advice blogger.
Press F12, then load the page; a long list of requests shows up under the Network tab. (F12 has to come before loading the page, otherwise Network won't capture the earlier requests.)
Looking through them carefully, one of the requests returns a big chunk of JSON. It contains a number matching the count of posts the blogger currently shows, so it likely holds the data we want. Open it.
[Screenshot: 微博-data.png]
Sure enough, there is the data we want.
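Trimmed down to just the fields this article uses later, the response looks roughly like this (an abridged sketch; all other fields are omitted):

# Abridged shape of the getIndex response (only the fields used below)
response = {
    "ok": 1,
    "data": {
        "cards": [
            {
                "mblog": {
                    "created_at": "...",       # post date
                    "text": "...",             # post body, as HTML
                    "comments_count": 0,
                    "attitudes_count": 0,      # likes
                    "reposts_count": 0,
                    "pics": [{"url": "..."}],  # only present when the post has images
                }
            },
        ]
    }
}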
But this is only the first page of data.
Scroll the page down and the browser loads the second page of data; strictly speaking, this on-demand loading is called lazy loading. The second page also comes back as JSON.
Compare the two request URLs:
First: https://m.weibo.cn/api/container/getIndex?uid=1333608873&t=0&luicode=10000011&lfid=100103type%3D1%26q%3D%E7%BE%8A%E5%B8%88%E5%A4%AA&type=uid&value=1333608873&containerid=1076031333608873
Second: https://m.weibo.cn/api/container/getIndex?uid=1333608873&t=0&luicode=10000011&lfid=100103type%3D1%26q%3D%E7%BE%8A%E5%B8%88%E5%A4%AA&type=uid&value=1333608873&containerid=1076031333608873&page=2
Clearly, all the leading parameters are identical; only the trailing page parameter differs.
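In other words, page n is just the base URL with &page=n appended. A quick sanity check before the full script (this reuses the mobile User-Agent, since m.weibo.cn serves this API to mobile clients):

import requests

header = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'
}
base = 'https://m.weibo.cn/api/container/getIndex?uid=1333608873&t=0&luicode=10000011&lfid=100103type%3D1%26q%3D%E7%BE%8A%E5%B8%88%E5%A4%AA&type=uid&value=1333608873&containerid=1076031333608873'

def page_url(page):
    # Page 1 is the bare URL; later pages just append &page=n
    return base if page == 1 else base + '&page=' + str(page)

r = requests.get(page_url(2), headers=header, timeout=(3, 6))
data = r.json()
print(data.get('ok'), len(data.get('data', {}).get('cards', [])))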
On to the code.
Directory structure:

test.py
util/
    saveUtil/
        __init__.py
        htmlUtil.py
        template.html
test.py
import requests
import json
from util.saveUtil.htmlUtil import *

myCards = []

def getWeb(url):
    try:
        header = {
            'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'
        }
        r = requests.get(url, timeout=(3, 6), headers=header)
        r = json.loads(r.text)
        if r.get("ok") == 1:
            cards = r.get("data").get("cards")
            print(len(cards))
            for card in cards:
                print(card)
                created_at = card.get("mblog").get("created_at")  # post date
                text = card.get("mblog").get("text")  # post body (HTML)
                # Truncated posts carry a 全文 ("full text") link; fetch the
                # complete body from the extend API using the status id in that link
                if text.find("全文") != -1:
                    print("全文")
                    url = 'https://m.weibo.cn/statuses/extend?id=' + str(text).split('status/')[1].split('"')[0]
                    r2 = requests.get(url, timeout=(3, 6), headers=header)
                    print(r2.text)
                    try:
                        r2 = json.loads(r2.text)
                        print(r2)
                        if r2.get("ok") == 1:
                            text = r2.get("data").get("longTextContent")  # full post body
                    except Exception as e:
                        pass
                comments_count = card.get("mblog").get("comments_count")  # comments
                attitudes_count = card.get("mblog").get("attitudes_count")  # likes
                reposts_count = card.get("mblog").get("reposts_count")  # reposts
                images = []
                try:
                    # pics is only present when the post has images
                    for pic in card.get("mblog").get("pics"):
                        print("image")
                        images.append(pic.get("url"))
                except Exception as e:
                    print(e)
                myCard = {"created_at": created_at, "text": text, "comments_count": comments_count,
                          "attitudes_count": attitudes_count, "reposts_count": reposts_count, "images": images}
                myCards.append(myCard)
    except Exception as e:
        pass

i = 1
while True:
    try:
        # Page 1 is the bare URL; later pages append &page=i
        url = 'https://m.weibo.cn/api/container/getIndex?uid=1333608873&t=0&luicode=10000011&lfid=100103type%3D1%26q%3D%E7%BE%8A%E5%B8%88%E5%A4%AA&type=uid&value=1333608873&containerid=1076031333608873'
        if i > 1:
            url = url + '&page=' + str(i)
        getWeb(url)
        i = i + 1
        if i >= 1000:  # hard cap; pages past the last one return no usable cards
            print("done")
            readerHtml("羊师太", myCards)
            break
        print(i)
    except Exception as e:
        print(e)
        break
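One step in getWeb is worth unpacking: a truncated post's HTML body contains a 全文 ("full text") link of the form href="https://m.weibo.cn/status/<id>", and the split chain pulls that id out to feed the extend API. A tiny illustration (the sample HTML below is made up):

text = '... <a href="https://m.weibo.cn/status/4567890123456789">全文</a>'
status_id = text.split('status/')[1].split('"')[0]
print(status_id)  # 4567890123456789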
htmlUtil.py
Install the jinja2 module first:
pip install jinja2
from jinja2 import Environment, FileSystemLoader

def readerHtml(title, myCards):
    # Look up templates relative to the working directory (run from the project root)
    env = Environment(loader=FileSystemLoader('./'))
    template = env.get_template('util/saveUtil/template.html')
    with open("util/saveUtil/result.html", 'wb+') as fout:
        html_content = template.render(title=title, cards=myCards)
        fout.write(str(html_content).encode("utf-8"))
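To smoke-test the helper on its own, run something like the following from the project root (so that FileSystemLoader('./') can find the template; the sample card below is made up):

from util.saveUtil.htmlUtil import readerHtml

sample = [{"created_at": "01-01", "text": "test post", "comments_count": 0,
           "attitudes_count": 0, "reposts_count": 0, "images": []}]
readerHtml("test", sample)  # writes util/saveUtil/result.html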
template.html
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
</head>
<body>
<h1>{{title}}</h1>
<ul>
    <hr>
    {% for card in cards %}
    <li style="float: none;list-style:none">
        <h6>Posted: {{card.created_at}}</h6>
        {{card.text}}
        <br>
        {% if card.images %}
        <ul style="width: 1200px;height: 300px">
            {% for image in card.images %}
            <li style="list-style:none;float:left;margin: 5px">
                <div style="width: 200px;height:300px;overflow:hidden;">
                    <img src="{{image}}">
                </div>
            </li>
            {% endfor %}
        </ul>
        {% endif %}
        <br>
        <p style="float: none">Reposts: {{card.reposts_count}} Comments: {{card.comments_count}} Likes: {{card.attitudes_count}}</p>
    </li>
    <hr>
    {% endfor %}
</ul>
</body>
</html>
This template.html is the template for the generated page: the scraped data gets rendered into it to produce result.html. Since Jinja2's Environment does not autoescape by default, the HTML inside card.text (links, <br> tags and so on) renders as real markup rather than escaped text. Open result.html in a browser; only a portion is captured here.
[Screenshot: 微博-网页.png]
All of this is scraping without logging in, so some data can't be fetched; the logged-in case is still being worked out.