Preparation
Language: Python
Required libraries: requests, re, math
Getting the dynamically loaded URL
Open your Jianshu profile page (Firefox or Chrome recommended), right-click and choose Inspect, switch to the Network tab, then scroll the page so new content loads. Find the request shown below and note its URL and request headers for later use.

Request URL:
https://www.jianshu.com/u/d2121d5ecf94?order_by=shared_at&page=2
The last parameter, page=2, is the page number; later we will compute the total number of pages from the article count.
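Since each request returns 9 articles, the total number of pages follows from the article count by rounding up. A minimal sketch (the helper name page_count is illustrative, not part of the scraper):

```python
import math

def page_count(total_articles, per_page=9):
    """Pages needed to cover all articles, at 9 per page."""
    return math.ceil(total_articles / per_page)

print(page_count(25))  # 25 articles -> 3 pages
```

This is exactly the math.ceil(counts / 9) expression used in the loop later on.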
The request headers are stored as a dict. Note that the Cookie and X-CSRF-Token values are session-specific and expire; copy fresh ones from your own browser. (The User-Agent below is a placeholder; use your browser's full User-Agent string.)
headers = {
    'Accept': 'text/html, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Connection': 'keep-alive',
    'Cookie': 'Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1515486399,1515486408,1515505449,1515550122; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%228748118%22%2C%22%24device_id%22%3A%22160d07688ab449-08fc8d4ba508368-4c322e7d-1327104-160d07688ac316%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_utm_source%22%3A%22weixin%22%2C%22%24latest_utm_medium%22%3A%22reader_share%22%2C%22%24latest_utm_campaign%22%3A%22haruki%22%2C%22%24latest_utm_content%22%3A%22note%22%7D%2C%22first_id%22%3A%22160d07688ab449-08fc8d4ba508368-4c322e7d-1327104-160d07688ac316%22%7D; remember_user_token=W1s4NzQ4MTE4XSwiJDJhJDExJGJLRERsWHFzY0N5U2lPNHFIN1B4Wk8iLCIxNTE1MzI1OTM3LjkzMzQ4NzciXQ%3D%3D--8a5ac5ef497afd64b897cb834f68d7c8721e2a58; read_mode=day; default_font=font2; locale=zh-CN; _m7e_session=d6bfc751747df6b4b7a2d40d38274bf9; Hm_lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1515550507',
    'Host': 'www.jianshu.com',
    'Referer': 'https://www.jianshu.com/u/d2121d5ecf94',
    'User-Agent': 'Mozilla/5.0',  # replace with your browser's full User-Agent string
    'X-CSRF-Token': 'maoGbanLGlZUl3lesp7UfB+2qZGuaLopoDeb8kRGecBdjtsdzH+NOJ2bvp1JLfaPyoDnbh4NS7vVjUHCG0D/6Q==',
    'X-INFINITESCROLL': 'true',
    'X-Requested-With': 'XMLHttpRequest'
}
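The X-INFINITESCROLL and X-Requested-With entries are what mark the request as an AJAX page-load. You can sanity-check that they are attached without sending anything, by preparing (but not sending) a request; the sketch below uses just those two headers for brevity:

```python
import requests

headers = {
    'X-INFINITESCROLL': 'true',
    'X-Requested-With': 'XMLHttpRequest',
}

# Prepare (but do not send) a GET so the final headers can be inspected
req = requests.Request(
    'GET',
    'https://www.jianshu.com/u/d2121d5ecf94?order_by=shared_at&page=2',
    headers=headers,
)
prepared = req.prepare()
print(prepared.headers['X-Requested-With'])  # XMLHttpRequest
```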
Scraping approach
Get the dynamically loaded URL, compute the number of pages from the article count, then loop over the pages with requests. Finally, extract the needed fields with re and store them.
In this example the results are stored as a list.
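The re-based extraction can be previewed offline on a small hand-made HTML fragment shaped like one entry of Jianshu's article list (the fragment below is illustrative, not captured from the real page):

```python
import re

# Hypothetical fragment mimicking one article entry on the profile page
html = ('<span class="time" data-shared-at="2018-01-10T08:30:00+08:00">'
        '<a class="title" target="_blank" href="/p/abc123">Sample title</a>')

times = re.findall(r'data-shared-at="(.*?)T(.*?)\+08:00"', html)
titles = re.findall(r'<a class="title" target="_blank" href="(.*?)">(.*?)</a>', html)

print(times[0])   # ('2018-01-10', '08:30:00')
print(titles[0])  # ('/p/abc123', 'Sample title')
```

With two capture groups per pattern, re.findall returns a list of tuples, which is why the full code indexes results as time[i][0], title[i][1], and so on.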
The full code:
# coding=utf-8
import requests
import re
import math

# Custom request headers. The Cookie and X-CSRF-Token values are
# session-specific; copy fresh ones from your own browser's Network tab.
headers = {
    'Accept': 'text/html, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Connection': 'keep-alive',
    'Cookie': 'Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1515486399,1515486408,1515505449,1515550122; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%228748118%22%2C%22%24device_id%22%3A%22160d07688ab449-08fc8d4ba508368-4c322e7d-1327104-160d07688ac316%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_utm_source%22%3A%22weixin%22%2C%22%24latest_utm_medium%22%3A%22reader_share%22%2C%22%24latest_utm_campaign%22%3A%22haruki%22%2C%22%24latest_utm_content%22%3A%22note%22%7D%2C%22first_id%22%3A%22160d07688ab449-08fc8d4ba508368-4c322e7d-1327104-160d07688ac316%22%7D; remember_user_token=W1s4NzQ4MTE4XSwiJDJhJDExJGJLRERsWHFzY0N5U2lPNHFIN1B4Wk8iLCIxNTE1MzI1OTM3LjkzMzQ4NzciXQ%3D%3D--8a5ac5ef497afd64b897cb834f68d7c8721e2a58; read_mode=day; default_font=font2; locale=zh-CN; _m7e_session=d6bfc751747df6b4b7a2d40d38274bf9; Hm_lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1515550507',
    'Host': 'www.jianshu.com',
    'Referer': 'https://www.jianshu.com/u/d2121d5ecf94',
    'User-Agent': 'Mozilla/5.0',  # replace with your browser's full User-Agent string
    'X-CSRF-Token': 'maoGbanLGlZUl3lesp7UfB+2qZGuaLopoDeb8kRGecBdjtsdzH+NOJ2bvp1JLfaPyoDnbh4NS7vVjUHCG0D/6Q==',
    'X-INFINITESCROLL': 'true',
    'X-Requested-With': 'XMLHttpRequest',
}

url = 'https://www.jianshu.com/u/d2121d5ecf94'
response = requests.get(url, headers=headers)
text = response.content.decode('utf-8')
# Total number of articles on the profile
counts = int(re.findall('<a href="/u/d2121d5ecf94">\n <p>(.*?)</p>\n 文章 <i class="iconfont ic-arrow">', text, re.S)[0])

articles = []
# Fetch every page of the timeline (9 articles per page)
for page in range(1, math.ceil(counts / 9) + 1):
    url = 'https://www.jianshu.com/u/d2121d5ecf94?order_by=shared_at&page=%d' % page  # dynamically loaded URL
    response = requests.get(url, headers=headers)
    text = response.content.decode('utf-8')
    times = re.findall(r'<span class="time" data-shared-at="(.*?)T(.*?)\+08:00">', text)   # publication times
    titles = re.findall(r'<a class="title" target="_blank" href="(.*?)">(.*?)</a>', text)  # links and titles
    summaries = re.findall('<p class="abstract">\n (.*?)\n </p>', text, re.S)              # summaries
    for i in range(len(times)):  # collect one tuple per article
        articles.append((times[i][0] + ' ' + times[i][1],
                         'https://www.jianshu.com' + titles[i][0],
                         titles[i][1],
                         summaries[i]))
print(articles)
Running the code above yields the publication time, title, and link of each article along with its summary. Other fields, such as the avatar, author, read count, comment count, and like count, can be scraped in the same way if needed.
This article is for technical exchange only; please do not use it for other purposes. All rights reserved; contact the author before reposting. Thanks!
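If you want to persist the collected list rather than just printing it, a minimal sketch writes it to CSV (the filename articles.csv and the column names are assumptions; the sample tuple stands in for the scraper's output):

```python
import csv

# `articles` as built by the scraper: (time, url, title, summary) tuples
articles = [('2018-01-10 08:30:00', 'https://www.jianshu.com/p/abc123',
             'Sample title', 'Sample summary')]

with open('articles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['time', 'url', 'title', 'summary'])  # header row
    writer.writerows(articles)
```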