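Requirements Analysis
Jianshu's built-in article analytics are weak: articles can only be sorted by popularity, and judging from the page, popularity means the number of likes.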
![](https://img.haomeiwen.com/i1864602/ec74810f572f9031.png)
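Yet an article has several dimensions worth analyzing: reads, comments, and likes. Jianshu offers no analysis along these dimensions.
So let's roll up our sleeves and build it ourselves.
The requirement is simple: scrape the reads, comments, likes, rewards, title, and publish time of my own Jianshu articles, store them in a database, then analyze and display them.
The result looks like this: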
![](https://img.haomeiwen.com/i1864602/5e32b3298932ef83.gif)
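This is only the simplest display; other analyses can be added as needed.
Implementation
Data scraping
The page data is scraped with Python. Before scraping, inspect the page's HTML structure: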
![](https://img.haomeiwen.com/i1864602/c43e0a7927f678e4.png)
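In case the screenshot is hard to read, here is a rough sketch of the list-item structure the scraper relies on. The markup is a hypothetical reconstruction inferred from the selectors the script uses (`li` with `data-note-id`, `a.title`, `span.time` with `data-shared-at`), not a copy of the real page, and the `NoteListParser` class is an illustrative name. It uses the standard-library `html.parser` instead of pyquery so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Hypothetical markup, reconstructed from the selectors the scraper uses.
SAMPLE = """
<ul class="note-list">
  <li data-note-id="1001">
    <a class="title" href="/p/abc123">Sample article</a>
    <div class="meta">
      <a><i class="iconfont ic-list-read"></i> 120</a>
      <a><i class="iconfont ic-list-comments"></i> 3</a>
      <span><i class="iconfont ic-list-like"></i> 8</span>
      <span class="time" data-shared-at="2017-12-01T10:20:30+08:00"></span>
    </div>
  </li>
</ul>
"""


class NoteListParser(HTMLParser):
    """Collects note id, link, title and publish time for each <li>."""

    def __init__(self):
        super().__init__()
        self.notes = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "data-note-id" in attrs:
            # Each article is one <li> carrying a data-note-id attribute
            self.notes.append({"note_id": attrs["data-note-id"], "title": ""})
        elif tag == "a" and "title" in classes and self.notes:
            self.notes[-1]["link"] = attrs.get("href")
            self._in_title = True
        elif tag == "span" and "time" in classes and self.notes:
            self.notes[-1]["shared_at"] = attrs.get("data-shared-at")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <a class="title"> is part of the title
        if self._in_title:
            self.notes[-1]["title"] += data.strip()


parser = NoteListParser()
parser.feed(SAMPLE)
print(parser.notes)
```

The counters (reads, comments, likes) sit next to an icon element, which is why the scraper finds the icon first and then reads the text of its parent.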
The implementation:
# -*- coding: utf-8 -*-
import requests
import pyquery
import time
import datetime
import pymysql

# Database connection settings
conn = pymysql.connect(host='127.0.0.1', user='root', passwd=None, db='test', charset='utf8')
cur = conn.cursor()

user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36' \
             ' (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
# The header value is the bare user-agent string, without a "user-agent:" prefix
headers = {"User-Agent": user_agent}

baseUrl = 'https://www.jianshu.com/u/f9338eda7dda?page='
page = 0
flag = True
while flag:
    page += 1
    url = baseUrl + str(page)
    print(url)
    # Fetch the page
    req = requests.get(url, headers=headers, timeout=2)
    pageText = req.text
    pq = pyquery.PyQuery(pageText)
    contents = pq('li')
    for x in contents:
        el = pq(x)
        title = el.find('a.title').text()
        if title:
            nodeId = el.attr('data-note-id')
            # A missing data-note-id means every article has been fetched, so leave the loop
            if nodeId is None:
                flag = False
                break
            link = 'https://www.jianshu.com' + el.find('a.title').attr('href')  # article link
            postTime = el.find('span.time').attr('data-shared-at')  # publish time
            # %z parses the "+08:00" offset (Python 3.7+), so the result does not
            # depend on the time zone of the machine running the script
            dateTime = datetime.datetime.strptime(postTime, "%Y-%m-%dT%H:%M:%S%z")
            create_time = int(dateTime.timestamp())
            read_num = el.find('i.ic-list-read').parent().text()  # reads
            comment_num = el.find('i.ic-list-comments').parent().text()  # comments
            like_num = el.find('i.ic-list-like').parent().text()  # likes
            money_num = el.find('i.ic-list-money').parent().text()  # rewards
            # Compare by value; "is" tests identity and is unreliable for strings
            if money_num == '':
                money_num = 0
            # Store the row; the driver binds the values, so quotes in a title are safe
            analyze_time = int(time.time())
            sql = "insert into analyze_article " \
                  "(title, link, create_time, analyze_time, read_num, like_num, comment_num, money_num) " \
                  "values (%s, %s, %s, %s, %s, %s, %s, %s)"
            cur.execute(sql, (title, link, create_time, analyze_time,
                              read_num, like_num, comment_num, money_num))
            conn.commit()
    # Pause for 1 second to avoid being blocked by Jianshu's anti-crawler checks
    time.sleep(1)

cur.close()
conn.close()
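The script writes to an `analyze_article` table whose definition is never shown. The sketch below guesses at a schema from the column names in the `insert` statement (the column types and the `id` column are assumptions) and shows a parameterized insert, which stores a title containing a quote intact. It uses the standard-library `sqlite3` with an in-memory database as a stand-in for MySQL; with pymysql the placeholder would be `%s` rather than `?`. The helper name `save_article` is hypothetical.

```python
import sqlite3
import time

# Assumed schema: column names come from the insert statement,
# the types and the id column are a guess.
SCHEMA = """
CREATE TABLE analyze_article (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    title        TEXT    NOT NULL,
    link         TEXT    NOT NULL,
    create_time  INTEGER NOT NULL,  -- publish time, Unix timestamp
    analyze_time INTEGER NOT NULL,  -- scrape time, Unix timestamp
    read_num     INTEGER NOT NULL DEFAULT 0,
    like_num     INTEGER NOT NULL DEFAULT 0,
    comment_num  INTEGER NOT NULL DEFAULT 0,
    money_num    INTEGER NOT NULL DEFAULT 0
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)


def save_article(conn, title, link, create_time, read_num, like_num,
                 comment_num, money_num):
    """Insert one scraped row; values are bound, never interpolated."""
    conn.execute(
        "INSERT INTO analyze_article "
        "(title, link, create_time, analyze_time, "
        " read_num, like_num, comment_num, money_num) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (title, link, create_time, int(time.time()),
         int(read_num), int(like_num), int(comment_num), int(money_num)),
    )
    conn.commit()


# The scraped counts arrive as text, so they are converted with int().
save_article(conn, "It's a sample title", "https://www.jianshu.com/p/abc123",
             1512094830, "120", "8", "3", 0)

row = conn.execute(
    "SELECT title, read_num, money_num FROM analyze_article").fetchone()
print(row)
```

Because every run appends new rows with a fresh `analyze_time`, the table keeps a history of each article's counts over time rather than only the latest values.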
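Reading the data with PHP
Once the crawler has stored the data, PHP serves as the backend and reads the rows from the table.
The script is simple enough to need no explanation, so here is the code: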
<?php
header("Access-Control-Allow-Origin:*"); // Needed when client and server are on different origins; otherwise the browser reports a cross-origin error
$con = mysqli_connect("localhost", "root", "", "test");
// Midnight today minus 24 hours: only rows scraped since yesterday 00:00 are returned
$analyzeTime = strtotime(date('Y-m-d', time())) - 3600 * 24;
$sql = "SELECT * FROM analyze_article where analyze_time >= $analyzeTime";
// The sort column is chosen from fixed strings, never taken from the request itself
$order = '';
if (isset($_GET['read_num'])) {
    $order = " order by read_num desc";
}
if (isset($_GET['like_num'])) {
    $order = " order by like_num desc";
}
if (isset($_GET['comment_num'])) {
    $order = " order by comment_num desc";
}
if (isset($_GET['money_num'])) {
    $order = " order by money_num desc";
}
$sql .= $order;
$result = mysqli_query($con, $sql);
$data = mysqli_fetch_all($result, MYSQLI_ASSOC);
mysqli_free_result($result);
mysqli_close($con);
// The second argument of json_encode is a flags bitmask, so no "true" is passed here
echo json_encode($data);
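The query logic of the endpoint is short but easy to get wrong, so here it is restated as a Python sketch: keep only rows scraped since midnight yesterday, and sort by a column only when it is one of the four known metrics. The names `build_query`, `cutoff_timestamp`, and `ALLOWED_ORDER` are illustrative and not part of the original script.

```python
import datetime

# The only columns a request may sort by.
ALLOWED_ORDER = ("read_num", "like_num", "comment_num", "money_num")


def cutoff_timestamp(now=None):
    """Unix timestamp for local midnight today minus 24 hours."""
    now = now or datetime.datetime.now()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(midnight.timestamp()) - 24 * 3600


def build_query(params, now=None):
    """Return (sql, args) for the given query-string parameters."""
    sql = "SELECT * FROM analyze_article WHERE analyze_time >= %s"
    order = ""
    # As with the chain of if statements in the PHP script,
    # the last matching key wins when several are present.
    for column in ALLOWED_ORDER:
        if column in params:
            order = " ORDER BY " + column + " DESC"
    return sql + order, (cutoff_timestamp(now),)


print(build_query({"like_num": "1"}))
```

The column name placed in the SQL always comes from the fixed tuple, so an unexpected query-string key simply produces an unsorted result instead of reaching the database.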
Displaying the data with vue.js
The PHP backend returns JSON, and vue.js parses it and renders it on the page.
<!doctype html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <link href="https://cdn.bootcss.com/bootstrap/3.3.7/css/bootstrap.min.css" rel="stylesheet">
    <!-- Pinned to Vue 2: the code below uses the "new Vue" API -->
    <script src="https://cdn.jsdelivr.net/npm/vue@2/dist/vue.js"></script>
    <script src="https://unpkg.com/axios/dist/axios.min.js"></script>
    <title>Jianshu Article Analysis</title>
    <style>
        .container {
            margin-top: 2%;
        }
    </style>
</head>
<body>
<div class="container">
    <div id="app">
        <h3 class="text-center">Jianshu Article Analysis</h3>
        <table class="table table-bordered table-hover">
            <tr>
                <th>Title</th>
                <th><a href="" @click.prevent="changeOrder('read_num')" class="text-info">Reads</a></th>
                <th><a href="" @click.prevent="changeOrder('like_num')" class="text-danger">Likes</a></th>
                <th><a href="" @click.prevent="changeOrder('comment_num')" class="text-warning">Comments</a></th>
                <th><a href="" @click.prevent="changeOrder('money_num')" class="text-success">Rewards</a></th>
            </tr>
            <tr v-for="item in list">
                <td><a :href="item.link" target="_blank">{{ item.title }}</a></td>
                <td>{{ item.read_num }}</td>
                <td>{{ item.like_num }}</td>
                <td>{{ item.comment_num }}</td>
                <td>{{ item.money_num }}</td>
            </tr>
        </table>
    </div>
</div>
<script>
    let url = 'http://local.php.com/jianshu.php';
    let vm = new Vue({
        el: '#app',
        data: {
            list: []
        },
        methods: {
            // Reload the list sorted by the clicked column
            changeOrder: function (sign) {
                let reqUrl = url + '?' + sign + '=1';
                axios.get(reqUrl)
                    .then(function (response) {
                        vm.$data.list = response.data;
                    })
                    .catch(function (error) {
                        console.log(error);
                    });
            }
        }
    });
    // Initial load, in the default order
    axios.get(url)
        .then(function (response) {
            vm.$data.list = response.data;
        })
        .catch(function (error) {
            console.log(error);
        });
</script>
</body>
</html>
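If you are unfamiliar with vue.js, the series index "Learning vue.js by Example" is a recommended starting point.
Summary
Beyond this minimal sorting by different dimensions, the data can be analyzed from other angles, provided you have enough of it. You can also point the program at the Jianshu home pages of well-known authors; analyzing their articles helps show what makes them good.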
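As one example of another angle, the sketch below ranks articles by like rate (likes divided by reads) rather than by raw counts. The rows are made-up sample data in the shape the PHP endpoint returns, and `like_rate` is a hypothetical helper name.

```python
def like_rate(row):
    """Likes per read; 0.0 when the article has no reads yet."""
    reads = int(row["read_num"])
    return int(row["like_num"]) / reads if reads else 0.0


# Made-up rows; counts may arrive as strings, so they are converted first.
rows = [
    {"title": "Article A", "read_num": "1000", "like_num": "20"},
    {"title": "Article B", "read_num": "200", "like_num": "10"},
    {"title": "Article C", "read_num": "0", "like_num": "0"},
]

ranked = sorted(rows, key=like_rate, reverse=True)
for row in ranked:
    print("{:<10} {:.1%}".format(row["title"], like_rate(row)))
```

An article with few reads but a high like rate can rank above one with more total likes, which a plain sort by likes would hide.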