Crawler Notes (2) --- A Method for Analyzing Popular Titles

Author: 虎七 | Published: 2018-03-08 02:36 | Read 136 times

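Introduction

Many clickbait articles get very high click-through rates because their titles hit on the interests of many readers. So how do they manage that?

We restrict the setting to one specific topic. Every article title in that topic carries potential points of interest; if we split the titles into individual keywords, popular titles should, with high probability, contain the keywords that represent what readers care about.

The experiment below works on data that has already been crawled; see:

Crawler Notes (1) --- Crawling Jianshu Topics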

Prerequisites

MySQL

We need to run queries against the database; for usage details, see:

https://dev.mysql.com/doc/connector-python/en/connector-python-example-cursor-select.html
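As a rough illustration of the kind of SELECT used later in this article, here is a minimal sketch that runs against an in-memory SQLite database from the Python standard library instead of MySQL, so it can be tried without a server. This is a swapped-in stand-in, not the article's setup: the table and column names follow the query in the data-processing section, the rows are invented, and the helper name `fetch_titles` is hypothetical. SQLite uses `?` placeholders, whereas mysql.connector uses `%s`.

```python
import sqlite3

# Stand-in for the MySQL database: an in-memory SQLite table with the
# same columns the article queries. The rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article_desc ("
    "title TEXT, read_num INTEGER, reply_num INTEGER, "
    "favor_num INTEGER, topic TEXT)"
)
conn.executemany(
    "INSERT INTO article_desc VALUES (?, ?, ?, ?, ?)",
    [
        ("title b", 120, 3, 10, "故事"),
        ("title a", 45, 0, 2, "故事"),
        ("title c", 300, 9, 40, "程序员"),
    ],
)


def fetch_titles(connection, topic):
    """Return (title, read_num) rows for one topic, ordered by title."""
    cursor = connection.execute(
        "SELECT title, read_num FROM article_desc "
        "WHERE topic = ? ORDER BY title ASC",
        (topic,),  # bound as a parameter, never formatted into the SQL
    )
    return cursor.fetchall()


rows = fetch_titles(conn, "故事")
print(rows)
```

Binding the topic as a parameter keeps quoting and escaping out of the query string, which matters as soon as the topic name comes from user input.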

Chinese word segmentation

We use the Chinese word segmentation library jieba.

Installation:

pip install jieba

Project:

https://github.com/fxsjy/jieba

It is also simple to use:

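import jieba

jieba.initialize()  # initialize manually (otherwise the dictionary is loaded lazily)

text = '你好好陪她，我四海为家'  # renamed from str to avoid shadowing the built-in
seg_list = jieba.cut(text, cut_all=False)  # accurate mode
print(' '.join(seg_list))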

Data processing

Query the data and segment each title as it is read:

import jieba
import jieba.posseg as pseg
import mysql.connector

# `topic` is the name of the topic to analyze, e.g. '故事'

# Open the database
cnx = mysql.connector.connect(user='xxx', password='xxx', database='jianshu')
cursor = cnx.cursor()

# Fetch the rows one by one
result = {}     # word -> accumulated read count
result_hp = {}  # word -> part-of-speech tag
hp_stat = {}    # part-of-speech tag -> number of occurrences

# Pass topic as a query parameter instead of formatting it into the SQL string
cursor.execute(
    "SELECT title,read_num,reply_num,favor_num FROM article_desc "
    "WHERE topic=%s ORDER BY title ASC", (topic,))

for title, read_num, reply_num, favor_num in cursor:
    words = jieba.cut(title, cut_all=False)
    for word in words:
        result[word] = result.get(word, 0) + int(read_num)

    # Part-of-speech analysis
    words = pseg.cut(title)
    for word, flag in words:
        result_hp[word] = flag
        hp_stat[flag] = hp_stat.get(flag, 0) + 1

# Close the database
cursor.close()
cnx.close()
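The accumulation loop above can be tried without a database or jieba. The sketch below is only an illustration under stated assumptions: the rows are invented, a plain whitespace split stands in for `jieba.cut` (real Chinese titles need a proper segmenter), the helper name `accumulate_reads` is hypothetical, and `collections.Counter` replaces the manual dictionary bookkeeping.

```python
from collections import Counter


def accumulate_reads(rows, tokenize):
    """Sum read counts per word.

    rows: iterable of (title, read_num) pairs.
    tokenize: function mapping a title to an iterable of words.
    """
    totals = Counter()
    for title, read_num in rows:
        for word in tokenize(title):
            totals[word] += int(read_num)
    return totals


# Invented sample data; str.split stands in for jieba.cut here.
sample_rows = [
    ("python crawler basics", 100),
    ("crawler tips", 50),
    ("python tips", 30),
]
totals = accumulate_reads(sample_rows, str.split)
print(totals.most_common(2))
```

Each word is credited with the full read count of every title it appears in, so a word that occurs in many titles accumulates a high total even if each of those titles is only moderately popular.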

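The segmentation code above includes a part-of-speech step:

# Part-of-speech analysis
words = pseg.cut(title)

Besides each keyword, this also returns a tag for the keyword's part of speech; a noun, for example, is tagged 'n'. Some commonly used tags are summarized below: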

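HP_DESC = {
    u'Ag': '形语素',    # adjective morpheme
    u'n': '名词',       # noun
    u'v': '动词',       # verb
    u'x': '非语素字',   # non-morpheme character
    u'r': '代词',       # pronoun
    u'm': '数词',       # numeral
    u'd': '副词',       # adverb
    u'a': '形容词',     # adjective
    u'nr': '人名',      # person name
    u'p': '介词',       # preposition
    u'c': '连词',       # conjunction
    u't': '时间词',     # time word
    u'ns': '地名',      # place name
    u'f': '方位词',     # direction/locality word
    u'i': '成语',       # idiom
    u'l': '习用语',     # set phrase
    u'vn': '名动词',    # verbal noun
    u'y': '语气词',     # modal particle
    u'u': '助词',       # auxiliary word
    u'nz': '其他专名',  # other proper noun
    u's': '处所词',     # place word
    u'q': '量词',       # measure word
}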
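One way to use such a table is to turn raw tag counts into readable labels. The sketch below is an assumption about how the mapping might be applied, not code from the original program: it uses a three-entry excerpt of the table, invented counts, and a hypothetical helper name `describe_flags`. Tags missing from the table fall back to the raw tag.

```python
# Three-entry excerpt of the tag table, enough for the example.
HP_DESC_SAMPLE = {'n': '名词', 'v': '动词', 'a': '形容词'}


def describe_flags(flag_counts, descriptions):
    """Map {tag: count} to a list of (description, count) pairs.

    Tags without a description keep their raw tag as the label.
    """
    return [(descriptions.get(flag, flag), count)
            for flag, count in flag_counts.items()]


# Invented counts; 'eng' is not in the excerpt, so it stays as-is.
hp_stat_sample = {'n': 12, 'v': 7, 'eng': 2}
labelled = describe_flags(hp_stat_sample, HP_DESC_SAMPLE)
print(labelled)
```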

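Next, sort the results collected above: the words by accumulated read count in descending order, and the part-of-speech tags by occurrence count in ascending order:

import operator

# Sort
sorted_items = sorted(result.items(), key=operator.itemgetter(1), reverse=True)
sorted_hp = sorted(hp_stat.items(), key=operator.itemgetter(1), reverse=False)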
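To see what the two sort orders produce, here is a small sketch on invented counts. The variable names and the data are assumptions made for illustration only.

```python
import operator

# Invented word -> read count data.
result_sample = {'爬虫': 900, '入门': 400, '教程': 650, '方法': 120}

# Descending by count: the most-read words come first.
by_reads_desc = sorted(result_sample.items(),
                       key=operator.itemgetter(1), reverse=True)

# Ascending by count: the least frequent entries come first.
by_reads_asc = sorted(result_sample.items(), key=operator.itemgetter(1))

print(by_reads_desc[0], by_reads_asc[0])
```

`operator.itemgetter(1)` picks the second element of each `(key, value)` pair, so both calls sort on the count and leave the word along for the ride.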

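To display the data graphically, we define a function that plots it:

import sys

import matplotlib as mpl
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.ticker import FormatStrFormatter

# Work around UnicodeWarning (only needed, and only possible, on Python 2)
if sys.version_info[0] == 2:
    reload(sys)
    sys.setdefaultencoding('utf-8')

# Fix Chinese text not rendering correctly in figures
mpl.rcParams['font.sans-serif'] = ['FangSong']  # default font; it must be installed on the system
mpl.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from showing as a box in saved images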

def showBar(title, result):
    # Extract the data: result is a list of (label, value) pairs
    x_datas = [int(x[1]) for x in result]
    x_labels = [x[0] for x in result]

    fig = plt.figure(figsize=(len(x_datas), 6))
    ax1 = plt.subplot(111)
    data = np.array(x_datas)
    width = 0.5
    x_bar = np.arange(len(x_datas))
    # Pass the x positions positionally; newer matplotlib versions no longer accept left=
    rect = ax1.bar(x_bar, data, width=width, color="lightblue")

    # Add a data label above each bar
    # for rec in rect:
    #     x = rec.get_x()
    #     height = rec.get_height()
    #     ax1.text(x + 0.01, 1.02 * height, '%.0f' % height)

    # Draw the x/y axis ticks and labels, and the title
    ax1.yaxis.set_major_formatter(FormatStrFormatter('%1.0f'))
    ax1.set_xticks(x_bar)
    ax1.set_xticklabels(tuple(x_labels), rotation=45)
    ax1.set_ylabel("点击次数")  # "click count"
    ax1.set_title(title)
    ax1.grid(False)
    ax1.set_ylim(0, max(x_datas) * 1.1)

    # Show or save
    plt.show()
    # plt.savefig('{}.png'.format(title), dpi=100)

Finally, the test program:

def stat_topic(topic):
    ...
    # Output
    showBarAndSave(topic, '名词', sorted_items, result_hp, 20)  # nouns
    showBarAndSave(topic, '动词', sorted_items, result_hp, 20)  # verbs
    ...

# Test program
stat_topic('故事')    # the "Stories" topic
stat_topic('程序员')  # the "Programmers" topic
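The definition of showBarAndSave is not shown in this article. Judging only from its arguments, one plausible preparatory step is to keep the words with a given part of speech and take the first N by read count. The helper below, `top_words_by_pos`, is a hypothetical sketch of that step under this assumption, with invented data; it filters on the raw tag such as 'n' rather than on the description '名词' passed above, and it does no plotting.

```python
def top_words_by_pos(sorted_items, word_flags, flag, limit):
    """Return the first `limit` (word, count) pairs whose tag equals `flag`.

    sorted_items: (word, count) pairs, already sorted by count descending.
    word_flags: dict mapping each word to its part-of-speech tag.
    """
    selected = [(word, count) for word, count in sorted_items
                if word_flags.get(word) == flag]
    return selected[:limit]


# Invented sample data for illustration.
sorted_items_sample = [('爬虫', 900), ('学习', 700), ('教程', 650), ('分析', 300)]
word_flags_sample = {'爬虫': 'n', '学习': 'v', '教程': 'n', '分析': 'v'}

top_nouns = top_words_by_pos(sorted_items_sample, word_flags_sample, 'n', 20)
top_verbs = top_words_by_pos(sorted_items_sample, word_flags_sample, 'v', 1)
print(top_nouns, top_verbs)
```

Because the input is already sorted by count, filtering first and slicing afterwards keeps the ranking intact within each part of speech.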