Analyzing Jobbole (伯乐在线) Article Data

Author: 人机分离机 | Published: 2018-02-15 20:34

I. Reading the article data

Use pandas to read the MySQL data into a DataFrame:

import pandas as pd
from sqlalchemy import create_engine

# Connection settings for the local MySQL database
db_info = {'user': 'root',
           'password': '',
           'host': 'localhost',
           'database': 'article_spider'
          }
# charset=utf8 in the URL sets the connection encoding
engine = create_engine('mysql://%(user)s:%(password)s@%(host)s/%(database)s?charset=utf8' % db_info)
sql = 'select * from jobbole_article;'
df = pd.read_sql(sql, con=engine)
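
To try the same read pattern without a MySQL server, here is a minimal sketch that swaps in an in-memory SQLite database. The helper name load_articles, the three-column schema, and the sample rows are assumptions made up for illustration; only the table name jobbole_article comes from the query above.

```python
import sqlite3
import pandas as pd

def load_articles(con):
    """Read the whole jobbole_article table into a DataFrame (hypothetical helper)."""
    return pd.read_sql('select * from jobbole_article;', con=con)

# Stand-in for the MySQL database: an in-memory SQLite table with made-up rows
con = sqlite3.connect(':memory:')
con.execute('create table jobbole_article (title text, create_date text, tags text)')
con.executemany(
    'insert into jobbole_article values (?, ?, ?)',
    [('Post A', '2017-03-01', 'Python,爬虫'),
     ('Post B', '2016-11-20', '职场')],
)
con.commit()

demo_df = load_articles(con)
print(demo_df.shape)
```

pd.read_sql accepts either a SQLAlchemy engine or a sqlite3 connection, so the rest of the analysis is the same for both.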

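II. Data analysis

1. Inspecting the data

df.info()    # print column names, dtypes, and non-null counts
df.isnull()  # mark each cell True where the value is missing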
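
df.isnull() returns a full True/False table, which is hard to read for a large DataFrame. A common follow-up, sketched here on made-up data (the frame demo and its values are assumptions), is to sum the flags so that each column shows its number of missing values.

```python
import pandas as pd

# Hypothetical sample with some missing values
demo = pd.DataFrame({
    'title': ['Post A', 'Post B', None],
    'tags': ['Python', None, None],
})

# True counts as 1 when summed, so this gives the missing-value count per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```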

2. Cleaning the data

Keep only the title, create_date, and tags columns:

df = df.loc[:, ['create_date', 'title', 'tags']]

Sort by date, newest first:

df = df.sort_values(by='create_date', ascending=False)

Convert the column to a datetime type and set it as the index:

df['create_date'] = pd.to_datetime(df['create_date'])  # convert to datetime
df = df.set_index('create_date')  # use create_date as the index

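Select the articles from 2017, then pull out their tags and title columns:

df = df.loc['2017']  # partial-string indexing on the datetime index
tags = df['tags']
title = df['title']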
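
The cleaning steps can be checked end to end on a tiny frame. This is a sketch on made-up rows (the frame demo and its values are assumptions); it sorts the index in ascending order so the expected result is easy to verify by eye.

```python
import pandas as pd

# Hypothetical sample rows spanning two years
demo = pd.DataFrame({
    'create_date': ['2017-05-02', '2016-12-31', '2017-01-15'],
    'title': ['Post A', 'Post B', 'Post C'],
    'tags': ['Python', '职场', 'Linux'],
})

demo['create_date'] = pd.to_datetime(demo['create_date'])
demo = demo.set_index('create_date').sort_index()

# A year string selects every row whose timestamp falls in that year
demo_2017 = demo.loc['2017']
print(demo_2017['title'].tolist())
```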

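3. Converting data types

First convert the tags Series to an np.ndarray with np.array(), then turn that array into a list with tolist():

import numpy as np

tags_data = np.array(tags)  # Series -> np.ndarray
tags_list = tags_data.tolist()  # np.ndarray -> list
tags_text = "".join(tags_list)  # concatenate into one string
tags_text = tags_text.replace(',', '')  # remove commas
tags_text = tags_text.replace('/', '')  # remove slashes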
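
A small sketch of what this flattening produces, on made-up tag strings (demo_tags and its values are assumptions). It adds a dropna() call, which the original does not have, because str.join raises a TypeError if any tag is None. Note that stripping '/' also turns a tag such as C/C++ into CC++.

```python
import pandas as pd

# Hypothetical tag values; None stands in for an article without tags
demo_tags = pd.Series(['Python,爬虫', None, 'C/C++,Linux'])

# Drop missing values first, because str.join raises TypeError on None
demo_list = demo_tags.dropna().tolist()

# Same concatenation and separator removal as above
demo_text = ''.join(demo_list).replace(',', '').replace('/', '')
print(demo_text)
```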

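4. Chinese word segmentation

Use jieba to segment the text:

import jieba
import pandas as pd

jieba.add_word('C/C++')  # register C/C++ as a custom dictionary word
segment = jieba.lcut(tags_text)  # lcut returns the tokens as a list
words_df = pd.DataFrame({'segment': segment})
words_df.head()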

Count word frequencies:

# Count how often each token occurs, then sort from most to least frequent
words_stat = words_df.groupby('segment').size().reset_index(name='count')
words_stat = words_stat.sort_values(by='count', ascending=False)
words_stat.head()
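
The same counts can be cross-checked with the standard library. This is an alternative sketch, not the approach used in this post: it swaps the pandas groupby for collections.Counter, and the token list is made up for illustration.

```python
from collections import Counter

# Hypothetical tokens, standing in for the list returned by jieba.lcut
demo_segment = ['Python', '爬虫', 'Python', 'Linux', 'Python', '爬虫']

word_counts = Counter(demo_segment)

# most_common(n) returns the n most frequent tokens with their counts
top_words = word_counts.most_common(2)
print(top_words)
```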

5. Displaying the data as a word cloud

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # word cloud package

matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)

# Render the word cloud; simhei.ttf is a font that can draw Chinese characters
wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
# Map each of the 1000 most frequent words to its count
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
wordcloud = wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)
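
fit_words expects a mapping from each word to its frequency. The sketch below builds such a mapping from a small made-up frequency table (words_stat_demo, its column names, and its values are assumptions), without needing the wordcloud package or a font file.

```python
import pandas as pd

# Hypothetical frequency table, already sorted from most to least frequent
words_stat_demo = pd.DataFrame({
    'segment': ['Python', '爬虫', 'Linux'],
    'count': [3, 2, 1],
})

# Keep the top rows and pair each word with its count
top_n = words_stat_demo.head(2)
word_frequencies = dict(zip(top_n['segment'].tolist(), top_n['count'].tolist()))
print(word_frequencies)
```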

The resulting word cloud shows how heavily each tag was used in Jobbole's 2017 articles.