
Homework Notes 13: bili user

Author: ChZ_CC | Published 2017-02-22 16:37

Goal: crawl Bilibili user profiles and analyze the distribution of regions, follower counts, and play counts. (In the end only a slice of the earliest users was crawled, and bugs remain.)

For convenient storage, the data goes into a MySQL database.

Creating the MySQL database

Create a new database:

create database bili;

Create the data table:

CREATE TABLE userinfo (
    id          BIGINT      NOT NULL    AUTO_INCREMENT,
    uid         BIGINT,
    name        VARCHAR(255),
    sex         CHAR(8),
    regtime     DATETIME,
    coins       INT,
    birthday    DATE,
    fans        INT,
    attention   INT,
    place       VARCHAR(80),
    playNum     BIGINT,
    level       INT,
    exp         INT,
    created     TIMESTAMP       DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY(id)
);

Character-set setup: switch all string columns to utf8mb4, which is needed for emoji and other 4-byte characters that show up in usernames.

ALTER DATABASE bili CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE userinfo CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE userinfo CHANGE name name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE userinfo CHANGE sex sex CHAR(8) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE userinfo CHANGE place place VARCHAR(80) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
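
To see why utf8mb4 matters here, note that MySQL's legacy `utf8` charset stores at most 3 bytes per character, while emoji encode to 4 bytes in real UTF-8 ("mb4" = max bytes 4). A quick check in Python:

```python
# Legacy MySQL "utf8" caps characters at 3 bytes; emoji need 4.
# Inserting an emoji username into a utf8 column fails or mangles data,
# which is why the table is converted to utf8mb4 above.
name_with_emoji = "bili用户😀"

for ch in name_with_emoji:
    print(ch, len(ch.encode("utf-8")))

# An ASCII letter is 1 byte, a CJK character 3 bytes,
# and the emoji alone already exceeds legacy utf8's 3-byte limit:
print(len("😀".encode("utf-8")))   # 4
```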

Crawler code

import requests
import json
import pymysql
from multiprocessing.dummy import Pool as ThreadPool
import time
import random

user = "user"
passwd = "password"
db = "bili"
uids = range(10000)        # crawl the first 10000 uids

def get_data(mid):
    # Mimic the browser's XHR request from the user's space page
    header = {
        'Referer': 'http://space.bilibili.com/' + str(mid) + '/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Origin': 'http://space.bilibili.com',
        'Host': 'space.bilibili.com',
        'AlexaToolbar-ALX_NS_PH': 'AlexaToolbar/alx-4.0',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,ja;q=0.4',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
       }
    # '_' is a cache-busting timestamp in milliseconds
    payload = {'_': int(round(time.time() * 1000)), 'mid': mid}
#    time.sleep(random.random())    # uncomment to throttle requests
    try:
        jscontent = requests.post('http://space.bilibili.com/ajax/member/GetInfo',
                                  headers=header, data=payload).content
        jsDict = json.loads(jscontent.decode('utf-8'))
        jsData = jsDict['data']
        mid = jsData['mid']
        name = jsData['name']
        sex = jsData['sex']
        regtime = jsData['regtime']
        coins = jsData['coins']
        birthday = jsData['birthday']
        fans = jsData['fans']
        attention = jsData['attention']
        place = jsData['place']
        playNum = jsData['playNum']
        level = jsData['level_info']['current_level']
        exp = jsData['level_info']['current_exp']

        # regtime comes back as a Unix timestamp; convert for the DATETIME column
        regtime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(regtime))
        into_mysql([mid, name, sex, regtime, coins, birthday, fans,
                    attention, place, playNum, level, exp])
    except Exception:
        pass    # skip uids whose profile is missing or whose response is malformed
        #print(mid)

def into_mysql(data):
    # cur/conn are module-level globals created in __main__; this is fine
    # with ThreadPool(1) but not safe with more worker threads
    try:
        cur.execute('insert into userinfo (uid, name, sex, regtime, coins, birthday, '
                    'fans, attention, place, playNum, level, exp) '
                    'values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)', data)
        conn.commit()
    except Exception:
        pass

if __name__ == '__main__':
    # utf8mb4 on the connection too, to match the table (emoji in usernames)
    conn = pymysql.connect(host="localhost", user=user, passwd=passwd, db=db,
                           use_unicode=True, charset="utf8mb4")
    cur = conn.cursor()

    pool = ThreadPool(1)
    results = pool.map(get_data, uids)

    pool.close()
    pool.join()

    cur.close()
    conn.close()

Data analysis

As of 2017-02-15 13:44, Bilibili had 90,568,280 users, and the number keeps growing. I don't really understand multithreading or distributed crawling, so crawling from one machine is slow; worse, Bilibili has anti-crawler measures and throws up a CAPTCHA when requests come too frequently. So, making do, the analysis below uses the data collected so far.
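
The CAPTCHA problem above is ordinary rate limiting. One common mitigation, not in the original script, is retrying failed requests with exponential backoff plus jitter. A minimal sketch; `flaky` below is a hypothetical stand-in for the `requests.post` call in the crawler:

```python
import random
import time

def with_backoff(fn, retries=4, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff plus jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            # delays grow as base_delay * 1, 2, 4, ... with multiplicative jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Demo with a stand-in that fails twice, then succeeds:
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

print(with_backoff(flaky, base_delay=0.01))   # "ok" after two simulated failures
```

In the real crawler, `fn` would wrap the `requests.post` call, and `base_delay` would be a second or more to stay polite.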

Loading the data

import pandas as pd
import pymysql        # needed when falling back to the database below
import matplotlib.pyplot as plt
import seaborn as sns

try:
    data = pd.read_csv('D:/bili_user.csv', encoding='utf-8', index_col='id')
except FileNotFoundError:
    # No cached CSV yet: pull the rows out of MySQL and cache them
    user = "user"
    passwd = "password"
    db = "bili"
    conn = pymysql.connect(host="localhost", user=user, passwd=passwd, db=db,
                           use_unicode=True, charset="utf8mb4")
    cur = conn.cursor()
    cur.execute('select DISTINCT * from userinfo;')
    raw_data = cur.fetchall()
    cur.close()
    conn.close()

    columns = ['id', 'userID', 'name', 'sex', 'regtime', 'coins', 'birthday',
               'fans', 'attention', 'place', 'playNum', 'level', 'exp', 'created']
    # pd.DataFrame() has no index_col argument; set the index explicitly
    df = pd.DataFrame(list(raw_data), columns=columns).set_index('id')
    print(df.head())
    df.to_csv('D:/bili_user.csv', encoding='utf-8')
    data = pd.read_csv('D:/bili_user.csv', encoding='utf-8', index_col='id')

data.drop_duplicates('userID', inplace=True)
data['sex'].fillna("未填写", inplace=True)    # "未填写" = "not specified"
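
To make the two cleaning steps concrete: `drop_duplicates('userID')` keeps the first row per userID, and `fillna` replaces missing gender values. A toy example with made-up data in the same shape:

```python
import pandas as pd

# Toy frame mimicking the crawl output: userID 2 appears twice,
# and one user left the sex field empty (NaN/None).
toy = pd.DataFrame({
    'userID': [1, 2, 2, 3],
    'sex':    ['男', '女', '女', None],
    'fans':   [10, 20, 20, 30],
})

toy = toy.drop_duplicates('userID')        # keeps the first row per userID
toy['sex'] = toy['sex'].fillna('未填写')    # "未填写" = "not specified"

print(len(toy))               # 3 rows remain
print(toy['sex'].tolist())    # ['男', '女', '未填写']
```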

A bit over ten thousand rows in total:

Gender ratio

data.groupby(['sex']).count()['userID'].plot.pie()
plt.show()

These days the male-to-female ratio is actually fairly even, and users who leave gender unfilled make up only a small share. But since this crawl only reached users registered before 2010, the result here differs from the current picture.

Top 20 accounts by follower count

data.sort_values(by='fans', ascending=False, inplace=True)
data[['userID','name','fans']].head(20)
data['fans'].head(20).plot.bar()
plt.show()
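
Note that `sort_values(..., inplace=True)` reorders the entire frame just to display 20 rows. An alternative is `nlargest`, which returns the top rows without mutating the original; a toy example:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'fans': [5, 40, 15, 30],
})

# nlargest returns the 2 rows with the highest 'fans', df itself is untouched
top2 = df.nlargest(2, 'fans')
print(top2['name'].tolist())   # ['b', 'd']
print(df['name'].tolist())     # ['a', 'b', 'c', 'd'] -- original order intact
```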

'''
        userID       name               fans
id                                
18052   122879        敖厂长          2115125
17478   883968       暴走漫画         2045366
18231   221648     柚子木字幕组        1599162
17230   777536      LexBurner         1590006
18758   375375      伊丽莎白鼠         1529054
18947   486183       排骨教主          1276675
18368   585267   纯黑哥居然被用了       1222528
17668  1643718       山下智博          1180853
19309   423895    怕上火暴王老菊        1026379
19117   391679        A路人            867697
2151      7714    女孩为何穿短裙        590793
3841     11073     hanser             409310
4189     13046       少年Pi            378439
4373     14082         山新            195274
74          79     saber酱             162584
427        608        晚香玉           129534
2            2         碧诗            128235
9526     33696        Lov             115770
5801     19919       百合花开           114397
14011    44524    螺螺螺螺螺螺螺        108256
'''

Geographic distribution

datap = data[['userID', 'place']].dropna()
datap['place_s'] = datap['place'].str.split(' ').str.get(0)    # keep the province part
datap.groupby(['place_s']).count()['userID'].plot.bar()
plt.show()
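
The groupby-count above is equivalent to `value_counts` on the province prefix, which also comes back sorted by frequency, making the bar chart easier to read. A toy example (the "province city" shape of the `place` values is an assumption inferred from the space-split above):

```python
import pandas as pd

# Assumed shape: "place" values look like "广东 深圳" (province, space, city),
# or just the region name for places like "北京"
places = pd.Series(['广东 深圳', '广东 广州', '北京', '四川 成都'])
province = places.str.split(' ').str.get(0)   # no-space values pass through whole

counts = province.value_counts()   # sorted descending by default
print(counts.to_dict())            # {'广东': 2, '北京': 1, '四川': 1}
```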

Original link: https://www.haomeiwen.com/subject/hnsmwttx.html