Learning Data Analysis with Kaggle: YouTube Data, Who Got the Most Subscribers

Author: 大力SAMA | Published 2018-09-29 16:33

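Most projects on Kaggle are written in English, so I picked an analysis that is fairly easy to follow: a report on how video views, number of uploads and subscriber counts relate to each other across YouTube channels.
Link: https://www.kaggle.com/roshan77/youtube-data-who-got-the-most-subscribers/notebook
Both the code and the data can be downloaded from that page.

Packages, libraries and modules used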

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties  # needed to render Chinese text in plots
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import statsmodels.api as sm

font = FontProperties(fname=r"D:\anaconda\shirleylearn\cipintongji\simsun.ttf", size=14)  # font used for the Chinese labels

path = r'D:\kaggle\youtube\youtube data.csv'
df = pd.read_csv(path)

# Convert the columns to numeric values; entries that cannot be parsed become NaN.
# (convert_objects has been removed from pandas; pd.to_numeric replaces it.)
df['Subscribers'] = pd.to_numeric(df['Subscribers'], errors='coerce')
df['Video views'] = pd.to_numeric(df['Video views'], errors='coerce')
df['Video Uploads'] = pd.to_numeric(df['Video Uploads'], errors='coerce')
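
A minimal sketch of the numeric conversion using pd.to_numeric, which is available in current pandas releases. The sample values, including the '--' placeholder, are made up for illustration and are not taken from the Kaggle file.

```python
import pandas as pd

# Hypothetical sample: numbers stored as strings plus a non-numeric placeholder.
sample = pd.DataFrame({"Subscribers": ["18752951", "61196302", "--"]})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
sample["Subscribers"] = pd.to_numeric(sample["Subscribers"], errors="coerce")

n_missing = int(sample["Subscribers"].isna().sum())
print(sample["Subscribers"].dtype, n_missing)
```

Rows that end up as NaN here are the ones a later dropna() call would discard.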

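Author's words: Here I start with plotting some bar graphs showing top 20 in each kind of classification of the channels. First three are top 20 by their ranking, where their number of viewers, subscribers and video views are presented. The second three are top 20 based on each of the group themselves.

Summary: The analysis opens with a few bar charts. The first three show the subscribers, video views and uploads of the 20 top-ranked channels; the next three show the top 20 channels by subscribers, by video views and by uploads respectively.

1. Subscribers of the 20 top-ranked channels
"Rank" here means the rank YouTube assigns to the channel.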

df.head(20).plot.bar(x = "Channel name" , y = "Subscribers")
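plt.title("排名前20的频道订阅数量",fontproperties = font)  # with fontproperties set, the Chinese title renders correctly
plt.show()

The 1e7 at the top of the y-axis means the tick values are multiplied by 10,000,000 (seven zeros).

2. Video views of the 20 top-ranked channels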

df.head(20).plot.bar(x = "Channel name" , y = "Video views")
plt.title("排名前20的频道视频播放数量",fontproperties = font)
plt.show()

3. Uploads of the 20 top-ranked channels

df.head(20).plot.bar(x = "Channel name" , y = "Video Uploads")
plt.title("排名前20的频道视频发布数量",fontproperties = font)
plt.show()

4. Top 20 channels by subscribers

df.sort_values(by = ['Subscribers'],ascending = False).head(20).plot.bar(x = 'Channel name',y = 'Subscribers')  # ascending=False sorts in descending order
plt.title("订阅数量排名前20的频道", fontproperties = font)
plt.show()

5. Top 20 channels by video views

df.sort_values(by = ['Video views'],ascending = False).head(20).plot.bar(x = 'Channel name',y = 'Video views')
plt.title("播放数量排名前20的频道", fontproperties = font)
plt.show()

6. Top 20 channels by uploads

df.sort_values(by = ['Video Uploads'],ascending = False).head(20).plot.bar(x = 'Channel name',y = 'Video Uploads')
plt.title("视频发布数量排名前20的频道", fontproperties = font)
plt.show()

7. Channels sorted by subscribers

df.sort_values(by = ['Subscribers'], ascending = False).plot(x= "Channel name",y = 'Subscribers')
plt.xlabel("按订阅数排序",fontproperties = font)
plt.ylabel("订阅数量",fontproperties = font)
plt.show()

8. Channels sorted by video views

df.sort_values(by = ['Video views'], ascending = False).plot(x= "Channel name",y = 'Video views')
plt.xlabel("按播放数量排序",fontproperties = font)
plt.ylabel("播放数量",fontproperties = font)
plt.show()

9. Channels sorted by uploads

df.sort_values(by = ['Video Uploads'], ascending = False).plot(x= "Channel name",y = 'Video Uploads')
plt.xlabel("按视频发布数量排序",fontproperties = font)
plt.ylabel("视频发布数量",fontproperties = font)
plt.show()

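Takeaways from figures 7, 8 and 9:
Author's words: Here I am interested how all the channels in the list distribute in terms of subscribers, video uploads and subscribers going from maximum to minimum in each class. Interestingly there is huge peak at the top list and tend to gain a plateau for the other channels quickly.

Summary: In the sorted curves for subscribers, video views and uploads, only the first few channels have very high values; the values for the remaining channels quickly level off.

Next, the channel grade (Grade) is analyzed.
10. Pie chart of channel grades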

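grade_name = list(set(df['Grade']))  # set() removes duplicates; note that its order is arbitrary
df_by_grade = df.set_index(df['Grade'])
#print(df_by_grade.head())
count_grade = list()

for grade in grade_name:
    count_grade.append(len(df_by_grade.loc[[grade]]))  # count the rows for each grade; the double brackets [[grade]] are required
#print(count_grade)
#print(grade_name)
grade_name[-1] = "missing"  # relabel the blank grade ("\xa0") as "missing"; adjust the index to wherever it sits in your list

labels = grade_name
sizes = count_grade
explode1 = (0.2,0,0,0.2,0,0)  # a non-zero value pulls that wedge away from the rest so that it stands out
color_list = ['green','brown','gold','lightblue','blue','red']

# sizes is the data and radius is the pie radius. Without autopct, plt.pie returns (patches, texts):
# patches are the wedge objects (passed to the legend below) and texts are the label Text objects.
patches,texts = plt.pie(sizes,colors = color_list, explode = explode1,shadow = False, startangle = 90 ,radius = 3)
plt.legend(patches,labels,loc = "upper right")  # legend in the upper-right corner
plt.axis('equal')  # draw the pie as a circle
plt.title("频道评级饼图", fontproperties = font)
plt.show()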
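
The per-grade counts can also be obtained without a loop. The sketch below uses value_counts on a made-up Grade column; the grade labels and the "\xa0" blank are assumptions modeled on the comment in the code above, not values read from the dataset.

```python
import pandas as pd

# Hypothetical grades; "\xa0" (a non-breaking space) stands in for a blank grade.
grades = pd.Series(["A", "A", "B+", "\xa0", "A", "B+"], name="Grade")

# Strip whitespace, relabel blanks, then count each label in one call.
cleaned = grades.str.strip().replace("", "missing")
counts = cleaned.value_counts()

labels = counts.index.tolist()   # pie labels, ordered by descending count
sizes = counts.tolist()          # matching wedge sizes
print(labels, sizes)
```

Because the blank grade is relabeled by value, the "missing" label no longer depends on where that entry happens to sit in the list.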

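Use df.describe() to view the mean, standard deviation, quartiles and other summary statistics for subscribers, video views and uploads.

print(df.describe())

11. Box plots of subscribers, video views and uploads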

color ={"boxes":"gold","whiskers":"black", "medians":"black","caps":"black"}  # equivalent: dict(boxes="gold", whiskers="black", medians="black", caps="black")
df.plot.box(color = color,patch_artist=True)  # patch_artist=True fills the boxes with color
plt.yscale('log')  # the values are large, so a log scale is more suitable
plt.ylabel('Log count')
plt.show()

12. Correlation matrix of the variables

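plt.subplots(figsize=(8,5))  # figsize is in inches: 8 x 5 inches, i.e. 800 x 500 pixels at 100 dpi
sns.heatmap(df.select_dtypes('number').corr(),cmap = 'RdBu',annot=True)  # numeric columns only; cmap has many color options; annot=True prints the coefficients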
plt.title("变量相关性矩阵图", fontproperties = font)
plt.show()

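Analysis of figure 12:
Author's words: Looking at the plot below, it is seen that number of subscribers is positively correlated with the number of viewers. That is expected. But the number of subscribers is negatively correlated with the number of video uploaded by that channel. This might be surprising. The video channels attracting the larger number of viewers and subscribers are uploading smaller number of videos.
Summary: The matrix shows that subscribers are positively correlated with video views, as expected. The author says that subscribers are negatively correlated with uploads, which would be surprising. (In my plot the coefficient is 0.092, which is still positive, just very weak; I am not sure how the author reads it.) Overall, channels with more views and subscribers tend to upload fewer videos.

13. Scatter-plot matrix of the variables
Author's words: The data contains non numeric values. So if the cleaned data is presented on the correlation scatter plot matrix the above mentioned conclusion about the correlation of three variables is more evident.
Summary: The data contains non-numeric values. Once those are removed and the cleaned data is shown in a scatter-plot matrix, the conclusion above about the correlations among the three variables becomes more apparent.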

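df_clean = df.dropna()  # drop rows with missing values
sns.pairplot(df_clean)

Building a linear model
Author's words: Here I tried to make a linear model based on the data. I am trying to predict the number of subscribers given the number of video uploaded and number of video viewed. First started with the linear relation between two variables.
Summary: I try to build a linear model that predicts the number of subscribers from the number of uploads and the number of video views. The first step is to look at the linear relationship between pairs of variables.

14. Fitting a linear model with LinearRegression from sklearn.linear_model
The figure is a scatter plot of predicted versus actual values.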

df_clean = df.dropna()  # drop rows with missing values
X = df_clean[['Video Uploads','Video views']]
Y = df_clean[['Subscribers']]
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.2)  # random split: test_size=0.2 holds out 20% of the rows for testing
lm = LinearRegression()  # linear regression
lm.fit(X_train.dropna(),y_train.dropna())  # fit trains the model, predict makes predictions, score evaluates it
predictions = lm.predict(X_test)

plt.scatter(y_test,predictions,color="blue")  # scatter plot
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.show()

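Author's words: It is seen that there is already good correlation between the predicted value of the number of subscribers and the observed number of them in the test set. So the model is working satisfactorily for the data it never seen in the training.
Summary: On the test set, the predicted subscriber counts correlate well with the actual values, so the model fitted on the training set performs satisfactorily.

15. Residual plot (purpose: checking the model)

sns.residplot(x=y_test.values.ravel(), y=predictions.ravel(), color="g")  # a residual plot puts the residuals on the vertical axis and some other chosen quantity on the horizontal axis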
plt.ylabel("残差",fontproperties = font)
plt.xlabel('instances')
plt.title("标准残差图",fontproperties = font)

16. Error metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))  # MAE: mean absolute error; MSE: mean squared error; RMSE: root mean squared error
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
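
To make the three metrics concrete, here is a small sketch that computes them by hand with NumPy. The four actual and predicted values are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred           # [1, 0, -2, -1]
mae = np.mean(np.abs(errors))      # mean absolute error
mse = np.mean(errors ** 2)         # mean squared error
rmse = np.sqrt(mse)                # root mean squared error, in the same units as y

print(mae, mse, rmse)
```

MSE penalizes the single error of 2 more heavily than MAE does, which is why the RMSE comes out larger than the MAE here.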

17. Getting the model coefficients from LinearRegression
train_test_split draws a random split, so my training set differs from the author's and the coefficients differ as well. (Open question: would the results match after the model is improved?)

coefficients = pd.DataFrame(X.columns)  # a DataFrame built from X.columns, i.e. 'Video Uploads' and 'Video views'
coefficients['coefficients'] = lm.coef_[0]  # the fitted coefficients
print(coefficients)
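
One reason two runs give different coefficients is that the 80/20 split is drawn at random. The sketch below uses a seeded NumPy permutation as a stand-in for train_test_split (which accepts a random_state argument for the same purpose) to show that fixing the seed reproduces the split. The helper name split_indices and the seed values are hypothetical.

```python
import numpy as np

def split_indices(n_rows, test_size=0.2, seed=None):
    """Shuffle row indices and cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return order[n_test:], order[:n_test]   # (train indices, test indices)

train_a, test_a = split_indices(100, seed=42)
train_b, test_b = split_indices(100, seed=42)   # same seed: identical split
train_c, test_c = split_indices(100, seed=7)    # different seed: different split

print(len(train_a), len(test_a))
print(np.array_equal(test_a, test_b), np.array_equal(test_a, test_c))
```

Even with a fixed seed, matching the author's numbers would also require the same seed and library versions as the original notebook, which the post does not state.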

My result

Author's result

18. Building a linear model with statsmodels.api

model = sm.OLS(Y,X).fit()  # unlike the sklearn model above, this one is fitted on all of the data
print(model.summary())

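The parameters differ considerably from those of the previous model.
I am not sure why. I found someone online with the same question who eventually got matching results, but working through the English was tiring, so I have not read the thread closely yet.
Link: https://stackoverflow.com/questions/22054964/ols-regression-scikit-vs-statsmodels
I tried replacing X and Y in model = sm.OLS(Y,X).fit() with X_train.dropna() and y_train.dropna(), but that raised an error.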
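
One likely contributor to the gap (an assumption here, since the post does not confirm it): LinearRegression fits an intercept by default, while sm.OLS(Y, X) fits none unless a constant column is added, for example with sm.add_constant(X). The statsmodels fit above also uses all rows rather than the training subset. The sketch below uses plain NumPy least squares on made-up data as a stand-in for both libraries, to show how much the slopes move when the intercept is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y = 2*x1 + 0.5*x2 + 10 plus a little noise.
x1 = rng.uniform(1, 10, size=200)
x2 = rng.uniform(1, 10, size=200)
y = 2.0 * x1 + 0.5 * x2 + 10.0 + rng.normal(0, 0.1, size=200)

X_no_const = np.column_stack([x1, x2])                  # no intercept column
X_const = np.column_stack([np.ones_like(x1), x1, x2])   # intercept column first

coef_no_const, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)
coef_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("without intercept:", coef_no_const)   # slopes absorb the missing constant
print("with intercept:   ", coef_const)      # close to the true values 10, 2, 0.5
```

If this is indeed the cause, fitting both libraries on the same rows with a constant term included should bring the two sets of coefficients together.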

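Skewness of the data
Author's words: From the following three histogram, we can see that all three variables are highly positively skewed.
Summary: The three histograms below show that all three variables are strongly positively skewed. (Most channels have modest subscriber, view and upload counts.)

19. Histogram of subscribers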

df["Subscribers"].hist(bins = 200)
plt.ylabel("频道数量",fontproperties = font)
plt.xlabel("订阅数",fontproperties = font)
plt.title("订阅数分布直方图",fontproperties = font)
plt.show()

20. Histogram of video views

df["Video views"].hist(bins = 200)
plt.ylabel("频道数量",fontproperties = font)
plt.xlabel("播放数",fontproperties = font)
plt.title("播放数分布直方图",fontproperties = font)
plt.show()

21. Histogram of uploads

df["Video Uploads"].hist(bins = 200)
plt.ylabel("频道数量",fontproperties = font)
plt.xlabel("上传数",fontproperties = font)
plt.title("上传数分布直方图",fontproperties = font)
plt.show()

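Log transformation
Author's words: In view of the positive skewness of the data, simple log transformation could be a good choice to deal with.
Summary: Because the data are positively skewed, a simple log transformation is a reasonable way to handle them and makes the distributions easier to read. (np.log(subscribers) is the natural logarithm, base e, of the subscriber count.)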
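
As a rough illustration of why the log helps, the sketch below measures sample skewness before and after np.log. The counts are drawn from a log-normal distribution as a made-up stand-in for subscriber counts; the helper name skewness and the distribution parameters are assumptions.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean of the cubed z-scores."""
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(1)
# Made-up, right-skewed "subscriber counts" drawn from a log-normal distribution.
counts = rng.lognormal(mean=14.0, sigma=1.0, size=10_000)

skew_raw = skewness(counts)            # strongly positive
skew_log = skewness(np.log(counts))    # near zero after the transform
print(round(skew_raw, 2), round(skew_log, 2))
```

Real channel data will not be exactly log-normal, so the skewness after the transform need not be this close to zero.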

22. Histogram of log subscribers

np.log(df["Subscribers"]).hist(bins = 20)
plt.ylabel("频道数量",fontproperties = font)
plt.xlabel("订阅数",fontproperties = font)
plt.title("订阅数对数直方图",fontproperties = font)
plt.show()

23. Histogram of log video views

np.log(df["Video views"]).hist(bins = 20)
plt.ylabel("频道数量",fontproperties = font)
plt.xlabel("播放数",fontproperties = font)
plt.title("播放数对数直方图",fontproperties = font)
plt.show()

24. Histogram of log uploads

np.log(df["Video Uploads"]).hist(bins = 20)
plt.ylabel("频道数量",fontproperties = font)
plt.xlabel("上传数",fontproperties = font)
plt.title("上传数对数直方图",fontproperties = font)
plt.show()

Create a new DataFrame to hold the log-transformed values

df_log = pd.DataFrame()
df_log["Subscribers_log"] = np.log(df_clean["Subscribers"])
df_log["Video_views_log"] = np.log(df_clean["Video views"])
df_log["Video_Uploads_log"] = np.log(df_clean["Video Uploads"])

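Correlation analysis of the log-transformed values
25. Correlation matrix of the log-transformed variables

plt.subplots(figsize=(8,5))  # figsize is in inches: 8 x 5 inches, i.e. 800 x 500 pixels at 100 dpi
sns.heatmap(df_log.corr(),cmap = 'RdBu',annot=True)  # use df_log here, not df; annot=True prints the coefficients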
plt.title("对数相关性矩阵图", fontproperties = font)
plt.show()

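For comparison, the matrix before the log transformation is the one in figure 12.

Author's words: From the above correlation plot the correlation coefficient of the variables have not been changed after the log transformation. At least the positive correlation remains the positive and vice versa.

But if we look at the scatter plot below, visually the negative correlation between video uploads and subscribers seem to have gone. This is the effect of log transformation which is not to be confused thinking they have positive correlations.

Summary: According to the author, the correlation plot above shows that the correlation coefficients of the variables did not change after the log transformation. (Question: the actual coefficients clearly did change.) Correlations that were positive remain positive, and vice versa.
(The correlation involving uploads becomes larger; according to the author's note below, this is an effect of the log transformation.)

If we look at the scatter plot below, the negative correlation between uploads and subscribers appears to have vanished. This is an effect of the log transformation and should not be read as a positive correlation.

26. Scatter-plot matrix of the log-transformed values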

sns.pairplot(df_log)

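Building a linear model on the log-transformed values
The steps are the same as before: first fit a linear model with LinearRegression from sklearn.linear_model.
27. Scatter plot of predicted versus actual values (log scale)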

X2 = df_log[["Video_Uploads_log","Video_views_log"]]
Y2 = df_log[["Subscribers_log"]]

X2_train,X2_test,y2_train,y2_test = train_test_split(X2,Y2,test_size=0.2)

lm2 = LinearRegression()
lm2.fit(X2_train.dropna(),y2_train.dropna())

predictions2 = lm2.predict(X2_test)

plt.scatter(y2_test,predictions2,color = 'red')
plt.xlabel('Y Test')
plt.ylabel("预测值", fontproperties = font)
plt.title("预测值与实际值散点图", fontproperties = font)
plt.show()

28. Residual plot after the log transformation

sns.residplot(x=y2_test.values.ravel(), y=predictions2.ravel(), color='g')
plt.ylabel("残差",fontproperties = font)
plt.xlabel('instances')
plt.title("对数转换后标准残差图",fontproperties = font)

29. Error metrics

print('MAE:',metrics.mean_absolute_error(y2_test,predictions2))
print('MSE:',metrics.mean_squared_error(y2_test,predictions2))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y2_test,predictions2)))

30. Model coefficients

coefficients2 = pd.DataFrame(X2.columns)
coefficients2['coefficients'] = lm2.coef_[0]
print(coefficients2)

31. Fitting a linear model on the log-transformed values with statsmodels.api

model2 = sm.OLS(Y2,X2).fit()
predictions2 = model2.predict(X2_test)

print(model2.summary())

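Comparing the models before and after the log transformation
Model before the log transformation:
Y = a X_1 + b X_2 + c
Model after the log transformation:
log Y = p log X_1 + q log X_2 + r, so Y = exp(p log X_1 + q log X_2 + r) = X_1^p * X_2^q * e^r
(The source writes the right-hand side as a sum, X_1^p + X_2^q + e^r, and the code in the next section follows that sum form. The exponential of a sum is a product, so the product form is the exact back-transformation.)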

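32. Scatter plot of subscriber predictions before and after the log transformation

p = coefficients2['coefficients'][0]
q = coefficients2['coefficients'][1]

def pred_from_log(x,y):
    # Follows the source notebook's sum form. The exact inverse of the log model
    # would be np.exp(r) * x**p * y**q, with r the intercept lm2.intercept_[0].
    return x**p+y**q

vid_upl_test = np.array(X_test['Video Uploads'])  # x values for the function above; X2_test cannot be used here because the inputs must match those behind predictions
vid_view_test = np.array(X_test['Video views'])  # y values for the function above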

prediction_log = pred_from_log(vid_upl_test,vid_view_test)

plt.scatter(predictions,prediction_log,color='r',alpha=0.5)  # alpha: transparency
plt.xlabel("没有对数转换的预测值", fontproperties = font)
plt.ylabel("对数转换后的预测值", fontproperties = font)
plt.title("对数转换前后订阅数预测值散点图", fontproperties = font)
plt.show()
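
For reference, a sketch of how a model fitted on logs maps back to the original scale. It checks numerically that exp(p*log(x) + q*log(y) + r) equals the product x**p * y**q * e**r and not the sum x**p + y**q. The values of p, q, r and the inputs are made up; in the code above they would correspond to lm2.coef_[0] and lm2.intercept_[0].

```python
import numpy as np

# Hypothetical fitted values: two slopes and an intercept on the log scale.
p, q, r = 0.1, 0.8, -2.0

uploads = np.array([50.0, 1_000.0, 20_000.0])   # made-up inputs
views = np.array([1e6, 5e8, 2e9])

log_prediction = p * np.log(uploads) + q * np.log(views) + r

back_transformed = np.exp(log_prediction)              # undo the log
product_form = uploads ** p * views ** q * np.exp(r)   # the same thing, expanded
sum_form = uploads ** p + views ** q                   # not equivalent

print(np.allclose(back_transformed, product_form))
print(np.allclose(back_transformed, sum_form))
```

Any comparison between the two models' predictions depends on which of these forms is used for the back-transformation.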

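Author's words: The direct plot of the difference shows that log transformation tend to predict higher value than that without log if anything. There is no way it can predict lower though.

Summary: The figure shows that the predictions after the log transformation tend to be higher than those without it. (I do not fully understand the sentence "There is no way it can predict lower though.")

33. Difference between predictions before and after the log transformation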

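# predictions has shape (n, 1) because Y is a one-column DataFrame, while prediction_log has shape (n,).
# Flatten predictions first; otherwise the subtraction broadcasts to an (n, n) array.
plt.scatter(range(len(X_test)),predictions.ravel()-prediction_log,color = 'r',alpha=0.5)
plt.xlabel("测试数据量", fontproperties = font)
plt.ylabel("对数转换前的预测值与转换后的差值", fontproperties = font)
plt.show()

As first written, with predictions-prediction_log, this code raised ValueError: x and y must be the same size.
len() reports 922 for both range(len(X_test)) and predictions-prediction_log, but len() only counts the first dimension: predictions is (922, 1) and prediction_log is (922,), so their difference broadcasts to (922, 922).
I also asked about this on SegmentFault: https://segmentfault.com/q/1010000016562662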
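
The shape mismatch behind a "x and y must be the same size" error of this kind can be reproduced with NumPy alone. The sketch assumes, based on the code above, that predictions is an (n, 1) array (the target Y is a one-column DataFrame) and that prediction_log is a flat (n,) array; the small arrays are made up.

```python
import numpy as np

n = 5
column = np.arange(n, dtype=float).reshape(n, 1)   # shape (n, 1), like predict() output for a 2-D target
flat = np.ones(n)                                  # shape (n,)

broadcast_diff = column - flat             # broadcasts to shape (n, n)
elementwise_diff = column.ravel() - flat   # shape (n,)

print(len(broadcast_diff), broadcast_diff.shape)
print(len(elementwise_diff), elementwise_diff.shape)
```

Both results report the same len(), which is why checking lengths alone does not reveal the problem; comparing .shape does.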

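Author's figure

Conclusions
Author's words:
1. The number of subscribers is proportional to the number of views.
2. The number of subscribers is negatively correlated with the number of video uploads by the channel.
3. Linear model was tested for prediction of number of subscriber as a function of number of video uploads and number of video views.
4. Log transformation on the linear model gives the one sided biased prediction in comparison to the one without such transformation.

Summary:
1. Subscribers are positively correlated with video views.
2. Subscribers are negatively correlated with the number of videos the channel uploads.
3. A linear model that predicts subscribers from uploads and video views was tested and found to work.
4. Compared with the untransformed model, the log-transformed linear model gives predictions biased to one side. (Does that mean the predictions come out too high?)

My main question:
The author builds the linear model with a machine-learning method, LinearRegression from sklearn.linear_model, and the parameters used afterwards also come from that model. Yet the author also fits all of the data with statsmodels.api and reports its parameters. Are those parameters only there for comparison? The author does not explain why the two sets of parameters differ.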
