The value of exploratory data analysis lies in getting familiar with the dataset: what each file contains, what each field actually means, and how the features relate to one another. In a recommendation setting this mainly means analyzing the basic attributes of users, the basic attributes of articles, and the distributions of user-article interactions, all of which informs the later choice of recall strategies and the feature engineering.
Tip: when feature engineering and model tuning can no longer improve the score, it is worth coming back and re-examining the data from a fresh angle; it may spark ideas for further gains.
Import packages
In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rc('font', family='SimHei', size=13)
import os, gc, re, warnings, sys
warnings.filterwarnings("ignore")
Read the data
In [2]:
path = './data_raw/'

##### train
trn_click = pd.read_csv(path + 'train_click_log.csv')
#trn_click = pd.read_csv(path+'train_click_log.csv', names=['user_id','item_id','click_time','click_environment','click_deviceGroup','click_os','click_country','click_region','click_referrer_type'])
item_df = pd.read_csv(path + 'articles.csv')
item_df = item_df.rename(columns={'article_id': 'click_article_id'})  # rename so the key matches for the later merge
item_emb_df = pd.read_csv(path + 'articles_emb.csv')

##### test
tst_click = pd.read_csv(path + 'testA_click_log.csv')
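The rename of `article_id` to `click_article_id` exists so the article table shares a join key with the click logs. A toy sketch (made-up values, not the competition data) of the left merge it enables:

```python
import pandas as pd

# hypothetical click log and article table
clicks = pd.DataFrame({'user_id': [1, 2], 'click_article_id': [10, 30]})
items = pd.DataFrame({'article_id': [10, 20], 'words_count': [150, 200]})

# align the join key, then left-join article attributes onto the clicks
items = items.rename(columns={'article_id': 'click_article_id'})
merged = clicks.merge(items, how='left', on='click_article_id')
print(merged)  # article 30 has no match, so its words_count is NaN
```

A left join keeps every click row even when the article is missing from the article table; unmatched rows simply get NaN attributes.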
Data preprocessing
Compute each user's click rank and click count
In [3]:
# Rank each user's clicks by timestamp (rank 1 = most recent click)
trn_click['rank'] = trn_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)
tst_click['rank'] = tst_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)
In [5]:
# Count how many articles each user clicked and add it as a new column click_cnts
trn_click['click_cnts'] = trn_click.groupby(['user_id'])['click_timestamp'].transform('count')
tst_click['click_cnts'] = tst_click.groupby(['user_id'])['click_timestamp'].transform('count')
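On a tiny hypothetical log the two statements above behave like this: `rank` numbers each user's clicks from most recent to oldest, while `transform('count')` broadcasts the per-user total back onto every row.

```python
import pandas as pd

# hypothetical click log: user 1 clicked three times, user 2 once
df = pd.DataFrame({
    'user_id':         [1, 1, 1, 2],
    'click_timestamp': [100, 300, 200, 50],
})

# rank 1 = the user's most recent click (descending timestamp)
df['rank'] = df.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)

# per-user click count, repeated on every row of that user
df['click_cnts'] = df.groupby(['user_id'])['click_timestamp'].transform('count')
print(df)
```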
Data overview
User click log (training set)
In [8]:
trn_click = trn_click.merge(item_df, how='left', on=['click_article_id'])
trn_click.head()
Out[8]:
   user_id  click_article_id  click_timestamp  click_environment  click_deviceGroup  click_os  click_country  click_region  click_referrer_type  rank  click_cnts  category_id  created_at_ts  words_count
0   199999            160417    1507029570190                  4                  1        17              1            13                    1    11          11          281  1506942089000          173
1   199999              5408    1507029571478                  4                  1        17              1            13                    1    10          11            4  1506994257000          118
2   199999             50823    1507029601478                  4                  1        17              1            13                    1     9          11           99  1507013614000          213
3   199998            157770    1507029532200                  4                  1        17              1            25                    5    40          40          281  1506983935000          201
4   199998             96613    1507029671831                  4                  1        17              1            25                    5    39          40          209  1506938444000          185
Meaning of each field in train_click_log.csv:
user_id: unique identifier of the user
click_article_id: unique identifier of the article the user clicked
click_timestamp: timestamp of the click
click_environment: environment in which the click happened
click_deviceGroup: device group used for the click
click_os: operating system at the time of the click
click_country: the user's country at the time of the click
click_region: the user's region at the time of the click
click_referrer_type: referrer type (source) of the article at the time of the click
In [9]:
# Overview of the user click log
trn_click.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1112623 entries, 0 to 1112622
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 1112623 non-null int64
1 click_article_id 1112623 non-null int64
2 click_timestamp 1112623 non-null int64
3 click_environment 1112623 non-null int64
4 click_deviceGroup 1112623 non-null int64
5 click_os 1112623 non-null int64
6 click_country 1112623 non-null int64
7 click_region 1112623 non-null int64
8 click_referrer_type 1112623 non-null int64
9 rank 1112623 non-null int32
10 click_cnts 1112623 non-null int64
11 category_id 1112623 non-null int64
12 created_at_ts 1112623 non-null int64
13 words_count 1112623 non-null int64
dtypes: int32(1), int64(13)
memory usage: 123.1 MB
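Most of the 123 MB footprint comes from the int64 columns. As an optional aside (not part of the original notebook), the small-valued categorical-style columns can be downcast to shrink memory; a sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# toy stand-in for trn_click: both columns hold small integers stored as int64
df = pd.DataFrame({
    'click_environment': np.array([4, 4, 2, 1], dtype=np.int64),
    'click_os': np.array([17, 17, 2, 20], dtype=np.int64),
})

before = df.memory_usage(deep=True).sum()
for col in df.columns:
    # to_numeric with downcast picks the smallest integer dtype that fits the values
    df[col] = pd.to_numeric(df[col], downcast='integer')
after = df.memory_usage(deep=True).sum()
print(before, after)
```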
In [8]:
trn_click.describe()
Out[8]:
             user_id  click_article_id  click_timestamp  click_environment  click_deviceGroup      click_os  click_country  click_region  click_referrer_type           rank    click_cnts
count   1.112623e+06      1.112623e+06     1.112623e+06       1.112623e+06       1.112623e+06  1.112623e+06   1.112623e+06  1.112623e+06         1.112623e+06  518010.000000  1.112623e+06
mean    1.221198e+05      1.951541e+05     1.507588e+12       3.947786e+00       1.815981e+00  1.301976e+01   1.310776e+00  1.813587e+01         1.910063e+00      15.521785  1.323704e+01
std     5.540349e+04      9.292286e+04     3.363466e+08       3.276715e-01       1.035170e+00  6.967844e+00   1.618264e+00  7.105832e+00         1.220012e+00      33.957702  1.631503e+01
min     0.000000e+00      3.000000e+00     1.507030e+12       1.000000e+00       1.000000e+00  2.000000e+00   1.000000e+00  1.000000e+00         1.000000e+00       1.000000  2.000000e+00
25%     7.934700e+04      1.239090e+05     1.507297e+12       4.000000e+00       1.000000e+00  2.000000e+00   1.000000e+00  1.300000e+01         1.000000e+00       4.000000  4.000000e+00
50%     1.309670e+05      2.038900e+05     1.507596e+12       4.000000e+00       1.000000e+00  1.700000e+01   1.000000e+00  2.100000e+01         2.000000e+00       8.000000  8.000000e+00
75%     1.704010e+05      2.777120e+05     1.507841e+12       4.000000e+00       3.000000e+00  1.700000e+01   1.000000e+00  2.500000e+01         2.000000e+00      18.000000  1.600000e+01
max     1.999990e+05      3.640460e+05     1.510603e+12       4.000000e+00       5.000000e+00  2.000000e+01   1.100000e+01  2.800000e+01         7.000000e+00     938.000000  2.410000e+02
In [9]:
# The training set contains 200,000 users
trn_click.user_id.nunique()
Out[9]:
200000
In [52]:
trn_click.groupby('user_id')['click_article_id'].count().min()  # every user in the training set clicked at least two articles
Out[52]:
2
Plot histograms to get a rough view of the basic attribute distributions
In [10]:
plt.figure()
plt.figure(figsize=(15, 20))
i = 1
for col in ['click_article_id', 'click_timestamp', 'click_environment', 'click_deviceGroup',
            'click_os', 'click_country', 'click_region', 'click_referrer_type', 'rank', 'click_cnts']:
    plot_envs = plt.subplot(5, 2, i)
    i += 1
    v = trn_click[col].value_counts().reset_index()[:10]
    fig = sns.barplot(x=v['index'], y=v[col])
    for item in fig.get_xticklabels():
        item.set_rotation(90)
    plt.title(col)
plt.tight_layout()
plt.show()
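A side note on the `value_counts().reset_index()[:10]` pattern used above: it keeps only the ten most frequent values per column, and the column names produced by `reset_index` changed across pandas versions (older versions yield a column literally named 'index', which is what `sns.barplot(x=v['index'], ...)` relies on). A small version-independent sketch on a toy column:

```python
import pandas as pd

# toy column standing in for trn_click[col]
s = pd.Series([4, 4, 4, 2, 2, 1])

# most frequent values first; [:10] would cap the list at ten entries
v = s.value_counts().reset_index()[:10]

# name the two columns explicitly so the code works on both old and new pandas
v.columns = ['value', 'count']
print(v)
```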
Judging from click_timestamp, the clicks are distributed fairly evenly over time, so no special handling is needed. The timestamps are 13-digit (millisecond) values; later they are converted to 10-digit (second) values to simplify computation. For click_environment, only 1,922 clicks (0.1%) happened in environment 1 and only 24,617 (2.3%) in environment 2; the rest (97.6%) happened in environment 4. For click_deviceGroup, device group 1 accounts for the majority of clicks (60.4%) and device group 3 for 36%.

User click log (test set)

In [87]:

tst_click = tst_click.merge(item_df, how='left', on=['click_article_id'])
tst_click.head()

Out[87]:

   user_id  click_article_id  click_timestamp  click_environment  click_deviceGroup  click_os  click_country  click_region  click_referrer_type  category_id  created_at_ts  words_count
0   249999            160974    1506959142820                  4                  1        17              1            13                    2          281  1506912747000          259
1   249999            160417    1506959172820                  4                  1        17              1            13                    2          281  1506942089000          173
2   249998            160974    1506959056066                  4                  1        12              1            13                    2          281  1506912747000          259
3   249998            202557    1506959086066                  4                  1        12              1            13                    2          327  1506938401000          219
4   249997            183665    1506959088613                  4                  1        17              1            15                    5          301  1500895686000          256

In [12]:

tst_click.describe()

Out[12]:

             user_id  click_article_id  click_timestamp  click_environment  click_deviceGroup       click_os  click_country   click_region  click_referrer_type           rank     click_cnts    category_id  created_at_ts    words_count
count  518010.000000     518010.000000     5.180100e+05      518010.000000      518010.000000  518010.000000  518010.000000  518010.000000        518010.000000  518010.000000  518010.000000  518010.000000   5.180100e+05  518010.000000
mean   227342.428169     193803.792550     1.507387e+12           3.947300           1.738285      13.628467       1.348209      18.250250             1.819614      15.521785      30.043586     305.324961   1.506883e+12     210.966331
std     14613.907188      88279.388177     3.706127e+08           0.323916           1.020858       6.625564       1.703524       7.060798             1.082657      33.957702      56.868021     110.411513   5.816668e+09      83.040065
min    200000.000000        137.000000     1.506959e+12           1.000000           1.000000       2.000000       1.000000       1.000000             1.000000       1.000000       1.000000       1.000000   1.265812e+12       0.000000
25%    214926.000000     128551.000000     1.507026e+12           4.000000           1.000000      12.000000       1.000000      13.000000             1.000000       4.000000      10.000000     252.000000   1.506970e+12     176.000000
50%    229109.000000     199197.000000     1.507308e+12           4.000000           1.000000      17.000000       1.000000      21.000000             2.000000       8.000000      19.000000     323.000000   1.507249e+12     199.000000
75%    240182.000000     272143.000000     1.507666e+12           4.000000           3.000000      17.000000       1.000000      25.000000             2.000000      18.000000      35.000000     399.000000   1.507630e+12     232.000000
max    249999.000000     364043.000000     1.508832e+12           4.000000           5.000000      20.000000      11.000000      28.000000             7.000000     938.000000     938.000000     460.000000   1.509949e+12    3082.000000

We can see that the users in the training and test sets are completely disjoint: training-set user IDs run from 0 to 199999, while test set A's user IDs run from 200000 to 249999.

In [13]:

# The test set contains 50,000 users
tst_click.user_id.nunique()

Out[13]:

50000

In [51]:

tst_click.groupby('user_id')['click_article_id'].count().min()  # note: the test set contains users who clicked only one article

Out[51]:

1

News article table

In [10]:

# Browse the article dataset
item_df.head().append(item_df.tail())

Out[10]:

        click_article_id  category_id  created_at_ts  words_count
0                      0            0  1513144419000          168
1                      1            1  1405341936000          189
2                      2            1  1408667706000          250
3                      3            3  1408468313000          230
4                      4            1  1407071171000          162
364042            364042          460  1434034118000          144
364043            364043          460  1434148472000          463
364044            364044          460  1457974279000          177
364045            364045          460  1515964737000          126
364046            364046          460  1505811330000          479

In [25]:

item_df['words_count'].value_counts()

Out[25]:

176     3485
182     3480
179     3463
178     3458
174     3456
        ... 
845        1
710        1
965        1
847        1
1535       1
Name: words_count, Length: 866, dtype: int64

In [26]:

print(item_df['category_id'].nunique())  # 461 article categories
item_df['category_id'].hist()

In [15]:

item_df.shape  # 364047 articles

Out[15]:

(364047, 4)

News article embedding representation

In [16]:

item_emb_df.head()
Out[16]:

   article_id     emb_0     emb_1     emb_2  ...   emb_247   emb_248   emb_249
0           0 -0.161183 -0.957233 -0.137944  ... -0.231686  0.597416  0.409623
1           1 -0.523216 -0.974058  0.738608  ...  0.182828  0.397090 -0.834364
2           2 -0.619619 -0.972960 -0.207360  ... -0.447580  0.805932 -0.285284
3           3 -0.740843 -0.975749  0.391698  ... -0.537838  0.243541 -0.885329
4           4 -0.279052 -0.972315  0.685374  ... -0.424061  0.185484 -0.580292

5 rows × 251 columns

In [17]:

item_emb_df.shape

Out[17]:

(364047, 251)

Data analysis

Repeated clicks by users

In [27]:

##### merge
user_click_merge = trn_click.append(tst_click)

In [28]:

# Repeated clicks: how many times each user clicked each article
user_click_count = user_click_merge.groupby(['user_id', 'click_article_id'])['click_timestamp'].agg({'count'}).reset_index()
user_click_count[:10]

Out[28]:

   user_id  click_article_id  count
0        0             30760      1
1        0            157507      1
2        1             63746      1
3        1            289197      1
4        2             36162      1
5        2            168401      1
6        3             36162      1
7        3             50644      1
8        4             39894      1
9        4             42567      1

In [33]:

user_click_count[user_click_count['count'] > 7]

Out[33]:

        user_id  click_article_id  count
311242    86295             74254     10
311243    86295             76268     10
393761   103237            205948     10
393763   103237            235689     10
576902   134850             69463     13

In [32]:

user_click_count['count'].unique()

Out[32]:

array([ 1,  2,  4,  3,  6,  5, 10,  7, 13], dtype=int64)

In [93]:

# Distribution of the per-(user, article) click counts
user_click_count.loc[:, 'count'].value_counts()

Out[93]:

1     1605541
2       11621
3         422
4          77
5          26
6          12
10          4
7           3
13          1
Name: count, dtype: int64

We can see that 1,605,541 user-article pairs (about 99.2%) were clicked only once; only a tiny minority of users clicked the same article repeatedly. This could also be turned into a standalone feature.

Analysis of changes in a user's click environment

In [35]:

def plot_envs(df, cols, r, c):
    plt.figure()
    plt.figure(figsize=(10, 5))
    i = 1
    for col in cols:
        plt.subplot(r, c, i)
        i += 1
        v = df[col].value_counts().reset_index()
        fig = sns.barplot(x=v['index'], y=v[col])
        for item in fig.get_xticklabels():
            item.set_rotation(90)
        plt.title(col)
    plt.tight_layout()
    plt.show()

In [36]:

# Check whether a user's click environment varies noticeably: randomly sample 10 users and inspect their distributions
sample_user_ids = np.random.choice(tst_click['user_id'].unique(), size=10, replace=False)
sample_users = user_click_merge[user_click_merge['user_id'].isin(sample_user_ids)]
cols = ['click_environment', 'click_deviceGroup', 'click_os', 'click_country', 'click_region', 'click_referrer_type']
for _, user_df in sample_users.groupby('user_id'):
    plot_envs(user_df, cols, 2, 3)

Distribution of the number of articles clicked per user

In [37]:

user_click_item_count = sorted(user_click_merge.groupby('user_id')['click_article_id'].count(), reverse=True)
plt.plot(user_click_item_count)

Out[37]:

[<matplotlib.lines.Line2D at 0x20d488b4470>]

The number of articles a user clicks reflects how active the user is.

In [97]:

# Top 50 users by click count
plt.plot(user_click_item_count[:50])

Out[97]:

[<matplotlib.lines.Line2D at 0x2339d302b00>]

The 50 most active users all have more than 100 clicks. Idea: we could define users with at least 100 clicks as active users. This is a simple heuristic; a fuller measure of activity would also take click time into account, and later we will judge user activity from both click count and click time.

In [98]:

# Users ranked 25000-50000 by click count
plt.plot(user_click_item_count[25000:50000])

Out[98]:

[<matplotlib.lines.Line2D at 0x233a04386a0>]

A very large number of users have two clicks or fewer; these users can be treated as inactive.

Article click-count analysis

In [38]:

item_click_count = sorted(user_click_merge.groupby('click_article_id')['user_id'].count(), reverse=True)

In [39]:

plt.plot(item_click_count)

Out[39]:

[<matplotlib.lines.Line2D at 0x20d48b0eda0>]

In [101]:

plt.plot(item_click_count[:100])

Out[101]:

[<matplotlib.lines.Line2D at 0x233a04f5a90>]

The 100 most-clicked articles each have more than 1,000 clicks.

In [102]:

plt.plot(item_click_count[:20])

Out[102]:

[<matplotlib.lines.Line2D at 0x233a0551d30>]

The 20 most-clicked articles each have more than 2,500 clicks. Idea: these can be defined as hot articles. Again this is a simple heuristic; later, article popularity will be measured using both click count and time.

In [103]:

plt.plot(item_click_count[3500:])

Out[103]:

[<matplotlib.lines.Line2D at 0x233a0591be0>]

Many articles were clicked only once or twice. Idea: these can be defined as cold articles.

Article co-occurrence frequency: how often two articles are clicked consecutively

In [104]:

tmp = user_click_merge.sort_values('click_timestamp')
tmp['next_item'] = tmp.groupby(['user_id'])['click_article_id'].transform(lambda x: x.shift(-1))
union_item = tmp.groupby(['click_article_id', 'next_item'])['click_timestamp'].agg({'count'}).reset_index().sort_values('count', ascending=False)
union_item[['count']].describe()

Out[104]:

               count
count  433597.000000
mean        3.184139
std        18.851753
min         1.000000
25%         1.000000
50%         1.000000
75%         2.000000
max      2202.000000

The statistics show an average co-occurrence count of about 3.18, with a maximum of 2202, which suggests that the articles a user reads are strongly related to one another.

In [106]:

# A quick plot for a visual check
x = union_item['click_article_id']
y = union_item['count']
plt.scatter(x, y)

Out[106]:

<matplotlib.collections.PathCollection at 0x2339ce36780>

Descriptive statistics of article word counts

In [112]:

user_click_merge['words_count'].describe()

Out[112]:

count    1.630633e+06
mean     2.043012e+02
std      6.382198e+01
min      0.000000e+00
25%      1.720000e+02
50%      1.970000e+02
75%      2.290000e+02
max      6.690000e+03
Name: words_count, dtype: float64

In [123]:

plt.plot(user_click_merge['words_count'].values)

From the plot above we can see that a small fraction of users read an extremely wide range of article types, while most users cover fewer than 20 news categories.

In [115]:

user_click_merge.groupby('user_id')['category_id'].nunique().reset_index().describe()

Out[115]:

             user_id    category_id
count  250000.000000  250000.000000
mean   124999.500000       4.573188
std     72168.927986       4.419800
min         0.000000       1.000000
25%     62499.750000       2.000000
50%    124999.500000       3.000000
75%    187499.250000       6.000000
max    249999.000000      95.000000

Distribution of the length of articles users read

Averaging the word counts of the articles each user clicked reflects whether a user is more interested in long or short articles.

In [116]:

plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True))

Out[116]:

[<matplotlib.lines.Line2D at 0x233e7b66978>]

A small fraction of users read articles with a very high average word count, and a small fraction with a very low one; most users prefer articles of roughly 200-400 words.

In [117]:

# Zoom into the range covering most users
plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True)[1000:45000])

Out[117]:

[<matplotlib.lines.Line2D at 0x23444758208>]

Most users read articles averaging under about 250 words.

In [119]:

# More detailed statistics
user_click_merge.groupby('user_id')['words_count'].mean().reset_index().describe()
Out[119]:

             user_id    words_count
count  250000.000000  250000.000000
mean   124999.500000     205.830189
std     72168.927986      47.174030
min         0.000000       8.000000
25%     62499.750000     187.500000
50%    124999.500000     202.000000
75%    187499.250000     217.750000
max    249999.000000    3434.500000

Analysis of users' click times

In [124]:

# Normalize the timestamps for nicer visualization
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
user_click_merge['click_timestamp'] = mm.fit_transform(user_click_merge[['click_timestamp']])
user_click_merge['created_at_ts'] = mm.fit_transform(user_click_merge[['created_at_ts']])
user_click_merge = user_click_merge.sort_values('click_timestamp')

In [125]:

user_click_merge.head()

Out[125]:

    user_id  click_article_id  click_timestamp  click_environment  click_deviceGroup  click_os  click_country  click_region  click_referrer_type  category_id  created_at_ts  words_count
18   249990            162300         0.000000                  4                  3        20              1            25                    2          281       0.989186          193
2    249998            160974         0.000002                  4                  1        12              1            13                    2          281       0.989092          259
30   249985            160974         0.000003                  4                  1        17              1             8                    2          281       0.989092          259
50   249979            162300         0.000004                  4                  1        17              1            25                    2          281       0.989186          193
25   249988            160974         0.000004                  4                  1        17              1            21                    2          281       0.989092          259

In [126]:

def mean_diff_time_func(df, col):
    df = pd.DataFrame(df, columns={col})
    df['time_shift1'] = df[col].shift(1).fillna(0)
    df['diff_time'] = abs(df[col] - df['time_shift1'])
    return df['diff_time'].mean()

In [127]:

# Mean time gap between a user's consecutive clicks
mean_diff_click_time = user_click_merge.groupby('user_id')['click_timestamp', 'created_at_ts'].apply(lambda x: mean_diff_time_func(x, 'click_timestamp'))

In [128]:

plt.plot(sorted(mean_diff_click_time.values, reverse=True))

Out[128]:

[<matplotlib.lines.Line2D at 0x233a1470e48>]

The plot shows that the time gap between clicks differs from user to user.

In [130]:

# Mean creation-time gap between consecutively clicked articles
mean_diff_created_time = user_click_merge.groupby('user_id')['click_timestamp', 'created_at_ts'].apply(lambda x: mean_diff_time_func(x, 'created_at_ts'))

In [132]:

plt.plot(sorted(mean_diff_created_time.values, reverse=True))

Out[132]:

[<matplotlib.lines.Line2D at 0x2343edf2780>]

The creation times of the articles a user clicks in succession also differ across users.

In [133]:

# Similarity between articles a user clicks in succession
item_idx_2_rawid_dict = dict(zip(item_emb_df['article_id'], item_emb_df.index))

In [134]:

del item_emb_df['article_id']

In [135]:

item_emb_np = np.ascontiguousarray(item_emb_df.values, dtype=np.float32)

In [136]:

# Randomly pick 15 users and look at the similarity between the articles they clicked in succession
sub_user_ids = np.random.choice(user_click_merge.user_id.unique(), size=15, replace=False)
sub_user_info = user_click_merge[user_click_merge['user_id'].isin(sub_user_ids)]
sub_user_info.head()

Out[136]:

        user_id  click_article_id  click_timestamp  click_environment  click_deviceGroup  click_os  click_country  click_region  click_referrer_type  category_id  created_at_ts  words_count
84588    218464            199198         0.007031                  4                  3        20              1            21                    2          323       0.989226          221
84589    218464            162655         0.007039                  4                  3        20              1            21                    2          281       0.989199          245
113519   207860            272143         0.010669                  4                  1        12              1            20                    1          399       0.989235          184
113520   207860            271045         0.010677                  4                  1        12              1            20                    1          399       0.989278          262
135951   200605            272143         0.019109                  4                  3         2             10            28                    2          399       0.989235          184

In [137]:

def get_item_sim_list(df):
    sim_list = []
    item_list = df['click_article_id'].values
    for i in range(0, len(item_list) - 1):
        emb1 = item_emb_np[item_idx_2_rawid_dict[item_list[i]]]
        emb2 = item_emb_np[item_idx_2_rawid_dict[item_list[i + 1]]]
        sim_list.append(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))
    sim_list.append(0)
    return sim_list

In [138]:

for _, user_df in sub_user_info.groupby('user_id'):
    item_sim_list = get_item_sim_list(user_df)
    plt.plot(item_sim_list)

The similarity between consecutively viewed articles fluctuates a lot for some users and very little for others, so this signal also has some discriminative power.

Summary

From this analysis we can take away several points that will be very helpful for the feature engineering and modeling that follow:

1. The user ids in the training and test sets do not overlap, i.e. the model never sees the test users during training.
2. The minimum number of clicked articles per user is 2 in the training set but 1 in test set A.
3. Users sometimes click the same article repeatedly, but only in the training set.
4. A single user's click environment is not always constant, so statistical features can be built over these fields later.
5. Click counts per user are highly discriminative and can be used to build user-activity features.
6. Click counts per article are also highly discriminative and can be used to build article-popularity features.
7. The articles a user reads are strongly correlated, so whether a user will be interested in an article depends heavily on the articles they clicked before.
8. The word counts of clicked articles vary considerably between users, reflecting preferences for article length.
9. The categories of clicked articles vary considerably between users, reflecting topic preferences.
10. The time gaps between clicks vary between users, reflecting preferences for article freshness.
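As a closing note, the cosine similarity used in `get_item_sim_list` can be sanity-checked on toy vectors; the helper below mirrors the dot-product-over-norms formula from that function:

```python
import numpy as np

def cosine_sim(emb1, emb2):
    # same formula as in get_item_sim_list: dot product divided by the product of norms
    return np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([2.0, 0.0])

print(cosine_sim(a, b))  # orthogonal vectors -> 0
print(cosine_sim(a, c))  # parallel vectors -> 1, regardless of magnitude
```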