Paper notes: Reinforcement Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, and Application (SIGKDD 2018)
The work stays within the DDPG framework to handle the search-session problem. The actor outputs a continuous action vector, and each candidate item is scored by the inner product of that vector with the item's embedding; items are ranked by this score. The state s is an accumulated representation of the session so far.
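The scoring step above can be sketched as follows. This is a minimal illustration, not the paper's code: the embeddings, dimensions, and the actor's output are all hypothetical stand-ins.

```python
# Sketch: a DDPG-style actor emits a continuous weight vector for state s,
# and each candidate item is scored by the inner product of that vector
# with the item's embedding; ranking sorts by this score.
import numpy as np

rng = np.random.default_rng(0)

embed_dim = 8
num_items = 5

# Hypothetical item embeddings and actor output actor(s) -> R^d.
item_embeddings = rng.normal(size=(num_items, embed_dim))
action_vector = rng.normal(size=embed_dim)

# Score each item by <action, embedding>, then rank descending.
scores = item_embeddings @ action_vector
ranking = np.argsort(-scores)
print(ranking)
```

In this formulation the "action" is not a discrete item choice but a point in embedding space, which is what makes the deterministic-policy-gradient machinery applicable.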
The paper includes a proof that the formulated problem satisfies the Markov property.
Problems with the reward: high variance and an unbalanced distribution. ("Firstly, the reward variance is high because the deal price m(h) normally varies over a wide range. Secondly, the immediate reward distribution of (s, a) is unbalanced because the conversion events lead by (s, a) occur much less frequently than the two other cases (i.e., abandon and continuation events) which produce zero rewards.") So the paper adopts a model-based-style remedy: it pre-trains, from logged data, the continuation probability c, the leave probability l, the purchase probability b, and the average deal price m, and uses these quantities in the actual Q-value computation. The Q-network update formula is as follows:
[Critic (Q-network) update formula — image not reproduced]
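A hedged sketch of how a Q target could be assembled from the pre-trained quantities described above. This is my reconstruction of the idea, not the paper's exact formula: purchase and leave events end the session, so only the continuation branch bootstraps, and the expected immediate reward is the purchase probability times the average deal price.

```python
# Hedged sketch (assumed structure, not the paper's code): the Q target
# is built from pre-trained model quantities instead of a sampled reward
# plus a target network.
#   b: purchase probability for (s, a)   -- pre-trained from logs
#   c: continuation probability for (s, a)
#   m: average deal price for (s, a)
#   q_next: Q(s', mu(s')) at the continuation successor state
def q_target(b, c, m, gamma, q_next):
    """Expected immediate reward b*m, plus the discounted continuation
    term gamma*c*q_next; leave/purchase branches contribute no future value."""
    return b * m + gamma * c * q_next

# Toy numbers: 2% purchase prob, avg price 50, 70% continuation.
y = q_target(b=0.02, c=0.7, m=50.0, gamma=0.9, q_next=1.5)
print(y)  # 1.0 expected reward + 0.945 bootstrap = 1.945
```

Using the expectation b·m directly, rather than a sampled deal price, is precisely what attacks the high-variance and imbalance problems quoted above.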
[Q-function derivation — images not reproduced]
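The lost derivation images presumably expand the Bellman equation over the three session outcomes (purchase, leave, continue). A plausible reconstruction, with $\gamma$ the discount factor and $\mu$ the actor policy (my notation, not necessarily the paper's):

```latex
\begin{aligned}
Q(s,a) &= \mathbb{E}\!\left[r(s,a)\right]
          + \gamma \sum_{s'} P(s'\mid s,a)\, Q\!\left(s', \mu(s')\right) \\
       &= b(s,a)\, m(s,a)
          + \gamma\, c(s,a)\, Q\!\left(s_{\text{next}}, \mu(s_{\text{next}})\right),
\end{aligned}
```

since the purchase branch yields expected reward $b\,m$ and terminates, the leave branch yields zero reward and terminates, and only the continuation branch (probability $c$) carries future value.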
Summary: the paper does not use a target network to fit Q(); instead, it first estimates the transition quantities from offline data, so the overall scheme reads as a hybrid of model-based and model-free RL.