现有收集到用户的页面点击行为日志数据
- 数据格式如下:
用户id, 点击时间
user_id click_time
A,2020-05-15 01:30:00
A,2020-05-15 01:35:00
A,2020-05-15 02:00:00
A,2020-05-15 03:00:10
A,2020-05-15 03:05:00
B,2020-05-15 02:03:00
B,2020-05-15 02:29:40
B,2020-05-15 04:00:00
要求:
- 间隔30分钟算两次会话
- 给所有会话按时间排序,标序号
思路: - 先根据
user_id
分组,click_time
排序,再用行函数查出上一行的时间beforeTime
; - 将当前的
click_time
减去beforeTime
,得到当前行时间比上一行时间差的分钟`minTime,大于30分钟设为1,其他设为0; - 计算第一行到当前行的和,设为
groupType
,结果可能是:0,0...1,1,1... - 根据
user_id
,groupType
重新排序,得到结果
select user_id,click_time,
row_number() over (partition by user_id,groupType order by click_time) rank
from
(select user_id,click_time,sum(minTime) over (partition by user_id order by click_time rows
between unbounded preceding and current row) as groupType
from(select user_id,click_time,
case when (unix_timestamp(click_time) -unix_timestamp(beforeTime))/60 >=30 then 1 else 0 end minTime
from(
select user_id,click_time,
row_number() over (partition by user_id order by click_time) rank,
nvl(lag(click_time) over (partition by user_id order by click_time),click_time) beforeTime
from user_clicklog)tmp1
)tmp2
)tmp3;
网友评论