表结构:
表结构方式一:
HIVESQL:
1.首先按照学生和上课时间排序(升序),并新增一列row值当作参考列
SELECT uid,`datetime`,teacher,id,ROW_NUMBER( ) OVER(order by `uid`,`datetime`) as rank from class_table结果如下:
2.利用lag(a,b,c)开窗函数,共三个参数,a为选中的滞后列,b为滞后的行数,c为空时默认值(可以不指定)
暂定连续上4节课
SELECT uid, `datetime`, teacher, `rank`
, lag(`rank`, 3) OVER (PARTITION BY uid, teacher ORDER BY `datetime`) AS `rank2`
FROM (
SELECT uid, `datetime`, teacher, ROW_NUMBER() OVER (ORDER BY `uid`, `datetime`) AS rank
FROM class_table
) tmp
结果如下:
3.通过判断列rank和rank2的关系即可找出相同老师学生连续上多节课的数据
SELECT *
FROM (
SELECT uid, `datetime`, teacher, `rank`
, lag(`rank`, 3) OVER (PARTITION BY uid, teacher ORDER BY `datetime`) AS `rank2`
FROM (
SELECT uid, `datetime`, teacher, ROW_NUMBER() OVER (ORDER BY `uid`, `datetime`) AS rank
FROM class_table
) tmp
) tmp2
WHERE `rank` = `rank2` + 3
调节滞后量即可查询其他连续数量
方式二:
原理都是一样的,pandas需要额外加一列自增列
data.sort_values(by=['uid','datetime'],inplace=True)
data['row'] = range(len(data))
data['rank'] = data.sort_values('datetime').groupby(['uid','teacher'])['row'].shift(3)
data[data['row']==data['rank']+3]
希望能帮助到大家
以上
网友评论