【需求背景】现有一个geek表,找出该geek对应的所有感兴趣的job_id (根据deal_type值为2或4的认为是感兴趣的)和该geek对应的所有不感兴趣的job_id,并保存到csv中。
【Hive表形式】
列名 值 数据类型。 (行转列后)
1 lid f1-filter-rule-4-ktbnroht-104637210-34-00000-def-3149_GroupB-def_u_def string
2 uid 104637210 string
3 job_id 7549352 string
4 boss_id 20116844 string
5 session_id f1-filter-rule-ktbnroht-104637210-34-00000-def-3149_GroupB-def_u_def string
6 time 1631115020214 string
7 rank 4 int
8 friend_source 1 int,comment:job来源 0:boss
9 page 1 int
10 exp_id 104637210 string
11 feature {"4813":"-1","4802":"-1","4810":"0.11200745","4738":"1","589":"440117","4535":"-1","4903":"0","4695":"34","4744":"0","4829":"0.009967752","4701":"0","4741":"0","4758":"0","128":"101280100","2691":"0","2813":"3","4694":"3411","4807":"-1","4818":"0.015157722","4799":"-1","4900":"55","4752":"0","4684":"0","4821":"-1","4904":"179","4734":"29","4824":"-1","4537":"-1","174":"3","4689":"0","217":"0","4800":"0.02","177":"5","4828":"-1","4754":"10041","4735":"6","4297":"-1","4757":"0","4681":"0","4270":"6","4814":"-1","4281":"329","4746":"722","379":"23","4803":"0.5","4291":"-1","4743":"0","4679":"29","4825":"-1","4685":"0","3351":"113.60119","2776":"1","4747":"0","4692":"0","4817":"-1","2834":"1549521091000","4806":"0.010416667","978":"20020622","473":"19","4690":"20","4798":"0.20689656","4699":"0","2135":"[290302,130303,290201]","4749":"90","2133":"3","4822":"-1","4750":"147","4742":"120","4320":"-1","4815":"-1","4827":"-1","2777":"1535382746000","4884":"-3.0","4901":"0","4688":"0","4477":"-1","4682":"50","4257":"-1","4680":"6","4804":"0.20689656","4736":"0","4805":"-1","4830":"-1","4279":"266","3515":"0","2825":"2","4693":"0","4686":"2441","3350":"23.56688","4678":"58","4691":"42","4816":"-1","4272":"-1","4319":"-3","4258":"3","4801":"-1","357":"1594814825000","4797":"0.5","4902":"179","210":"209","4826":"1.6333333","4335":"6","4811":"-1","4696":"0","4812":"-1","4739":"0","4748":"0","4536":"2","171":"206","4700":"0","4295":"-1","4737":"96","125":"130303","2253":"1","621":"1631104270725","4337":"29","570":"35","4698":"37","4745":"6446","130":"101","179":"3","4756":"125","590":"1289","569":"18","4819":"-1","4687":"266","4905":"0","4852":"1.0E-4","4751":"0","1054":"30","204":"0","4338":"3","366":"1631111701475","4286":"-1","4740":"0","4808":"-1","4820":"-1","380":"440113","4275":"32","4697":"0","4733":"58","4823":"-1","4299":"58","2543":"0","4683":"1","358":"1631104270725"} string
12 predict_score {} string,comment:模型得分:key: 模型tag
13 recall_score {} string,comment:粗排得分:key: 模型tag
14 recall_stag_tag null string
15 pk_class null string
16 deal_type 3 int
17 sample_rule null string
18 expo_num 0 long
19 cols {} string
20 other_rank 0 int
21 extend_msg {} string
22 recall_idx_no_map {} string
23 list_time 2021-09-08 23:30:20 string
24 detail 1 short,comment:是否detail
25 add_friend 1 short,comment:是否addfriend
26 double_chat 1 short,comment:是否detail
27 succ 0 short
28 valid 1 int
29 code_num 1 int
30 exp_num 50 int
31 det_num 13 int
32 biz_type f1aa_dz_grcd string
33 remain 0 int
34 recall string
35 sub_recall string
36 double_chat_1d 1 short,comment:1d口径,是否回聊double_chat
37 succ_1d 0 short,comment:1d口径,是否达成succ
38 deal_type_1d 3 short,comment:1d口径,exp:0,detail:1, add:2, chat:3
39 source f1aa_dz_grcd string
40 ds 2021-09-08 string
【解决过程】
第一个想法是把jobid按条件筛选放到list中:
select tmp_pos.uid,
get_json_object(tmp_pos.feature,'$.589'),
get_json_object(tmp_pos.feature,'$.4270'),
get_json_object(tmp_pos.feature,'$.2135'),
collect_list(tmp_pos.pos_jobids),
collect_list(tmp_neg.neg_jobids)
from(select uid,feature,job_id as pos_jobids from dianzhangai.dwd_dz_geekrec_sample where deal_type in ('2','4') and ds='2021-08-30')tmp_pos
join
(select uid,feature,job_id as neg_jobids from dianzhangai.dwd_dz_geekrec_sample where deal_type='0' and ds='2021-08-30')tmp_neg
on tmp_pos.uid=tmp_neg.uid
group by tmp_pos.uid,get_json_object(tmp_pos.feature,'$.589'),
get_json_object(tmp_pos.feature,'$.4270'),
get_json_object(tmp_pos.feature,'$.2135')
limit 20;
发现结果中有重复的元素,于是考虑collect_set
select tmp_pos.uid,
get_json_object(tmp_pos.feature,'$.589'),
get_json_object(tmp_pos.feature,'$.4270'),
get_json_object(tmp_pos.feature,'$.2135'),
collect_set(tmp_pos.pos_jobids),
collect_set(tmp_neg.neg_jobids)
from(select uid,feature,job_id as pos_jobids from dianzhangai.dwd_dz_geekrec_sample where deal_type in ('2','4') and ds='2021-08-30')tmp_pos
join
(select uid,feature,job_id as neg_jobids from dianzhangai.dwd_dz_geekrec_sample where deal_type='0' and ds='2021-08-30')tmp_neg
on tmp_pos.uid=tmp_neg.uid
group by tmp_pos.uid,get_json_object(tmp_pos.feature,'$.589'),
get_json_object(tmp_pos.feature,'$.4270'),
get_json_object(tmp_pos.feature,'$.2135')
limit 20;
结果如图所示,是个wrappedarray形式,无法存储到csv中,于是考虑转成str类型。
可以使用concat_ws函数,但是该函数仅支持string和array< string > 所以对于该列不是string的列,先转为string。concat_ws函数的意思是用逗号将所有的string组合到一起,用逗号分割显示:
select id,concat_ws(',',collect_list(cast (*** as string))) from table...
【最终正确写法】
select tmp_pos.uid,
get_json_object(tmp_pos.feature,'$.589'),
get_json_object(tmp_pos.feature,'$.4270'),
get_json_object(tmp_pos.feature,'$.2135'),
concat_ws(',',collect_set(cast (tmp_pos.pos_jobids as string))),
concat_ws(',',collect_set(cast (tmp_neg.neg_jobids as string)))
from(select uid,feature,job_id as pos_jobids from dianzhangai.dwd_dz_geekrec_sample where deal_type in ('2','4') and ds >= '2021-09-08' and ds <= '2021-09-09') tmp_pos
join
(select uid,feature,job_id as neg_jobids from dianzhangai.dwd_dz_geekrec_sample where deal_type='0' and ds >= '2021-09-08' and ds <= '2021-09-09') tmp_neg
on tmp_pos.uid=tmp_neg.uid
group by tmp_pos.uid,get_json_object(tmp_pos.feature,'$.589'),
get_json_object(tmp_pos.feature,'$.4270'),
get_json_object(tmp_pos.feature,'$.2135');
网友评论