一、hive正则表达式
三个函数:regexp,regexp_extract,regexp_replace,
1.A regexp B功能与RLIKE相同
select count(*) from olap_b_dw_hotelorder_f where create_date_wid not regexp '\\d{8}'
2.regexp_replace
语法:regexp_replace(target_string,pattern,substitute_string)
select regexp_replace("IloveYou","You","")
out>>>"Ilove"
3.regexp_extract
语法:regexp_extract(target_string,pattern,index)
index是选择哪个分组(即括号括起来的),从1开始。
疑难问题实战:
比如我有像这样的数据:
'[{"houses":0,"States":[{"state":"01","inventory":xxx,"houseCode":"xxx","houseName":"xxxx"},{"state":"01","inventory":xxx,"houseCode":"xxx","houseName":"xxxx"}]},{"houses":1,"States":[{"state":"07","inventory":xxx,"houseCode":"xxx","houseName":"xxxx"},{"state":"03","inventory":0,"houseCode":"xxx","houseName":"xxxx"}]}]'
我要从里面取出每个house的状态,然后综合各个house的状态,按照一定的逻辑去判断整体状态是什么。所以我要用正则去匹配。然后因为house的数目是不定的,有的数据的house可能有1一个,最多的有4个,所以要匹配出所有的state。因为数据库里一条数据只有一个,只能放一条匹配的数据,所以正则匹配的只显示第一次匹配的结果。
本来一筹莫展想额外建函数来写的,但是在公司小哥哥的指导下找到了解决方法。我们可以开多个列来储存多个匹配结果,这里最多有4个,就开4个。因为正则是优先匹配第一个,所以我在第一个表达式里只写一个组,第二表达式写两个组,以此类推。
然后把4个结果按照一定的逻辑判断整体的状态(用case when)。
select
regexp_extract(origin_info,'state":"(.*?)","inventory',1) as h1,
regexp_extract(origin_info,'state":"(.*?)","inventory.*state":"(.*?)","inventory',2) as h2,
regexp_extract(origin_info,'state":"(.*?)","inventory.*state":"(.*?)","inventory.*state":"(.*?)","inventory',3) as h3,
regexp_extract(origin_info,'state":"(.*?)","inventory.*state":"(.*?)","inventory.*state":"(.*?)","inventory.*state":"(.*?)","inventory',4) as h4
from test;
网友评论