HQL操作之DQL命令

作者: 一拳超疼 | 来源:发表于2020-07-15 14:16 被阅读0次

题记

本文部分资料来源于拉钩大数据高薪训练营

select 语法：

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT [offset,] rows]

SQL语句书写注意事项：

SQL语句对大小写不敏感
SQL语句可以写一行（简单SQL）也可以写多行(复杂SQL)
关键字不能缩写，也不能分行
各子句一般要分行
使用缩进格式，提高SQL语句的可读性(重要)

# 省略from的查询
select 8*88；# 计算
select current_date; # 查询系统当前时间
# 全表查询
select * from tablename;
# 查询特定的列
select column1,column2 from tablename;
# 使用函数
select count(*) from tablename; 
select avg(...) from tablename; 
select max(...) from tablename;
select min(...) from tablename; 
# 使用limit子句限制返回的行数
select * from emp limit 3;

where 子句

WHERE子句紧随FROM子句，使用WHERE子句，过滤不满足条件的数据；
where 子句中不能使用列的别名
where子句中会涉及到较多的比较运算和逻辑运算；

比较运算符

官方文档：https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

比较运算符	描述
=、==、<=>	等于
<>、!=	不等于
<、<=、>、>=	大于等于、小于等于
is [not] null	如果A等于NULL，则返回TRUE，反之返回FALSE。使用NOT关键字结果相反。
in(value1,value2, ......)	匹配列表中的值
LIKE	简单正则表达式，也称通配符模式。'x%' 表示必须以字母 'x' 开头；'%x'表示必须以字母'x'结尾；'%x%'表示包含有字母'x'，可以位于字符串任意位置。使用NOT关键字结果相反。% 代表匹配零个或多个字符(任意个字符)；_ 代表匹配一个字符。
[NOT] BETWEEN ... AND ...	范围的判断，使用NOT关键字结果相反。
RLIKE、REGEXP	基于java的正则表达式，匹配返回TRUE，反之返回FALSE。匹配使用的是JDK中的正则表达式接口实现的，因为正则也依据其中的规则。例如，正则表达式必须和整个字符串A相匹配，而不是只需与其字符串匹配。

备注：通常情况下NULL参与运算，返回值为NULL；NULL<=>NULL(相当equals)的结果为true

group by 子句

GROUP BY语句通常与聚组函数一起使用，按照一个或多个列对数据进行分组，对每个组进行聚合操作。
where子句针对表中的数据发挥作用；having针对查询结果（聚组以后的结
果）发挥作用
where子句不能有分组函数；having子句可以有分组函数
having只用于group by分组统计之后

# 计算emp表每个部门的平均工资
select deptno, avg(sal)
from emp
group by deptno;
# 计算emp每个部门中每个岗位的最高薪水
select deptno, job, max(sal)
from emp
group by deptno, job;
# 求每个部门的平均薪水大于2000的部门
select deptno, avg(sal)
from emp
group by deptno
having avg(sal) > 2000;

表连接

Hive支持通常的SQL JOIN语句，仅支持等值连接，不支持非等值连接。
JOIN 语句中经常会使用表的别名。使用别名可以简化SQL语句的编写，使用表名前
缀可以提高SQL的解析效率。
连接查询操作分为两大类：内连接和外连接，而外连接可进一步细分为三种类型：
1. 内连接: [inner] join
2. 外连接 (outer join)
  - 左外连接。 left [outer] join，左表的数据全部显示
  - 右外连接。 right [outer] join，右表的数据全部显示
  - 全外连接。 full [outer] join，两张表的数据都显示

表连接关系.png

# 内连接
select * from u1 join u2 on u1.id = u2.id;
# 左外连接
select * from u1 left join u2 on u1.id = u2.id;
# 右外连接
select * from u1 right join u2 on u1.id = u2.id;
# 全外连接
select * from u1 full join u2 on u1.id = u2.id;
# 多表连接
select *
from techer t left join course c on t.t_id = c.t_id
left join score s on s.c_id = c.c_id
left join student stu on s.s_id = stu.s_id;

笛卡尔积
满足以下条件将会产生笛卡尔集：

没有连接条件
连接条件无效
所有表中的所有行互相连接

set hive.strict.checks.cartesian.product=false;
select * from u1, u2;

排序子句

order by 全局排序

order by 子句出现在select语句的结尾；
order by子句对最终的结果进行排序；
默认使用升序(ASC)；可以使用DESC，跟在字段名之后表示降序；
order by 执行全局排序，因此只会使用一个reduce

hive (mydb)> select empno,deptno, sal+nvl(comm,0) salcomm from emp order by deptno, salcomm desc;

sort by （每个mr内部排序）

对于大规模数据而言order by效率低；
在很多业务场景，我们并不需要全局有序的数据，此时可以使用sort by；
sort by为每个reduce产生一个排序文件，在reduce内部进行排序，得到局部有序的结果；

select *from emp sort by sal desc;

distribute by (分区排序)

distribute by 将特定的行发送到特定的reducer中，便于后继的聚合与排序操作；
distribute by 类似于MR中的分区操作，可以结合sort by操作，使分区数据有序；
distribute by 要写在sort by之前；

# 因为deptno有三种值，分区排序需要设置reduces为3个
set mapreduce.job.reduces=3;
insert overwrite local directory '/root/data/hive_data/emp'
select empno, ename, job, deptno, sal + nvl(comm, 0) salcomm
from emp
distribute by deptno
sort by salcomm desc;

Cluster by

当distribute by 与 sort by是同一个字段时，可使用cluster by简化语法；
cluster by 只能是升序，不能指定排序规则；

# 语法上是等价的
select * from emp distribute by deptno sort by deptno;
select * from emp cluster by deptno;

函数

Hive内置函数：https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inFunctions

系统内置函数

查看系统函数

# 查看系统自带函数
show functions;
# 显示自带函数的用法
desc function upper;
desc function extended upper;

日期函数

# 当前日期
select current_date;
select unix_timestamp();
# 建议使用current_timestamp，有没有括号都可以
select current_timestamp();
# 时间戳转日期
select from_unixtime(1505456567);
select from_unixtime(1505456567, 'yyyyMMdd');
select from_unixtime(1505456567, 'yyyy-MM-dd HH:mm:ss');
# 日期转时间戳
select unix_timestamp('2019-09-15 14:23:00');
# 计算时间差
select datediff('2020-04-18','2019-11-21');
select datediff('2019-11-21', '2020-04-18');
# 查询当月第几天
select dayofmonth(current_date);
# 计算月末:
select last_day(current_date);
# 当月第1天:
select date_sub(current_date, dayofmonth(current_date)-1)
字符串函数
# 下个月第1天:
select add_months(date_sub(current_date,
dayofmonth(current_date)-1), 1)
# 字符串转时间（字符串必须为：yyyy-MM-dd格式）
select to_date('2020-01-01');
select to_date('2020-01-01 12:12:12');
# 日期、时间戳、字符串类型格式化输出标准时间格式
select date_format(current_timestamp(), 'yyyy-MM-dd HH:mm:ss');
select date_format(current_date(), 'yyyyMMdd');
select date_format('2020-06-01', 'yyyy-MM-dd HH:mm:ss');
# 计算emp表中，每个人的工龄
select *, round(datediff(current_date, hiredate)/365,1) workingyears from emp;

字符串函数

# 转小写。lower
select lower("HELLO WORLD");
# 转大写。upper
select lower(ename), ename from emp;
# 求字符串长度。length
select length(ename), ename from emp;
# 字符串拼接。 concat / ||
select empno || " " ||ename idname from emp;
select concat(empno, " " ,ename) idname from emp;
# 指定分隔符。concat_ws(separator, [string | array(string)]+)，第一个为连接符，其后都是连接字符串；
SELECT concat_ws('.', 'www', array('lagou', 'com'));
select concat_ws(" ", ename, job) from emp;
# 求子串。substr
SELECT substr('www.lagou.com', 5);
SELECT substr('www.lagou.com', -5);
SELECT substr('www.lagou.com', 5, 5);
# 字符串切分。split，注意 '.' 要转义
select split("www.lagou.com", "\\.");

数学函数

# 四舍五入 round
select round(314.15926);
select round(314.15926, 2);
select round(314.15926, -2);
# 向上取整。ceil
select ceil(3.1415926);
# 向下取整。floor
select floor(3.1415926);
# 其他数学函数包括：绝对值、平方、开方、对数运算、三角运算等
select abs(-1); # 绝对值
select pow(2,2); select power(2,2) # 平方
select sqrt(9); # 开方
select log(2,8); select log10(100); select log2(8); # 对数运算
sin, cos , tan # 三角运算

条件函数

# if (boolean testCondition, T valueTrue, T valueFalseOrNull)
select sal, if (sal<1500, 1, if (sal < 3000, 2, 3)) from emp;
# CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
# 将emp表的员工工资等级分类：0-1500、1500-3000、3000以上
select sal, if (sal<=1500, 1, if (sal <= 3000, 2, 3)) from emp;
# CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
# 复杂条件用 case when 更直观
select sal, case when sal<=1500 then 1
when sal<=3000 then 2
else 3 end sallevel
from emp;
# 以下语句等价
select ename, deptno,
case deptno when 10 then 'accounting'
when 20 then 'research'
when 30 then 'sales'
else 'unknown' end deptname
from emp;
# COALESCE(T v1, T v2, ...)。返回参数中的第一个非空值；如果所有值都为NULL，那么返回NULL
select sal, coalesce(comm, 0) from emp;
# isnull(a) isnotnull(a)
select * from emp where isnull(comm);
select * from emp where isnotnull(comm);
# nvl(T value, T default_value)
select empno, ename, job, mgr, hiredate, deptno, sal +
nvl(comm,0) sumsal
from emp;
# nullif(x, y) 相等为空，否则为a
SELECT nullif("b", "b"), nullif("b", "a");

UDTF函数
UDTF : User Defined Table-Generating Functions。用户定义表生成函数，一行输
入，多行输出。

lateral view explode()

# explode，炸裂函数
# 就是将一行中复杂的 array 或者 map 结构拆分成多行
hive (mydb)> select explode(array('A','B','C')) as col;
OK
col
A
B
C

hive (mydb)> select explode(map('a', 8, 'b', 88, 'c', 888));
OK
key value
a   8
b   88
c   888

# explode单独使用不能添加其他新列，因此 lateral view 常与表生成函数explode结合使用
# lateral view 语法：
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*

# lateral view 的基本使用
with t1 as (
select 'OK' cola, split('www.baidu.com', '\\.') colb
)
select cola, colc
from t1
lateral view explode(colb) t2 as colc;

example1

# 数据(uid tags)：
1 1,2,3
2 2,3
3 1,2
# 编写sql,实现如下结果：
1 1
1 2
1 3
2 2
2 3
3 1
3 2
# 建表加载数据
create table market(
id int,
storage string,
allocation string,
outdt string
)
row format delimited fields terminated by '\t';
load data local inpath '/hivedata/market.txt' into table market;
# SQL
select uid, tag
from t1
lateral view explode(split(tags,",")) t2 as tag;

example2

# 数据准备
lisi|Chinese:90,Math:80,English:70
wangwu|Chinese:88,Math:90,English:96
maliu|Chinese:99,Math:65,English:60
# 创建表
create table studscore(
name string,
score map<String,string>)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':';
# 加载数据
load data local inpath '/home/hadoop/data/score.dat' overwrite
into table studscore;
# 需求：找到每个学员的最好成绩
# 第一步，使用 explode 函数将map结构拆分为多行
select explode(score) as (subject, socre) from studscore;
#但是这里缺少了学员姓名，加上学员姓名后出错。下面的语句有是错的
select name, explode(score) as (subject, socre) from studscore;
# 第二步：explode常与 lateral view 函数联用，这两个函数结合在一起能关联其他字段
select name, subject, score1 as score from studscore
lateral view explode(score) t1 as subject, score1;

# 第三步：找到每个学员的最好成绩
# 写法一
select name, max(mark) maxscore
from (select name, subject, mark
from studscore lateral view explode(score) t1 as
subject, mark) t1
group by name;

#写法二
with tmp as (
select name, subject, mark
from studscore lateral view explode(score) t1 as subject, mark
)
select name, max(mark) maxscore
from tmp
group by name;

窗口函数 [重要]

窗口函数又名开窗函数，属于分析函数的一种。用于解决复杂报表统计需求的功能
强大的函数，很多场景都需要用到。

窗口函数用于计算基于组的某种聚合值，它和聚合函数的不同之处是：对于每个组返回多行，而聚合函数对于每个组只返回一行。
窗口函数指定了分析函数工作的数据窗口大小，这个数据窗口大小可能会随着行的
变化而变化。

over 关键字

使用窗口函数之前，一般要通过over()进行开窗。

# 普通聚合函数
select sum(sal) from emp;
# 使用窗口函数，查询员工，薪水，全体员工薪水和，每个员工占整体薪水和
select ename, sal, sum(sal) over() sal_sum,
concat(round(sal / sum(sal) over()*100, 1), "%") ratiosal
from emp;

注意：窗口函数是针对每一行数据的；如果over中没有参数，默认的是全部结果
集；

partition by 子句

在over窗口中进行分区，对某一列进行分区统计，窗口的大小就是分区的大小

# 查询员工姓名、薪水、部门薪水总和
select ename, sal, sum(sal) over(partition by deptno) sal_sum from emp;

order by 子句

order by 子句对输入的数据进行排序

# 增加了order by子句；sum：从分组的第一行到当前行求和
select ename, sal, deptno, 
sum(sal) over(partition by deptno order by sal) salsum
from emp;

Window子句

语法格式：

row between ... and ...

对窗口的更细的划分，使用如下语法：

unbounded preceding # 当前窗口的第一条数据
n preceding # 当前窗口下，前n条数据到当前行数据的所有数据
current row # 当前行数据
n following # 当前窗口下，当前行到后n行的数据
unbounded following # 当前窗口下，最后一条数据

example

select ename, sal, deptno,
sum(sal) over(partition by deptno order by sal 
rows between unbounded preceding and unbounded following) sal_sum,
from emp;

select ename, sal, deptno,
sum(sal) over(partition by deptno order by sal
rows between 1 preceding and 1 following) sal_sum
from emp;

排名函数

都是从1开始，生成数据项在分组中的排名。

row_number()。排名顺序增加不会重复；如1、2、3、4、... ...
rank()。排名相等会在名次中留下空位；如1、2、2、4、5、... ...
dense_rank()。排名相等会在名次中不会留下空位；如1、2、2、3、4、.

# 按照班级，使用3种方式对成绩进行排名
select cname, sname, score,
row_number() over (partition by cname order by score desc) rank1,
rank() over (partition by cname order by score desc) rank2,
dense_rank() over (partition by cname order by score desc) rank3
from t2;

# 计算班级前三名的学生
with t1 as(
select cname, sname, score,
rank() over(partition by cname order by score desc) rank_score
from rank
)
select * from t1 where rank_score <= 3;

序列函数

lag：返回当前数据行的上一行数据
lead：返回当前数据行的下一行数据
first_value：取分组内排序后，截止到当前行，第一个值
last_value：分组内排序后，截止到当前行，最后一个值
ntile：将分组的数据按照顺序切分成n片，返回当前切片值

# lag lead
hive (mydb)> select cid, ctime, pv,
           > lag(pv) over(partition by cid order by ctime) lag,
           > lead(pv) over(partition by cid order by ctime) lead
           > from userpv;

cid   ctime  pv lag   lead
cookie1 2019-04-10  1   NULL    5
cookie1 2019-04-11  5   1   7
cookie1 2019-04-12  7   5   3
cookie1 2019-04-13  3   7   2
cookie1 2019-04-14  2   3   4
cookie1 2019-04-15  4   2   4
cookie1 2019-04-16  4   4   NULL
cookie2 2019-04-10  2   NULL    3
cookie2 2019-04-11  3   2   5
cookie2 2019-04-12  5   3   6
cookie2 2019-04-13  6   5   3
cookie2 2019-04-14  3   6   9
cookie2 2019-04-15  9   3   7
cookie2 2019-04-16  7   9   NULL


# first_value last_value
select cid, ctime, pv,
first_value(pv) over(partition by cid order by ctime 
rows between unbounded preceding and unbounded following) first_value,
last_value(pv) over(partition by cid order by ctime
rows between unbounded preceding and unbounded following) last_value
from userpv;

cid ctime   pv  first_value last_value
cookie1 2019-04-10  1   1   4
cookie1 2019-04-11  5   1   4
cookie1 2019-04-12  7   1   4
cookie1 2019-04-13  3   1   4
cookie1 2019-04-14  2   1   4
cookie1 2019-04-15  4   1   4
cookie1 2019-04-16  4   1   4
cookie2 2019-04-10  2   2   7
cookie2 2019-04-11  3   2   7
cookie2 2019-04-12  5   2   7
cookie2 2019-04-13  6   2   7
cookie2 2019-04-14  3   2   7
cookie2 2019-04-15  9   2   7
cookie2 2019-04-16  7   2   7

# ntile
select cid, ctime, pv,
ntile(2) over(partition by cid order by ctime) ntile
from userpv;

cid ctime   pv  ntile
cookie1 2019-04-10  1   1
cookie1 2019-04-11  5   1
cookie1 2019-04-12  7   1
cookie1 2019-04-13  3   1
cookie1 2019-04-14  2   2
cookie1 2019-04-15  4   2
cookie1 2019-04-16  4   2
cookie2 2019-04-10  2   1
cookie2 2019-04-11  3   1
cookie2 2019-04-12  5   1
cookie2 2019-04-13  6   1
cookie2 2019-04-14  3   2
cookie2 2019-04-15  9   2
cookie2 2019-04-16  7   2

三个问题

求连续的问题

连续7天登陆

uid dt status(1 正常登录，0 异常)
1 2019-07-11 1
1 2019-07-12 1
1 2019-07-13 1
1 2019-07-14 1
1 2019-07-15 1
1 2019-07-16 1
1 2019-07-17 1
1 2019-07-18 1
2 2019-07-11 1
2 2019-07-12 1
2 2019-07-13 0
2 2019-07-14 1
2 2019-07-15 1
2 2019-07-16 0
2 2019-07-17 1
2 2019-07-18 0
3 2019-07-11 1
3 2019-07-12 1
3 2019-07-13 1
3 2019-07-14 0
3 2019-07-15 1
3 2019-07-16 1
3 2019-07-17 1
3 2019-07-18 1

# 创建表，导数
create table ulogin(
uid int,
dt date,
status int
)
row format delimited fields terminated by ' ';
# 计算连续7天登陆的用户
with t1 as(
select uid,
date_sub(dt, row_number() over(partition by uid order by dt)) gid
from ulogin where status = 1
)
select uid, count(*) count
from t1 group by uid,gid
having count >= 7;

查找班级前三名，并计算分数差

sid class score
1 1901 90
2 1901 90
3 1901 83
4 1901 60
5 1902 66
6 1902 23
7 1902 99
8 1902 67
9 1902 87

# 创建表
create table stu(
sno int,
class string,
score int
)row format delimited fields terminated by ' ';

# 实现
with t1 as(
select sno, class, score,
dense_rank() over(partition by class order by score desc) dr
from stu
)
select sno, class, score,
nvl(score - lag(score) over(partition by class order by score desc), 0) lagscore
from t1 where dr <= 3;

行列转换

id course
1 java
1 hadoop
1 hive
1 hbase
2 java
2 hive
2 spark
2 flink
3 java
3 hadoop
3 hive
3 kafka

# 创建表
create table rowline1(
id string,
course string
)row format delimited fields terminated by ' ';

# 得到结果如下(1表示选修，0表示未选修)
id java hadoop hive hbase spark flink kafka
1    1     1     1     1     0     0     0
2    1     0     1     0     1     1     0
3    1     1     1     0     0     0     1

# 实现
select id, 
sum(case course when 'java' then 1 else 0 end) java,
sum(case course when 'hadoop' then 1 else 0 end) hadoop,
sum(case course when 'hive' then 1 else 0 end) hive,
sum(case course when 'hbase' then 1 else 0 end)hbase,
sum(case course when 'spark' then 1 else 0 end) spark,
sum(case course when 'flink' then 1 else 0 end) flink,
sum(case course when 'kafka' then 1 else 0 end) kafka
from rowline1 group by id;

id1 id2 flag
a b 2
a b 1
a b 3
c d 6
c d 8
c d 8

# 编写sql实现如下结果
id1 id2 flag
a b 2|1|3
c d 6|8

# 创建表
create table rowline2(
id1 string,
id2 string,
flag int
) row format delimited fields terminated by ' ';

# 实现
select id1, id2,
concat_ws( "|", sort_array(collect_list(cast(flag as string)))) flag
from rowline2 group by id1,id2;

自定义函数

当 Hive 提供的内置函数无法满足实际的业务处理需要时，可以考虑使用用户自定义
函数进行扩展。用户自定义函数分为以下三类：

UDF（User Defined Function）。用户自定义函数，一进一出
- 继承org.apache.hadoop.hive.ql.exec.UDF
- 需要实现evaluate函数；evaluate函数支持重载
- UDF必须要有返回类型，可以返回null，但是返回类型不能为void
UDAF（User Defined Aggregation Function）。用户自定义聚集函数，多进一
出；类似于：count/max/min
UDTF（User Defined Table-Generating Functions）。用户自定义表生成函
数，一进多出；类似于：explode