美文网首页
Hive内部函数简介及查询语法

Hive内部函数简介及查询语法

作者: K_un | 来源:发表于2018-07-02 16:36 被阅读0次
    1.Hive内置函数:

    在Hive中 系统给我们内置了很多函数 具体参考官方地址

    • 看下官网给我们的介绍:
    SHOW FUNCTIONS; --查看所有内置函数
    DESCRIBE FUNCTION <function_name>; --查看某个函数的描述
    DESCRIBE FUNCTION EXTENDED <function_name>; --查看某个函数的具体使用方法
    
    hive> DESCRIBE FUNCTION case;
    OK
    CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
    Time taken: 0.006 seconds, Fetched: 1 row(s)
    hive> DESCRIBE FUNCTION EXTENDED case;
    OK
    CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
    Example:
     SELECT
     CASE deptno
       WHEN 1 THEN Engineering
       WHEN 2 THEN Finance
       ELSE admin
     END,
     CASE zone
       WHEN 7 THEN Americas
       ELSE Asia-Pac
     END
     FROM emp_details
    Time taken: 0.008 seconds, Fetched: 13 row(s)
    # DESCRIBE 可简写为desc
    hive> desc FUNCTION EXTENDED case;
    OK
    CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
    Example:
     SELECT
     CASE deptno
       WHEN 1 THEN Engineering
       WHEN 2 THEN Finance
       ELSE admin
     END,
     CASE zone
       WHEN 7 THEN Americas
       ELSE Asia-Pac
     END
     FROM emp_details
    Time taken: 0.009 seconds, Fetched: 13 row(s)
    

    下面我们了解下常用函数的使用方法:

    # 为了方便测试 我们创建常用的dual表
    hive> create table dual(x string);
    OK
    Time taken: 0.11 seconds
    hive> insert into table dual values('');
    Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2d
    Total jobs = 3
    Launching Job 1 out of 3
    ...
    OK
    Time taken: 29.535 seconds
    hive> select * from dual;
    OK
    
    Time taken: 0.147 seconds, Fetched: 1 row(s)
    # 测试当前时间
    hive> select current_date from dual;
    OK
    2018-07-02
    Time taken: 0.111 seconds, Fetched: 1 row(s)
    # 测试当前时间戳
    hive> select current_timestamp from dual;
    OK
    2018-07-02 15:03:28.919
    Time taken: 0.117 seconds, Fetched: 1 row(s)
    # 测试substr函数 用于截取字符串
    hive> desc function extended substr;
    OK
    substr(str, pos[, len]) - returns the substring of str that starts at pos and is of length len orsubstr(bin, pos[, len]) - returns the slice of byte array that starts at pos and is of length len
    Synonyms: substring
    pos is a 1-based index. If pos<0 the starting position is determined by counting backwards from the end of str.
    Example:
       > SELECT substr('Facebook', 5) FROM src LIMIT 1;
      'book'
      > SELECT substr('Facebook', -5) FROM src LIMIT 1;
      'ebook'
      > SELECT substr('Facebook', 5, 1) FROM src LIMIT 1;
      'b'
    Time taken: 0.016 seconds, Fetched: 10 row(s)
    hive> SELECT substr('helloworld',-5) FROM dual;
    OK
    world
    Time taken: 0.171 seconds, Fetched: 1 row(s)
    hive> SELECT substr('helloworld',5) FROM dual;
    OK
    oworld
    Time taken: 0.12 seconds, Fetched: 1 row(s)
    hive> SELECT substr('helloworld',5,3) FROM dual;
    OK
    owo
    Time taken: 0.142 seconds, Fetched: 1 row(s)
    # 测试函数concat 用于将字符连接起来
    hive> desc function extended concat_ws;
    OK
    concat_ws(separator, [string | array(string)]+) - returns the concatenation of the strings separated by the separator.
    Example:
      > SELECT concat_ws('.', 'www', array('facebook', 'com')) FROM src LIMIT 1;
      'www.facebook.com'
    Time taken: 0.019 seconds, Fetched: 4 row(s)
    hive> select concat_ws(".","192","168","199","151") from dual;
    OK
    192.168.199.151
    Time taken: 0.152 seconds, Fetched: 1 row(s)
    # 测试函数split 用于拆分
    hive> desc function extended split;
    OK
    split(str, regex) - Splits str around occurances that match regex
    Example:
      > SELECT split('oneAtwoBthreeC', '[ABC]') FROM src LIMIT 1;
      ["one", "two", "three"]
    Time taken: 0.021 seconds, Fetched: 4 row(s)
    hive> select split("192.168.199.151","\\.") from dual;
    OK
    ["192","168","199","151"]
    Time taken: 0.169 seconds, Fetched: 1 row(s)
    
    2.Hive查询语法:
    • 简单select语法:
    # 简单select语法
    hive> select * from emp where deptno=10;
    OK
    7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
    7839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    10
    7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
    Time taken: 0.899 seconds, Fetched: 3 row(s)
    hive> select * from emp where empno <= 7800;
    OK
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
    7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
    7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
    7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
    7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
    7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
    7788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    20
    Time taken: 0.277 seconds, Fetched: 8 row(s)
    hive> select * from emp where salary between 1000 and 1500;
    OK
    7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
    7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
    7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
    7876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    20
    7934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10
    Time taken: 0.187 seconds, Fetched: 5 row(s)
    hive> select * from emp limit 5;
    OK
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
    7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
    7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
    7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
    Time taken: 0.154 seconds, Fetched: 5 row(s)
    hive> select * from emp where empno in(7566,7499);
    OK
    7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
    7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
    Time taken: 0.153 seconds, Fetched: 2 row(s)
    hive> select * from emp where comm is not null;
    OK
    7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
    7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
    7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
    7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
    Time taken: 0.291 seconds, Fetched: 4 row(s)
    
    • 聚合函数及分组函数:
    # 聚合函数及分组函数
    #  max/min/count/sum/avg 特点:多进一出,进来很多条记录出去只有一条记录
    
    # 查询部门编号为10的有多少条记录
    hive> select count(1) from emp where deptno=10;
    Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
    Total jobs = 1
    ...
    OK
    3
    Time taken: 38.951 seconds, Fetched: 1 row(s)
    # 求最大工资,最小工资,平均工资,工资的和
    hive> select max(salary),min(salary),avg(salary),sum(salary) from emp;
    Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
    Total jobs = 1
    ...
    OK
    5000.0  800.0   2073.214285714286       29025.0
    Time taken: 23.748 seconds, Fetched: 1 row(s)
    
    # 分组函数 group by
    # 求部门的平均工资
    # 注:select中出现的字段,如果没有出现在组函数/聚合函数中,必须出现在group by里面,否则就会产生报错
    hive> select deptno,avg(salary) from emp group by deptno;
    Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
    Total jobs = 1
    ...
    OK
    10      2916.6666666666665
    20      2175.0
    30      1566.6666666666667
    Time taken: 36.502 seconds, Fetched: 3 row(s)
    # 求每个部门(deptno)、工作岗位(job)的最高工资(salary)
    hive> select deptno,job,max(salary) from emp group by deptno,job;
    Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
    Total jobs = 1
    ...
    OK
    10      CLERK   1300.0
    10      MANAGER 2450.0
    10      PRESIDENT       5000.0
    20      ANALYST 3000.0
    20      CLERK   1100.0
    20      MANAGER 2975.0
    30      CLERK   950.0
    30      MANAGER 2850.0
    30      SALESMAN        1600.0
    Time taken: 36.096 seconds, Fetched: 9 row(s)
    # 查询平均工资大于2000的部门(使用having子句限定分组查询)
    hive> select deptno,avg(salary) from emp group by deptno having avg(salary) >2000;
    Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
    Total jobs = 1
    ...
    OK
    10      2916.6666666666665
    20      2175.0
    Time taken: 24.71 seconds, Fetched: 2 row(s)
    
    # case when then end(不会跑mr)
    hive> select ename, salary, 
        > case
        > when salary > 1 and salary <= 1000 then 'LOWER'
        > when salary > 1000 and salary <= 2000 then 'MIDDLE'
        > when salary > 2000 and salary <= 4000 then 'HIGH'
        > ELSE 'HIGHEST'
        > end
        > from emp;
    OK
    SMITH   800.0   LOWER
    ALLEN   1600.0  MIDDLE
    WARD    1250.0  MIDDLE
    JONES   2975.0  HIGH
    MARTIN  1250.0  MIDDLE
    BLAKE   2850.0  HIGH
    CLARK   2450.0  HIGH
    SCOTT   3000.0  HIGH
    KING    5000.0  HIGHEST
    TURNER  1500.0  MIDDLE
    ADAMS   1100.0  MIDDLE
    JAMES   950.0   LOWER
    FORD    3000.0  HIGH
    MILLER  1300.0  MIDDLE
    Time taken: 0.096 seconds, Fetched: 14 row(s)
    
    • 多表join查询:
    # 创建测试表
    hive> create table a(
        > id int, name string
        > ) row format delimited fields terminated by '\t';
    OK
    Time taken: 0.311 seconds
    hive> create table b(
        > id int, age int
        > ) row format delimited fields terminated by '\t';
    OK
    Time taken: 0.142 seconds
    # insert或load数据 最后表数据如下
    hive> select * from a;
    OK
    1       zhangsan
    2       lisi
    3       wangwu
    hive> select * from b;
    OK
    1       20
    2       30
    4       40
    Time taken: 0.2 seconds, Fetched: 3 row(s)
    
    # 内连接 inner join = join 仅列出表1和表2符合连接条件的数据
    hive> select a.id,a.name,b.age from a join b on a.id=b.id;
    Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
    Total jobs = 1
    ...
    OK
    1       zhangsan        20
    2       lisi    30
    Time taken: 24.415 seconds, Fetched: 2 row(s)
    # 左外连接(left join) 以左边的为基准,左边的数据全部数据全部出现,如果没有出现就赋null值
    hive> select a.id,a.name,b.age from a left join b on a.id=b.id;
    Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
    Total jobs = 1
    ...
    OK
    1       zhangsan        20
    2       lisi    30
    3       wangwu  NULL
    Time taken: 26.218 seconds, Fetched: 3 row(s)
    # 右外连接(right join) 以右表为基准
    hive> select a.id,a.name,b.age from a right join b on a.id=b.id;
    Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
    Total jobs = 1
    ...
    OK
    1       zhangsan        20
    2       lisi    30
    NULL    NULL    40
    Time taken: 24.027 seconds, Fetched: 3 row(s)
    # 全连接(full join)相当于表1和表2的数据都显示,如果没有对应的数据,则显示Null.
    hive> select a.id,a.name,b.age from a full join b on a.id=b.id;
    Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
    Total jobs = 1
    ...
    OK
    1       zhangsan        20
    2       lisi    30
    3       wangwu  NULL
    NULL    NULL    40
    Time taken: 32.94 seconds, Fetched: 4 row(s)
    # 笛卡尔积(cross join) 没有连接条件 会针对表1和表2的每条数据做连接
    hive> select a.id,a.name,b.age from a cross join b;
    Warning: Map Join MAPJOIN[7][bigTable=a] in task 'Stage-3:MAPRED' is a cross product
    Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
    Total jobs = 1
    ...
    OK
    1       zhangsan        20
    1       zhangsan        30
    1       zhangsan        40
    2       lisi    20
    2       lisi    30
    2       lisi    40
    3       wangwu  20
    3       wangwu  30
    3       wangwu  40
    Time taken: 29.825 seconds, Fetched: 9 row(s)
    
    3.利用Hive sql实现wordcount:
    # 创建表 加载测试数据
    hive> create table hive_wc(sentence string);
    OK
    Time taken: 0.149 seconds
    
    [hadoop@hadoop000 ~]$ cat hive-wc.txt
    hello,world,welcome
    hello,welcome
    
    hive> load data local inpath '/home/hadoop/hive-wc.txt' into table hive_wc;
    Loading data to table default.hive_wc
    Table default.hive_wc stats: [numFiles=1, totalSize=34]
    OK
    Time taken: 0.729 seconds
    hive> select * from hive_wc;
    OK
    hello,world,welcome
    hello,welcome
    Time taken: 0.13 seconds, Fetched: 2 row(s)
    
    # 获取每个单词 利用split分割
    hive> select split(sentence,",") from hive_wc;
    OK
    ["hello","world","welcome"]
    ["hello","welcome"]
    Time taken: 0.163 seconds, Fetched: 2 row(s)
    # explode把数组转成多行 结合split使用如下
    hive> select explode(split(sentence,",")) from hive_wc;
    OK
    hello
    world
    welcome
    hello
    welcome
    Time taken: 0.068 seconds, Fetched: 5 row(s)
    # 做group by操作 一条语句即可实现wordcount统计
    hive> select word, count(1) as c 
        > from (select explode(split(sentence,",")) as word from hive_wc) t
        > group by word ;
    Query ID = hadoop_20180703142525_af460dc7-287b-41b2-8af3-ba27cc0ea6ce
    Total jobs = 1
    ...
    OK
    hello   2
    welcome 2
    world   1
    Time taken: 34.168 seconds, Fetched: 3 row(s)
    

    相关文章

      网友评论

          本文标题:Hive内部函数简介及查询语法

          本文链接:https://www.haomeiwen.com/subject/emhluftx.html