Hive Custom Functions

Author: 小月半会飞 | Published 2019-01-07 18:17

    I. Hive's built-in functions

    Hive ships with a number of built-in functions, for example for taking substrings and converting case.
    Here we take substr as an example.

    1. First, mimic classic SQL and create a dummy table dual

    create table dual(id string);
    

    2. Prepare the data

    Create a local file, dual.txt, whose content is a single space or a single empty line.

    3. Load the data into the table

    load data local inpath '/root/dual.txt' into table dual;
    

    4. Run a test

    0: jdbc:hive2://localhost:10000> select substr('sichuan',1,3) from dual;
    

    5. You can also skip the table and call the function directly (either way the query returns sic, since substr's start index is 1-based):

    select substr('sichuan',1,3);
    

    II. Custom functions (UDFs)

    1. Add the Maven dependencies

    Strictly speaking, only hive-exec is required to compile a simple UDF (it contains org.apache.hadoop.hive.ql.exec.UDF); the other artifacts are listed for completeness.

        <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-exec</artifactId>
          <version>1.2.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-metastore -->
        <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-metastore</artifactId>
          <version>1.2.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-common -->
        <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-common</artifactId>
          <version>1.2.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-service -->
        <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-service</artifactId>
          <version>1.2.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
        <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-jdbc</artifactId>
          <version>1.2.1</version>
        </dependency>
    

    2. Uppercase to lowercase

    1) Create an UpperToLowerCase class that extends UDF and defines an evaluate method
    package com.neusoft.hive;

    import org.apache.hadoop.hive.ql.exec.UDF;

    public class UpperToLowerCase extends UDF {
        /*
         * Hive finds evaluate by reflection, so this is not an override:
         * the method just has to be public and named evaluate.
         */
        public String evaluate(String word) {
            if (word == null) {
                return null; // pass SQL NULL through unchanged
            }
            return word.toLowerCase();
        }
    }
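
    Because evaluate is resolved by reflection, it can also be overloaded or typed on Hadoop writables instead of Java Strings. A minimal sketch (the Text-based class below is our own illustration, not part of the original walkthrough):

    package com.neusoft.hive;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    // Hypothetical variant working directly on Hadoop's Text writable;
    // Hive resolves whichever evaluate() signature matches the call.
    public class UpperToLowerCaseText extends UDF {
        public Text evaluate(Text word) {
            if (word == null) {
                return null; // pass SQL NULL through unchanged
            }
            return new Text(word.toString().toLowerCase());
        }
    }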
    
    2) Build the jar and upload it to the Hadoop cluster

    Here my project (and therefore the jar) is named hive.

    3) Add the jar to Hive's classpath
    0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
    
    4) Create a temporary function, giving the fully qualified class name

    Fully qualified name: package name plus class name. A temporary function only exists for the current session.

    0: jdbc:hive2://localhost:10000> create temporary function tolower as 'com.neusoft.hive.UpperToLowerCase';
    
    5) Use the temporary function (this should return hello):
    select tolower('HELLO');
    

    3. Showing a phone number's home region

    1) Create the UDF class
    package com.neusoft.hive;

    import java.util.HashMap;

    import org.apache.hadoop.hive.ql.exec.UDF;

    public class PhoneNumParse extends UDF {

        // Prefix-to-region lookup table, initialized once when the class loads
        static HashMap<String, String> phoneMap = new HashMap<String, String>();

        static {
            phoneMap.put("136", "beijing");
            phoneMap.put("137", "shanghai");
            phoneMap.put("138", "shenzhen");
        }

        public String evaluate(int phoneNum) {
            String num = String.valueOf(phoneNum);
            String province = phoneMap.get(num.substring(0, 3));
            return province == null ? "foreign" : province;
        }

        // Local test
        public static void main(String[] args) {
            System.out.println(new PhoneNumParse().evaluate(136666));
        }
    }
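
    Note that a real 11-digit mobile number would overflow a Java int. A more defensive variant (our own sketch, added as an extra overload inside the same PhoneNumParse class) takes the number as a string:

    // Hypothetical String overload: avoids int overflow on 11-digit numbers.
    public String evaluate(String phoneNum) {
        if (phoneNum == null || phoneNum.length() < 3) {
            return "foreign";
        }
        String province = phoneMap.get(phoneNum.substring(0, 3));
        return province == null ? "foreign" : province;
    }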
    
    2) Rebuild the jar and upload it to the Hadoop cluster again

    Here my project (and therefore the jar) is named hive.

    3) Add the jar to Hive's classpath
    0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
    
    4) Create a temporary function, giving the fully qualified class name

    Fully qualified name: package name plus class name.

    0: jdbc:hive2://localhost:10000> create temporary function getprovince as 'com.neusoft.hive.PhoneNumParse';
    
    5) Create local test data

    Create flow.txt with the following contents:

    [root@mini1 ~]# vi flow.txt
    1367788,1
    1367788,10
    1377788,80
    1377788,97
    1387788,98
    1387788,99
    1387788,100
    1555118,99
    
    6) Create a table and load the data
    0: jdbc:hive2://localhost:10000> create table flow(phonenum int,flow int)
    0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
    No rows affected (0.143 seconds)
    0: jdbc:hive2://localhost:10000> load data local inpath '/root/flow.txt' into table flow;
    
    7) Query the result
    0: jdbc:hive2://localhost:10000> select phonenum,getprovince(phonenum),flow from flow;
    +-----------+-----------+-------+--+
    | phonenum  |    _c1    | flow  |
    +-----------+-----------+-------+--+
    | 1367788   | beijing   | 1     |
    | 1367788   | beijing   | 10    |
    | 1377788   | shanghai  | 80    |
    | 1377788   | shanghai  | 97    |
    | 1387788   | shenzhen  | 98    |
    | 1387788   | shenzhen  | 99    |
    | 1387788   | shenzhen  | 100   |
    | 1555118   | foreign   | 99    |
    +-----------+-----------+-------+--+
    

    4. Developing a JSON-parsing UDF

    1) Create the source data file

    The content is below. Make sure there are no blank lines or extra spaces, or parsing will fail:

    {"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
    {"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
    {"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
    {"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
    {"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
    {"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
    
    2) Create a table and load the data

    Create the table:

    create table json(line string);
    

    Load the data:

    load data local inpath '/root/json.txt' into table json;
    
    3) A JavaBean matching the JSON records
    public class MovieRateBean {
        private String movie;
        private String rate;      // rating
        private String timeStamp; // field name must match the "timeStamp" JSON key
        private String uid;

        @Override
        public String toString() {
            return this.movie + "\t" + this.rate + "\t" + this.timeStamp + "\t" + this.uid;
        }
        // getters and setters omitted
    }
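
    fastjson fills the bean's private fields through its public setters, so the omitted get/set methods do need to exist. Each field follows the usual pattern; a sketch for movie alone (rate, timeStamp, and uid need the same pair, inside MovieRateBean):

    // Representative accessor pair for one field.
    public String getMovie() {
        return movie;
    }

    public void setMovie(String movie) {
        this.movie = movie;
    }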
    
    4) The UDF class
    package com.neusoft.hive;

    import com.alibaba.fastjson.JSON;
    import com.alibaba.fastjson.TypeReference;
    import org.apache.hadoop.hive.ql.exec.UDF;

    public class JsonParse extends UDF {
        public String evaluate(String jsonStr) {
            // Parse one JSON line into the bean, then re-emit it tab-separated
            MovieRateBean movieRateBean = JSON.parseObject(jsonStr, new TypeReference<MovieRateBean>() {});
            return movieRateBean.toString();
        }
    }
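
    A quick local sanity check (a hypothetical main method of our own, added inside JsonParse; Hive itself never calls it):

    public static void main(String[] args) {
        String json = "{\"movie\":\"1193\",\"rate\":\"5\",\"timeStamp\":\"978300760\",\"uid\":\"1\"}";
        // Should print the four values tab-separated: 1193, 5, 978300760, 1
        System.out.println(new JsonParse().evaluate(json));
    }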
    
    5) Build the jar and upload it to the Hadoop cluster
    6) Add the jar to Hive's classpath
    0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
    
    7) Also add the fastjson jar to Hive's classpath (the UDF needs it at query time)
    0: jdbc:hive2://localhost:10000> add jar /root/fastjson-1.1.41.jar;
    
    8) Create the temporary function:

    The class name must be fully qualified, i.e. package name plus class name.

    0: jdbc:hive2://localhost:10000> create temporary function parsejson as 'com.neusoft.hive.JsonParse';
    
    9) Run the query
    0: jdbc:hive2://localhost:10000> select parsejson(line) from json limit 10;
    +---------------------+--+
    |         _c0         |
    +---------------------+--+
    | 1193  5       978300760       1  |
    | 661   3       978302109       1   |
    | 914   3       978301968       1   |
    | 3408  4       978300275       1  |
    | 2355  5       978824291       1  |
    | 1197  3       978302268       1  |
    | 1287  5       978302039       1  |
    | 2804  5       978300719       1  |
    | 594   4       978302268       1   |
    | 919   4       978301368       1   |
    +---------------------+--+
    
    10) Getting proper column names

    As the output above shows, the line is split into fields, but the columns have no names. We can fix that by materializing the result into a table:

    0: jdbc:hive2://localhost:10000> create table movie as
    0: jdbc:hive2://localhost:10000> select split(parsejson(line),'\t')[0] as movieid,
    0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[1] as rate,
    0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[2] as timestring,
    0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[3] as uid 
    0: jdbc:hive2://localhost:10000> from json;
    

    Query again to check the result:

    0: jdbc:hive2://localhost:10000> select * from movie;
    +----------------+-------------+-------------------+------------+--+
    | movie.movieid  | movie.rate  | movie.timestring  | movie.uid  |
    +----------------+-------------+-------------------+------------+--+
    | 919            | 4           | 978301368         | 1          |
    | 594            | 4           | 978302268         | 1          |
    | 2804           | 5           | 978300719         | 1          |
    | 1287           | 5           | 978302039         | 1          |
    | 1197           | 3           | 978302268         | 1          |
    | 2355           | 5           | 978824291         | 1          |
    | 3408           | 4           | 978300275         | 1          |
    | 914            | 3           | 978301968         | 1          |
    | 661            | 3           | 978302109         | 1          |
    | 1193           | 5           | 978300760         | 1          |
    +----------------+-------------+-------------------+------------+--+
    

    III. Using the TRANSFORM keyword

    To turn a Unix-timestamp field into the day of the week, we do not even need a UDF: TRANSFORM streams each row through an external script.
    We reuse the movie table created above and convert its third field to a weekday.

    1. Write a Python script

    Create a local Python script named trans.py

    vi trans.py
    

    The script code:

    import sys
    import datetime

    # Hive streams each row to stdin as one tab-separated line.
    for line in sys.stdin:
        line = line.strip()
        movieid, rating, unixtime, userid = line.split('\t')
        # isoweekday(): Monday = 1 ... Sunday = 7
        weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
        # print() works under both Python 2 and Python 3
        print('\t'.join([movieid, rating, str(weekday), userid]))
    

    2. Use the script

    1) Add the script file to Hive (it is shipped to every task via the distributed cache)
    0: jdbc:hive2://localhost:10000> add FILE /root/trans.py;
    
    2) Run the TRANSFORM query (the column list must match the movie table: movieid, rate, timestring, uid)
    0: jdbc:hive2://localhost:10000> select TRANSFORM(movieid,rate,timestring,uid) USING 'python trans.py' as (mov,rat,tim,uid) from movie;
    

    Output:

    INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
    INFO  : number of splits:1
    INFO  : Submitting tokens for job: job_1546821616463_0002
    INFO  : The url to track the job: http://hadoop4:8088/proxy/application_1546821616463_0002/
    INFO  : Starting Job = job_1546821616463_0002, Tracking URL = http://hadoop4:8088/proxy/application_1546821616463_0002/
    INFO  : Kill Command = /opt/modules/app/hadoop/bin/hadoop job  -kill job_1546821616463_0002
    INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
    INFO  : 2019-01-05 00:35:37,847 Stage-1 map = 0%,  reduce = 0%
    INFO  : 2019-01-05 00:35:56,316 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.55 sec
    INFO  : MapReduce Total cumulative CPU time: 1 seconds 550 msec
    INFO  : Ended Job = job_1546821616463_0002
    +-------+------+------+------+--+
    |  mov  | rat  | tim  | uid  |
    +-------+------+------+------+--+
    | 1197  | 3    | 1    | 1    |
    | 2355  | 5    | 7    | 1    |
    | 3408  | 4    | 1    | 1    |
    | 914   | 3    | 1    | 1    |
    | 661   | 3    | 1    | 1    |
    | 1193  | 5    | 1    | 1    |
    +-------+------+------+------+--+
    6 rows selected (36.201 seconds)
    
    
