I. Built-in Hive functions
Hive ships with a number of built-in functions, such as string truncation and case conversion.
Here we take substr as an example.
1. First, create a dummy table named dual, following the SQL convention
create table dual(id string);
2. Prepare the data
Create a local file, dual.txt, whose content is a single space or a blank line.
3. Load the data into the table
load data local inpath '/root/dual.txt' into table dual;
4. Run a test
0: jdbc:hive2://localhost:10000> select substr('sichuan',1,3) from dual;
5. You can also skip the table and call the function directly:
select substr('sichuan',1,3);
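Note that Hive's substr uses 1-based positions, unlike Java's String.substring, which is 0-based. A minimal sketch of the correspondence (hiveSubstr is a hypothetical helper name, not a Hive API):

```java
public class SubstrDemo {

    // Mimics Hive's 1-based substr(str, pos, len) on top of
    // Java's 0-based String.substring.
    public static String hiveSubstr(String s, int pos, int len) {
        int start = pos - 1;                          // Hive positions start at 1
        int end = Math.min(start + len, s.length());  // clamp to string length
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // Same result as Hive's substr('sichuan', 1, 3)
        System.out.println(hiveSubstr("sichuan", 1, 3));
    }
}
```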
II. User-defined functions
1. Add the Maven dependencies
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-metastore -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-metastore</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-common -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-common</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-service -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-service</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.1</version>
</dependency>
2. Converting uppercase to lowercase
1) Create an UpperToLowerCase class that extends UDF and implements an evaluate method
import org.apache.hadoop.hive.ql.exec.UDF;

public class UpperToLowerCase extends UDF {

    /*
     * Hive resolves evaluate by reflection, so it may be overloaded,
     * and its access modifier must be public.
     */
    public String evaluate(String word) {
        return word.toLowerCase();
    }
}
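One detail worth guarding against: Hive passes a SQL NULL as a Java null, so the evaluate above would throw a NullPointerException on NULL input. A minimal null-safe sketch (the UDF superclass is omitted here so the snippet runs standalone; this is an assumption-labeled variant, not the tutorial's class):

```java
public class LowerCaseSketch {

    // Null-safe version of the evaluate logic; Hive maps SQL NULL to Java null.
    public static String evaluate(String word) {
        if (word == null) {
            return null;  // propagate NULL instead of throwing NPE
        }
        return word.toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println(evaluate("HELLO"));
    }
}
```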
2) Package the project into a jar and upload it to the Hadoop cluster
Here my project is named hive.
3) Add the jar to Hive's classpath
0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
4) Create a temporary function, specifying the fully qualified class name
Fully qualified name: package name plus class name
0: jdbc:hive2://localhost:10000> create temporary function tolower as 'com.neusoft.hive.UpperToLowerCase';
5) Use the temporary function
select tolower('HELLO');
3. Looking up a phone number's home region
1) Create the UDF class
import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;

public class PhoneNumParse extends UDF {

    static HashMap<String, String> phoneMap = new HashMap<String, String>();

    static {
        phoneMap.put("136", "beijing");
        phoneMap.put("137", "shanghai");
        phoneMap.put("138", "shenzhen");
    }

    public static String evaluate(int phoneNum) {
        String num = String.valueOf(phoneNum);
        String province = phoneMap.get(num.substring(0, 3));
        return province == null ? "foreign" : province;
    }

    // Local test
    public static void main(String[] args) {
        String string = evaluate(136666);
        System.out.println(string);
    }
}
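One caveat with the int parameter: a real 11-digit mobile number such as 13666668888 overflows a Java int, so the version above only works for short sample numbers. A hedged sketch using String instead (lookup is a hypothetical helper name, not part of the tutorial's class):

```java
import java.util.HashMap;
import java.util.Map;

public class PhonePrefixSketch {

    static Map<String, String> phoneMap = new HashMap<>();

    static {
        phoneMap.put("136", "beijing");
        phoneMap.put("137", "shanghai");
        phoneMap.put("138", "shenzhen");
    }

    // A String parameter avoids int overflow on real 11-digit numbers.
    public static String lookup(String phoneNum) {
        if (phoneNum == null || phoneNum.length() < 3) {
            return "foreign";
        }
        String province = phoneMap.get(phoneNum.substring(0, 3));
        return province == null ? "foreign" : province;
    }

    public static void main(String[] args) {
        System.out.println(lookup("13666668888"));
    }
}
```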
2) Rebuild the jar and upload it to the Hadoop cluster
Here my project is named hive.
3) Add the jar to Hive's classpath
0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
4) Create a temporary function, specifying the fully qualified class name
Fully qualified name: package name plus class name
create temporary function getprovince as 'com.neusoft.hive.PhoneNumParse';
5) Create the local data
Create flow.txt with the following content:
[root@mini1 ~]# vi flow.txt
1367788,1
1367788,10
1377788,80
1377788,97
1387788,98
1387788,99
1387788,100
1555118,99
6) Create the table, then load the data
0: jdbc:hive2://localhost:10000> create table flow(phonenum int,flow int)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
No rows affected (0.143 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/root/flow.txt' into table flow;
7) Query the results
0: jdbc:hive2://localhost:10000> select phonenum,getprovince(phonenum),flow from flow;
+-----------+-----------+-------+--+
| phonenum | _c1 | flow |
+-----------+-----------+-------+--+
| 1367788 | beijing | 1 |
| 1367788 | beijing | 10 |
| 1377788 | shanghai | 80 |
| 1377788 | shanghai | 97 |
| 1387788 | shenzhen | 98 |
| 1387788 | shenzhen | 99 |
| 1387788 | shenzhen | 100 |
| 1555118 | foreign | 99 |
+-----------+-----------+-------+--+
4. Developing a JSON-parsing UDF
1) Create the source data file
The content is shown below; make sure there are no blank lines or extra spaces, or parsing will fail.
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
2) Create the table, then load the data
Create the table:
create table json(line string);
Load the data:
load data local inpath '/root/json.txt' into table json;
3) A JavaBean matching the JSON records
public class MovieRateBean {

    private String movie;
    private String rate;      // rating
    private String timeStamp;
    private String uid;

    @Override
    public String toString() {
        return this.movie + "\t" + this.rate + "\t" + this.timeStamp + "\t" + this.uid;
    }

    // getters and setters omitted
}
4) The Java UDF class
package com.neusoft.hive;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.TypeReference;
import org.apache.hadoop.hive.ql.exec.UDF;

public class JsonParse extends UDF {

    public String evaluate(String jsonStr) {
        MovieRateBean movieRateBean = JSON.parseObject(jsonStr, new TypeReference<MovieRateBean>() {});
        return movieRateBean.toString();
    }
}
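If fastjson is not available, these flat one-line JSON records can also be parsed with a plain regex. A minimal standalone sketch (an alternative under that assumption, not the tutorial's method):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonParseSketch {

    // Matches the "key":"value" pairs in the flat one-line records above.
    private static final Pattern PAIR = Pattern.compile("\"(\\w+)\":\"([^\"]*)\"");

    public static String evaluate(String jsonStr) {
        String movie = "", rate = "", ts = "", uid = "";
        Matcher m = PAIR.matcher(jsonStr);
        while (m.find()) {
            switch (m.group(1)) {
                case "movie":     movie = m.group(2); break;
                case "rate":      rate  = m.group(2); break;
                case "timeStamp": ts    = m.group(2); break;
                case "uid":       uid   = m.group(2); break;
            }
        }
        // Same tab-separated layout as MovieRateBean.toString()
        return movie + "\t" + rate + "\t" + ts + "\t" + uid;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(
            "{\"movie\":\"1193\",\"rate\":\"5\",\"timeStamp\":\"978300760\",\"uid\":\"1\"}"));
    }
}
```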
5) Build the jar and upload it to the Hadoop cluster
6) Add the jar to Hive's classpath
0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
7) Add the fastjson jar to Hive's classpath
0: jdbc:hive2://localhost:10000> add jar /root/fastjson-1.1.41.jar;
8) Create the temporary function:
The class name must be fully qualified, i.e. package name plus class name.
0: jdbc:hive2://localhost:10000> create temporary function parsejson as 'com.neusoft.hive.JsonParse';
9) Run the query
0: jdbc:hive2://localhost:10000> select parsejson(line) from json limit 10;
+---------------------+--+
| _c0 |
+---------------------+--+
| 1193 5 978300760 1 |
| 661 3 978302109 1 |
| 914 3 978301968 1 |
| 3408 4 978300275 1 |
| 2355 5 978824291 1 |
| 1197 3 978302268 1 |
| 1287 5 978302039 1 |
| 2804 5 978300719 1 |
| 594 4 978302268 1 |
| 919 4 978301368 1 |
+---------------------+--+
10) Display field names
The results above show the data split into columns, but without column names. We can fix that by creating a new table from the parsed output:
0: jdbc:hive2://localhost:10000> create table movie as
0: jdbc:hive2://localhost:10000> select split(parsejson(line),'\t')[0] as movieid,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[1] as rate,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[2] as timestring,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[3] as uid
0: jdbc:hive2://localhost:10000> from json;
Run a query again to check the result:
0: jdbc:hive2://localhost:10000> select * from movie;
+----------------+-------------+-------------------+------------+--+
| movie.movieid  | movie.rate  | movie.timestring  | movie.uid  |
+----------------+-------------+-------------------+------------+--+
| 919            | 4           | 978301368         | 1          |
| 594            | 4           | 978302268         | 1          |
| 2804           | 5           | 978300719         | 1          |
| 1287           | 5           | 978302039         | 1          |
| 1197           | 3           | 978302268         | 1          |
| 2355           | 5           | 978824291         | 1          |
| 3408           | 4           | 978300275         | 1          |
| 914            | 3           | 978301968         | 1          |
| 661            | 3           | 978302109         | 1          |
| 1193           | 5           | 978300760         | 1          |
+----------------+-------------+-------------------+------------+--+
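The split(parsejson(line),'\t')[i] calls used to build this table index into a 0-based array, unlike substr's 1-based positions. Java's String.split behaves the same way, which makes the indexing easy to check locally (field is a hypothetical helper name):

```java
public class SplitDemo {

    // Mirrors Hive's split(line, '\t')[i]: both are 0-indexed arrays.
    public static String field(String line, int i) {
        return line.split("\t")[i];
    }

    public static void main(String[] args) {
        // One parsed record, as produced by the parsejson UDF above
        String line = "1193\t5\t978300760\t1";
        System.out.println(field(line, 0));  // movieid
    }
}
```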
III. Using the transform keyword
To convert a timestamp field into the day of the week it falls on, you don't need to implement a UDF.
We reuse the movie table created above and convert its third field from a timestamp to a weekday.
1. Write a Python script
Create a local Python script named trans.py:
vi trans.py
Script code:
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([movieid, rating, str(weekday), userid]))
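For comparison, the same timestamp-to-weekday conversion can be sketched in Java with java.time. Note that fromtimestamp in the Python script uses the machine's local timezone, so the weekday can differ by timezone; the sketch below assumes Asia/Shanghai, which matches the tutorial's output:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class WeekdaySketch {

    // ISO weekday: Monday = 1 ... Sunday = 7, matching Python's isoweekday().
    // Assumes the China timezone, like the machine that produced the output below.
    public static int isoWeekday(long unixSeconds) {
        return ZonedDateTime
                .ofInstant(Instant.ofEpochSecond(unixSeconds), ZoneId.of("Asia/Shanghai"))
                .getDayOfWeek()
                .getValue();
    }

    public static void main(String[] args) {
        // Timestamp of movie 1197 from the movie table
        System.out.println(isoWeekday(978302268L));
    }
}
```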
2. Use the script
1) Add the script file to Hive
0: jdbc:hive2://localhost:10000> add FILE /root/trans.py;
2) Run the transform query
0: jdbc:hive2://localhost:10000> select TRANSFORM(movieid,rate,timestring,uid) USING 'python trans.py' as (mov,rat,tim,uid) from movie;
Output:
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1546821616463_0002
INFO : The url to track the job: http://hadoop4:8088/proxy/application_1546821616463_0002/
INFO : Starting Job = job_1546821616463_0002, Tracking URL = http://hadoop4:8088/proxy/application_1546821616463_0002/
INFO : Kill Command = /opt/modules/app/hadoop/bin/hadoop job -kill job_1546821616463_0002
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2019-01-05 00:35:37,847 Stage-1 map = 0%, reduce = 0%
INFO : 2019-01-05 00:35:56,316 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.55 sec
INFO : MapReduce Total cumulative CPU time: 1 seconds 550 msec
INFO : Ended Job = job_1546821616463_0002
+-------+------+------+------+--+
| mov | rat | tim | uid |
+-------+------+------+------+--+
| 1197 | 3 | 1 | 1 |
| 2355 | 5 | 7 | 1 |
| 3408 | 4 | 1 | 1 |
| 914 | 3 | 1 | 1 |
| 661 | 3 | 1 | 1 |
| 1193 | 5 | 1 | 1 |
+-------+------+------+------+--+
6 rows selected (36.201 seconds)