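1. Introduction
Hive user-defined functions fall into three categories:
(1) UDF (User-Defined Function): one row in, one value out
(2) UDAF (User-Defined Aggregation Function): an aggregate function, many rows in, one value out, similar to count/max/min
(3) UDTF (User-Defined Table-Generating Function): one row in, many rows out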
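The three shapes listed above can be pictured with a small plain-Java sketch. It is only an illustration of the input/output cardinality: the class UdfShapes and its methods are hypothetical names and do not use any Hive API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only: these are NOT Hive classes.
class UdfShapes {

    // UDF shape: one input value -> one output value.
    static String upper(String value) {
        return value == null ? null : value.toUpperCase(Locale.ROOT);
    }

    // UDAF shape: many input values -> one aggregated value (like max).
    static int max(List<Integer> values) {
        int best = Integer.MIN_VALUE;
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    // UDTF shape: one input value -> many output values.
    static List<String> split(String value) {
        return Arrays.asList(value.split(","));
    }

    public static void main(String[] args) {
        System.out.println(upper("hive"));               // HIVE
        System.out.println(max(Arrays.asList(3, 7, 5))); // 7
        System.out.println(split("a,b,c"));              // [a, b, c]
    }
}
```

In real Hive code each shape corresponds to a different base class and calling convention; the sketch only shows how many values go in and come out.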
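To write a UDF, extend the UDF class and add an evaluate method. The parameters of evaluate are the arguments passed in the SQL call, and its return value is the value the query gets back.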
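Because evaluate is an ordinary method rather than an override of an abstract one, it has to be located by name and argument type at run time. The sketch below mimics that idea with plain Java reflection. It is not Hive's actual resolver: EvaluateLookup, LengthUdf, and call are hypothetical names, and LengthUdf deliberately does not extend Hive's UDF class so the example runs without hive-exec.

```java
import java.lang.reflect.Method;

class EvaluateLookup {

    // Hypothetical stand-in for a UDF class with two evaluate overloads.
    public static class LengthUdf {
        public Integer evaluate(String s) {
            return s == null ? null : s.length();
        }

        public Integer evaluate(Integer n) {
            return n == null ? null : n + 1;
        }
    }

    // Picks the evaluate overload whose parameter type matches the argument's
    // runtime class, then invokes it.
    static Object call(Object udf, Object arg) {
        try {
            Method m = udf.getClass().getMethod("evaluate", arg.getClass());
            return m.invoke(udf, arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no matching evaluate method", e);
        }
    }

    public static void main(String[] args) {
        LengthUdf udf = new LengthUdf();
        System.out.println(call(udf, "hive")); // 4
        System.out.println(call(udf, 41));     // 42
    }
}
```

The point of the sketch is that the argument types used in the SQL call decide which evaluate overload runs, so a UDF class can offer several overloads side by side.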
2. Code
pom.xml
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>2.1.1</version>
    <!-- The production environment must not bundle this jar, but local debugging needs it:
         keep this tag when packaging and remove it when debugging locally. -->
    <scope>provided</scope>
    <exclusions>
        <exclusion>
            <groupId>jackson.databind</groupId>
            <artifactId>com.fasterxml.jackson.databind</artifactId>
        </exclusion>
        <exclusion>
            <artifactId>log4j-core</artifactId>
            <groupId>org.apache.logging.log4j</groupId>
        </exclusion>
        <exclusion>
            <artifactId>calcite-avatica</artifactId>
            <groupId>org.apache.calcite</groupId>
        </exclusion>
        <exclusion>
            <artifactId>calcite-core</artifactId>
            <groupId>org.apache.calcite</groupId>
        </exclusion>
        <exclusion>
            <groupId>jackson.annotations</groupId>
            <artifactId>com.fasterxml.jackson.annotations</artifactId>
        </exclusion>
        <exclusion>
            <groupId>jackson.core</groupId>
            <artifactId>com.fasterxml.jackson.core</artifactId>
        </exclusion>
        <exclusion>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-1.2-api</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-slf4j-impl</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.7.25</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.25</version>
</dependency>
log4j.properties
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
TestUdf.java
package com.tianzehao;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TestUdf extends UDF {
    private static final Logger LOG = LoggerFactory.getLogger(TestUdf.class);

    // Logs a message on every call and always returns true; the argument is ignored.
    public boolean evaluate(String a) {
        LOG.info("tianzehao dqc test");
        return true;
    }
}
3. Testing
- Upload the jar to HDFS
- add jar hdfs:///tmp/hive/hiveUdf_Test-1.0-SNAPSHOT.jar;
- create temporary function test as "com.tianzehao.TestUdf";
- select test('a');
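Troubleshooting
Symptom
The log output shows that the UDF is executed multiple times. This happens because the call does not reference any column: every argument is a constant.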
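Solution
When writing the UDF, have it accept a column as a parameter. The column does not need to be processed, and the function can keep returning a constant. For example, with the UDF above: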
select test(id) from test ;