0003-如何在CDH中使用LZO压缩

作者: Hadoop实操 | 来源:发表于2018-12-08 22:20 被阅读7次

0003-如何在CDH中使用LZO压缩
hadoop,spark中使用lzo
Go gzip,lzo,snappy的简单使用
Hive读取索引文件问题
HadoopLZO压缩配置
CDH6安装Lzo
LZO 创建索引
hadoop之配置LZO压缩以及创建索引支持分片
压缩
HBase优化

Fayson的github： https://github.com/fayson/cdhproject
推荐关注微信公众号：“Hadoop实操”，ID：gh_c4c535955d0f，或者扫描文末二维码。

1.问题描述

CDH中默认不支持Lzo压缩编码，需要下载额外的Parcel包，才能让Hadoop相关组件如HDFS，Hive，Spark支持Lzo编码。

具体请参考：
Configuring Services to Use the GPL Extras Parcel
Installing the GPL Extras Parcel

首先我在没做额外配置的情况下，生成Lzo文件并读取。我们在Hive中创建两张表，test_table和test_table2，test_table是文本文件的表，test_table2是Lzo压缩编码的表。如下：

create external table test_table
(
s1 string,
s2 string
)
row format delimited fields terminated by '#'
location '/lilei/test_table';

insert into test_table values('1','a'),('2','b');

create external table test_table2
(
s1 string,
s2 string
)
row format delimited fields terminated by '#'
location '/lilei/test_table2';

通过beeline访问Hive并执行上面命令：

查询test_table中的数据：

将test_table中的数据插入到test_table2，并设置输出文件为lzo压缩：

set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

insert overwrite table test_table2 select * from test_table;

在Hive中执行报错如下：

Error:Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)

通过Yarn的8088可以发现是因为找不到Lzo压缩编码：

Compression codec com.hadoop.compression.lzo.LzoCodec was not found.

2.解决办法

通过Cloudera Manager的Parcel页面配置Lzo的Parcel包地址：

注意：如果集群无法访问公网，需要提前下载好Parcel包并发布到httpd

下载->分配->激活

配置HDFS的压缩编码加入Lzo：

com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec

保存更改，部署客户端配置，重启整个集群。

等待重启成功：

再次插入数据到test_table2，设置为Lzo编码格式：

set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

insert overwrite table test_table2 select * from test_table;

插入成功：

2.1.Hive验证

首先确认test_table2中的文件为Lzo格式：

在Hive的beeline中进行测试：

Hive基于Lzo压缩文件运行正常。

2.2.Spark SQL验证

var textFile=sc.textFile("hdfs://ip-172-31-8-141:8020/lilei/test_table2/000000_0.lzo_deflate")

textFile.count()

sqlContext.sql("select * from test_table2")