Fayson's GitHub: https://github.com/fayson/cdhproject
For more articles, follow the WeChat official account "Hadoop实操" (ID: gh_c4c535955d0f).
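1. Problem Description
CDH does not support Lzo compression out of the box. An additional Parcel has to be installed before Hadoop components such as HDFS, Hive, and Spark can use the Lzo codec.
For details, see the Cloudera documentation: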
Configuring Services to Use the GPL Extras Parcel
Installing the GPL Extras Parcel
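First, without any extra configuration, I try to generate an Lzo file and read it back. We create two tables in Hive, test_table and test_table2. test_table stores plain text files, and test_table2 is the table that will receive Lzo-compressed output: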
create external table test_table
(
s1 string,
s2 string
)
row format delimited fields terminated by '#'
location '/lilei/test_table';
insert into test_table values('1','a'),('2','b');
create external table test_table2
(
s1 string,
s2 string
)
row format delimited fields terminated by '#'
location '/lilei/test_table2';
Connect to Hive through beeline and run the statements above.
Query test_table to confirm that the two rows were inserted.
Insert the data from test_table into test_table2, with the output files set to Lzo compression:
set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
insert overwrite table test_table2 select * from test_table;
The statement fails in Hive with the following error:
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)
The YARN web UI (port 8088) shows that the job failed because the Lzo compression codec could not be found:
Compression codec com.hadoop.compression.lzo.LzoCodec was not found.
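The "was not found" message means the codec class could not be loaded. As a rough diagnostic, the sketch below tries to load the two Lzo codec classes named in this article, for example from spark-shell. The helper name codecOnClasspath is hypothetical and not part of the original article, and the check only covers the JVM it runs in, not every node or YARN container.

```scala
import scala.util.Try

// Hypothetical helper (not from the original article): true when the named
// class can be loaded by the JVM this code runs in.
def codecOnClasspath(className: String): Boolean =
  Try(Class.forName(className)).isSuccess

// The two Lzo codec classes named in this article.
val lzoCodecs = Seq(
  "com.hadoop.compression.lzo.LzoCodec",
  "com.hadoop.compression.lzo.LzopCodec"
)

// Print, for each codec class, whether it could be loaded.
lzoCodecs.foreach { name =>
  println(s"$name on classpath: ${codecOnClasspath(name)}")
}
```

If either class is reported as unavailable, the job is likely to fail with the same error.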
2. Solution
On the Parcels page in Cloudera Manager, configure the repository URL of the Lzo (GPL Extras) Parcel.
Note: if the cluster cannot reach the public internet, download the Parcel in advance and publish it through httpd.
Then run the three Parcel steps in order: Download -> Distribute -> Activate.
In the HDFS configuration, add the Lzo codecs to the compression codecs setting (io.compression.codecs):
com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec
Save the changes, deploy the client configuration, and restart the entire cluster.
Wait for the restart to complete.
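After the restart, one way to double-check that the deployed client configuration really lists the Lzo codecs is to inspect the io.compression.codecs value. The sketch below is only an illustration: missingCodecs is a hypothetical helper, and the example value is made up for demonstration rather than taken from a real cluster.

```scala
// Hypothetical helper: given the comma-separated value of io.compression.codecs,
// return the required codec classes that are not listed in it.
def missingCodecs(configured: String, required: Seq[String]): Seq[String] = {
  val listed = Option(configured).getOrElse("")
    .split(",").map(_.trim).filter(_.nonEmpty).toSet
  required.filterNot(listed.contains)
}

val required = Seq(
  "com.hadoop.compression.lzo.LzoCodec",
  "com.hadoop.compression.lzo.LzopCodec"
)

// Example value for illustration only; in spark-shell the real value could be
// read with sc.hadoopConfiguration.get("io.compression.codecs").
val example =
  "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec"

// Prints the codecs that still need to be added; empty means both are listed.
println(missingCodecs(example, required))
```

An empty result suggests that the configuration change has reached the client.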
Insert the data into test_table2 again, with Lzo as the output codec:
set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzoCodec;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
insert overwrite table test_table2 select * from test_table;
This time the insert succeeds.
2.1. Hive Verification
First, confirm that the file under /lilei/test_table2 is in Lzo format. It should carry the .lzo_deflate extension (here, 000000_0.lzo_deflate).
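The extension of the output file indicates which codec wrote it: LzoCodec uses .lzo_deflate by default, while LzopCodec uses .lzo. The sketch below encodes that mapping in a small lookup. The helper lzoCodecFor is hypothetical and only inspects the file name, not the file contents.

```scala
// Default file extensions used by the two Lzo codecs.
val extensionToCodec = Map(
  ".lzo_deflate" -> "com.hadoop.compression.lzo.LzoCodec",
  ".lzo"         -> "com.hadoop.compression.lzo.LzopCodec"
)

// Hypothetical helper: guess the Lzo codec from a file name, if any.
def lzoCodecFor(fileName: String): Option[String] =
  extensionToCodec.collectFirst {
    case (ext, codec) if fileName.endsWith(ext) => codec
  }

// The file produced by the insert in this article.
println(lzoCodecFor("000000_0.lzo_deflate"))
```

A file without either extension yields None, which would suggest that the output was not written with an Lzo codec.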
Then query test_table2 from beeline.
Hive works correctly on the Lzo-compressed file.
2.2. Spark SQL Verification
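// Read the Lzo-compressed file directly from HDFS and count its lines
val textFile = sc.textFile("hdfs://ip-172-31-8-141:8020/lilei/test_table2/000000_0.lzo_deflate")
textFile.count()
// Query the table through Spark SQL; show() triggers execution and prints the rows
sqlContext.sql("select * from test_table2").show()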
Spark SQL works correctly on the Lzo-compressed file.
Original article. Reposting is welcome; please credit the WeChat official account "Hadoop实操" as the source.