I. LZO-related software download and prerequisites
1. Check whether lzop is installed; compressing and decompressing .lzo files requires the lzop tool on the server
[hadoop@hadoop000 ~]$ which lzop
/usr/bin/lzop
If it is not installed, install it together with the build dependencies (these are also needed later to compile hadoop-lzo):
yum install -y svn ncurses-devel
yum install -y gcc gcc-c++ make cmake
yum install -y openssl openssl-devel svn ncurses-devel zlib-devel libtool
yum install -y lzo lzo-devel lzop autoconf automake cmake
2. Prepare a dataset that is larger than 128 MB after compression
You can use the log generator from the Hadoop MR ETL offline project article to generate an access log.
[hadoop@hadoop000 accesslog]$ ll -h access.log
-rw-r--r-- 1 hadoop hadoop 2.7G Apr 20 19:05 access.log
-------------------------- Log format -----------------------------------------
baidu CN 2 2019-01-21 17:17:56 123.235.248.216 v2.go2yd.com http://v1.go2yd.com/user_upload/1531633977627104fdecdc68fe7a2c4b96b2226fd3f4c.mp4_bd.mp4 785966
3. Compress the log with lzop
LZO compress: lzop -v file
LZO decompress: lzop -dv file
[hadoop@hadoop000 accesslog]$ lzop -v access.log
---------------------------------------------------------------------
[hadoop@hadoop000 accesslog]$ ll -h access.log*
-rw-r--r-- 1 hadoop hadoop 2.7G Apr 20 19:05 access.log
-rw-r--r-- 1 hadoop hadoop 211M Apr 20 19:05 access.log.lzo
The compression ratio here is unusually high (2.7 GB down to 211 MB) because access.log contains a great deal of repeated data.
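Before moving on, it is worth verifying that the archive is intact. A minimal sketch using standard lzop flags (-t tests integrity, -c/-d decompress to stdout):
# Test archive integrity (verbose)
lzop -tv access.log.lzo
# Round-trip check: decompress to stdout and compare checksums with the original
lzop -cd access.log.lzo | md5sum
md5sum access.log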
II. Building hadoop-lzo
1. Get the hadoop-lzo source code
https://github.com/twitter/hadoop-lzo
---------------------------------------------------------------------
[hadoop@hadoop000 source]$ ll hadoop-lzo-master.zip
-rw-r--r-- 1 hadoop hadoop 1040269 Apr 20 16:13 hadoop-lzo-master.zip
2. Build
[hadoop@hadoop000 source]$ unzip hadoop-lzo-master.zip
-------------------------------------------
[hadoop@hadoop000 source]$ cd hadoop-lzo-master
[hadoop@hadoop000 hadoop-lzo-master]$ ll
total 76
-rw-rw-r-- 1 hadoop hadoop 35147 Oct 13 2017 COPYING
-rw-rw-r-- 1 hadoop hadoop 19753 Oct 13 2017 pom.xml
-rw-rw-r-- 1 hadoop hadoop 10170 Oct 13 2017 README.md
drwxrwxr-x 2 hadoop hadoop 4096 Oct 13 2017 scripts
drwxrwxr-x 4 hadoop hadoop 4096 Oct 13 2017 src
------------------------------------------------------------
[hadoop@hadoop000 hadoop-lzo-master]$ mvn clean package -Dmaven.test.skip=true
[INFO] Building jar: /home/hadoop/soul/soft/source/hadoop-lzo-master/target/hadoop-lzo-0.4.21-SNAPSHOT-javadoc.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:52 min
[INFO] Finished at: 2019-04-20T16:33:02+08:00
[INFO] Final Memory: 32M/78M
[INFO] ------------------------------------------------------------------------
The built hadoop-lzo jar is /home/hadoop/soul/soft/source/hadoop-lzo-master/target/hadoop-lzo-0.4.21-SNAPSHOT.jar
Rename it:
[hadoop@hadoop000 target]$ mv hadoop-lzo-0.4.21-SNAPSHOT.jar hadoop-lzo.jar
III. Preparation before testing
1. Copy hadoop-lzo.jar into Hadoop's common directory
[hadoop@hadoop000 target]$ cp hadoop-lzo.jar /home/hadoop/soul/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common
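Note: depending on your build, hadoop-lzo also produces native JNI libraries (libgplcompression.*), and jobs may fail with "Could not load native gpl library" if Hadoop cannot find them. A sketch, assuming the typical Maven build output path (verify it on your machine):
# Copy the native libs from the hadoop-lzo build output into Hadoop's native dir
cp target/native/Linux-amd64-64/lib/libgplcompression* \
   /home/hadoop/soul/app/hadoop-2.6.0-cdh5.7.0/lib/native/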
2. Configure core-site.xml and mapred-site.xml
core-site.xml
<property>
    <name>io.compression.codecs</name>
    <value>
        org.apache.hadoop.io.compress.GzipCodec,
        org.apache.hadoop.io.compress.DefaultCodec,
        org.apache.hadoop.io.compress.BZip2Codec,
        org.apache.hadoop.io.compress.SnappyCodec,
        com.hadoop.compression.lzo.LzoCodec,
        com.hadoop.compression.lzo.LzopCodec
    </value>
</property>
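A note on the two hadoop-lzo codecs: com.hadoop.compression.lzo.LzopCodec handles the .lzo container format written by the lzop tool (which is what we load below), while com.hadoop.compression.lzo.LzoCodec is a raw LZO stream. The hadoop-lzo README additionally suggests setting io.compression.codec.lzo.class to com.hadoop.compression.lzo.LzoCodec in core-site.xml.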
mapred-site.xml
<!-- Enable compression of MR job output -->
<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<!-- Enable compression of map-stage output -->
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<!-- Use Snappy for map-stage output compression -->
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
Remember to restart Hadoop after changing the configuration.
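For a single-node setup like this one, a restart could look like the following (standard scripts from Hadoop's sbin directory; adjust to however you manage your cluster):
# Restart HDFS and YARN so the new jar and configs are picked up
$HADOOP_HOME/sbin/stop-yarn.sh && $HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh && $HADOOP_HOME/sbin/start-yarn.sh
# Optional sanity check: confirm the codec list is visible
hive -e "set io.compression.codecs;"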
3. Create the access_lzo table in Hive
create table access_lzo (
cdn string,
region string,
level string,
time string,
ip string,
domain string,
url string,
traffic bigint
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
If Hadoop was not restarted, or hadoop-lzo.jar was not copied into Hadoop's common directory, this DDL fails with:
FAILED: SemanticException Cannot find class 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
Load the data (the lzop-compressed file):
load data local inpath '/home/hadoop/soul/data/accesslog/access.log.lzo' overwrite into table access_lzo;
-------------------------------------------------------------------------------------------------------
[hadoop@hadoop000 accesslog]$ hadoop fs -du -s -h /user/hive/warehouse/access_lzo
210.8 M 210.8 M /user/hive/warehouse/access_lzo
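As a quick sanity check that Hive can actually decode the lzop file (a minimal sketch; the columns come from the DDL above):
hive -e "select cdn, region, traffic from access_lzo limit 3;"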
IV. LZO before creating an index
1. Query
hive (default)> select count(*) from access_lzo;
Query ID = hadoop_20190420193636_7054c76b-ae04-458e-a745-bf74fda63f28
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1555760099632_0001, Tracking URL = http://hadoop000:8088/proxy/application_1555760099632_0001/
Kill Command = /home/hadoop/soul/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1555760099632_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-04-20 19:44:08,145 Stage-1 map = 0%, reduce = 0%
2019-04-20 19:44:19,033 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 5.09 sec
2019-04-20 19:44:22,202 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 7.53 sec
2019-04-20 19:44:25,384 Stage-1 map = 44%, reduce = 0%, Cumulative CPU 9.56 sec
2019-04-20 19:44:27,489 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 11.61 sec
2019-04-20 19:44:34,899 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 12.88 sec
MapReduce Total cumulative CPU time: 12 seconds 880 msec
Ended Job = job_1555760099632_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 12.88 sec HDFS Read: 221000721 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 880 msec
OK
_c0
18000000
Time taken: 38.651 seconds, Fetched: 1 row(s)
Key log lines
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 12.88 sec HDFS Read: 221000721 HDFS Write: 9
Because an LZO file is not splittable without an index, even though the access_lzo table's data (210.8 MB) exceeds one HDFS block (128 MB), a single map task still reads the entire file.
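You can confirm the block layout yourself: with a 128 MB block size, the ~211 MB file spans two HDFS blocks, yet it was read as a single input split. A minimal check:
# Expect 2 blocks for a ~211 MB file with 128 MB blocks
hadoop fsck /user/hive/warehouse/access_lzo/access.log.lzo -files -blocks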
V. LZO after creating an index
1. Create an index for the access_lzo table
hadoop jar $HADOOP_HOME/share/hadoop/common/hadoop-lzo.jar \
com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/access_lzo
After indexing, an extra index file appears in the table's HDFS directory:
[hadoop@hadoop000 accesslog]$ hadoop fs -ls /user/hive/warehouse/access_lzo
Found 2 items
-rwxr-xr-x 1 hadoop supergroup 220993815 2019-04-20 19:40 /user/hive/warehouse/access_lzo/access.log.lzo
-rw-r--r-- 1 hadoop supergroup 85208 2019-04-20 19:55 /user/hive/warehouse/access_lzo/access.log.lzo.index
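As an aside, indexing a single file in-process is fine here, but for directories with many large .lzo files the hadoop-lzo project also ships a MapReduce-based indexer:
# Runs the indexing as a distributed MR job instead of a local single process
hadoop jar $HADOOP_HOME/share/hadoop/common/hadoop-lzo.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/access_lzo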
Run the query again:
hive (default)> select count(*) from access_lzo;
Query ID = hadoop_20190420193636_7054c76b-ae04-458e-a745-bf74fda63f28
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1555760099632_0002, Tracking URL = http://hadoop000:8088/proxy/application_1555760099632_0002/
Kill Command = /home/hadoop/soul/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1555760099632_0002
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2019-04-20 19:56:59,101 Stage-1 map = 0%, reduce = 0%
2019-04-20 19:57:13,236 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 4.81 sec
2019-04-20 19:57:15,374 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 6.06 sec
2019-04-20 19:57:16,426 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 7.37 sec
2019-04-20 19:57:18,550 Stage-1 map = 21%, reduce = 0%, Cumulative CPU 8.66 sec
2019-04-20 19:57:19,593 Stage-1 map = 51%, reduce = 0%, Cumulative CPU 9.99 sec
2019-04-20 19:57:20,634 Stage-1 map = 52%, reduce = 0%, Cumulative CPU 11.79 sec
2019-04-20 19:57:23,889 Stage-1 map = 70%, reduce = 0%, Cumulative CPU 13.38 sec
2019-04-20 19:57:27,161 Stage-1 map = 76%, reduce = 0%, Cumulative CPU 13.98 sec
2019-04-20 19:57:30,426 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 14.67 sec
2019-04-20 19:57:32,516 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.94 sec
MapReduce Total cumulative CPU time: 15 seconds 940 msec
Ended Job = job_1555760099632_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 15.94 sec HDFS Read: 221044738 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 940 msec
OK
_c0
18000000
Time taken: 41.605 seconds, Fetched: 1 row(s)
Key log lines
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 15.94 sec HDFS Read: 221044738 HDFS Write: 9 SUCCESS
The number of mappers has increased to 2: with an index in place, the LZO file is splittable.
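Note that the index is per-file: whenever new .lzo files are loaded into the table's directory, the indexer must be run again before those files become splittable.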
For configuring LZO compression on a CDH cluster, refer to the separate article "CDH5集群配置lzo".