1) array, map, struct
2) meta
3) join
4) compression
Flume
RDBMS ==> Sqoop ==> Hadoop
Logs: scattered across many servers ==> ??? ==> Hadoop (Flume fills this gap)
Flume is a distributed, reliable, and available service
for efficiently collecting, aggregating, and moving large amounts of log data.
It has a simple and flexible architecture based on streaming data flows.
collecting ==> source
aggregating ==> channel (a place to temporarily stage the collected data)
moving ==> sink
Flume: write a configuration file that wires a source, a channel, and a sink together.
Agent: made up of sources, channels, and sinks.
Writing a Flume configuration file is really just describing how an agent is composed.
Flume is a framework for collecting and aggregating log data: it moves logs from place A to place B.
Flume deployment
1) Download
2) Extract to ~/app
3) Add to the environment variables in ~/.bash_profile
export FLUME_HOME=/home/hadoop/app/apache-flume-1.6.0-cdh5.7.0-bin
export PATH=$FLUME_HOME/bin:$PATH
4) Set JAVA_HOME in $FLUME_HOME/conf/flume-env.sh
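As a sketch in shell, assuming the tarball is already downloaded to ~/app and a hypothetical JDK path:

# 2) extract
cd ~/app
tar -zxvf apache-flume-1.6.0-cdh5.7.0-bin.tar.gz

# 3) append to ~/.bash_profile, then run: source ~/.bash_profile
export FLUME_HOME=/home/hadoop/app/apache-flume-1.6.0-cdh5.7.0-bin
export PATH=$FLUME_HOME/bin:$PATH

# 4) point Flume at the JDK (path is hypothetical)
cp $FLUME_HOME/conf/flume-env.sh.template $FLUME_HOME/conf/flume-env.sh
echo 'export JAVA_HOME=/home/hadoop/app/jdk1.8.0_144' >> $FLUME_HOME/conf/flume-env.sh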
flume-og: the original generation of Flume, now obsolete
flume-ng: the next generation; its flume-ng command starts an agent:
./flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file /home/hadoop/script/flume/simple-flume.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34343
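With -Dflume.monitoring.type=http set as above, the agent serves its counters as JSON while it runs:

curl http://localhost:34343/metrics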
--name agent_name: the name passed on the command line must match the agent name used in the configuration file (a1 here); r1, k1, c1 name the agent's source, sink, and channel. Every configuration file follows this template:

# Name the components on this agent
<agent>.sources = <source>
<agent>.sinks = <sink>
<agent>.channels = <channel>

# Set each component's type
<agent>.sources.<source>.type = xx
<agent>.sinks.<sink>.type = yyy
<agent>.channels.<channel>.type = zzz

# Bind the source and sink to the channel
<agent>.sources.<source>.channels = <channel>
<agent>.sinks.<sink>.channel = <channel>

First example: collect data from a given network port and print it to the console:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
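With the agent started, send it a line from another terminal; every line received on port 44444 becomes one event at the logger sink:

telnet localhost 44444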
Event: a single piece of data as it flows through Flume
Event: { headers:{} body: 72 75 6F 7A 65 64 61 74 61 0D ruozedata. }
Event = headers + body (a byte array)
Which sources, channels, and sinks does Flume support?
source
avro
exec : tail -F xx.log
Spooling Directory
Taildir
netcat
sink
HDFS
logger
avro : used together with an avro source
kafka
channel
memory
file
Agent: wire source, channel, and sink together in whatever combination the use case needs:
Collect content newly appended to a file into HDFS:
exec - memory - hdfs
Collect the files dropped into a directory:
spooling - memory - hdfs
Write file data to Kafka (see the sketch after this list):
exec - memory - kafka
exec - memory - hdfs ==> Spark/Hive/MR ETL ==> HDFS <== analysis
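A minimal sketch of the exec - memory - kafka combination, assuming the Flume 1.6 Kafka sink property names; the broker address, topic, and tailed file are hypothetical:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# tail a file (path is hypothetical)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/data.log

# Flume 1.6 Kafka sink; broker list and topic are hypothetical
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList = hadoop000:9092
a1.sinks.k1.topic = flume_kafka

a1.channels.c1.type = memory

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1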
Requirement: collect the contents of a given file into HDFS
Technology choice: exec - memory - hdfs (a sketch of the config follows)
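A minimal sketch of what exec-memory-hdfs.conf could look like; the tailed file, the NameNode address, and the HDFS path are assumptions:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# tail the file being collected (path is hypothetical)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/data/data.log

# write plain-text files to HDFS (address and path are hypothetical)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop000:8020/flume/tail
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

a1.channels.c1.type = memory

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1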
./flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file /home/hadoop/script/flume/exec-memory-hdfs.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34343
Requirement: collect the contents of a given directory to the console
Technology choice: spooling - memory - logger (a sketch of the config follows)
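A minimal sketch of spooling-memory-logger.conf; the watched directory is an assumption. The Spooling Directory source expects files to be complete when dropped in, and renames each one with a .COMPLETED suffix once ingested:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# watch a directory for new files (path is hypothetical)
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/data/spool

a1.sinks.k1.type = logger

a1.channels.c1.type = memory

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1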
./flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file /home/hadoop/script/flume/spooling-memory-logger.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34343
Requirement: tail multiple files and resume where it left off after a restart
Technology choice: taildir - memory - logger (a sketch of the config follows)
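A minimal sketch of taildir-memory-logger.conf. The Taildir source records read offsets in a JSON position file so it can resume after a restart (it is in upstream Flume from 1.7 on; the paths below are assumptions):

a1.sources = r1
a1.sinks = k1
a1.channels = c1

# tail every file matching the group pattern; offsets go to the position file (paths are hypothetical)
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /home/hadoop/data/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /home/hadoop/data/logs/.*log

a1.sinks.k1.type = logger

a1.channels.c1.type = memory

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1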
./flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file /home/hadoop/script/flume/taildir-memory-logger.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=http \
-Dflume.monitoring.port=34343