Learning Big Data: Flume

Author: Movle | Published: 2019-11-23 18:40


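I. Flume overview

1. Overview:

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple, flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.

2. Big data architecture

Data collection (web crawlers / log data / Flume)
Data storage (HDFS / Hive / HBase (NoSQL))
Data computation (MapReduce / Hive / Spark SQL / Spark Streaming / Flink)
Data visualization

3. Flume is built on a streaming architecture; it is fault tolerant, flexible, and simple.

4. Flume and Kafka are used for real-time data collection, Spark and Flink for real-time processing, and Impala for real-time queries.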

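II. Flume roles

1. source

The data source, used to collect data. A source produces a stream of events and passes that stream to the channel.

2. channel

The transport pipe that bridges the source and the sink.

3. sink

The sink drains the events carried by the channel and delivers them to the destination.

4. event

An event is the basic unit of data transfer in Flume.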
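
To make the event concept concrete, here is a minimal Python sketch of the shape of an event: a byte payload plus a set of string headers. The names `Event` and `make_event` are illustrative only; this is not Flume's Java API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Illustrative stand-in for a Flume event: a byte payload plus string headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def make_event(line: str, **headers: str) -> Event:
    """Wrap one line of log text in an event, encoding the text as UTF-8 bytes."""
    return Event(body=line.encode("utf-8"), headers=dict(headers))


event = make_event("are you ok", host="localhost")
print(event.headers, event.body)
```

Keeping the body as opaque bytes means the transport does not need to understand the log format; routing decisions can be made from the headers alone.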

Figure: Flume roles

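5. Commonly used Flume types

(1) source

Name	Meaning	Notes
avro	Receives data over the Avro protocol	Mainly used for agent-to-agent connections
exec	Runs a Unix command	Can watch a file with a command such as tail -F
spooldir	Watches a directory	The directory must not contain subdirectories; Windows directories are not watched; a file must not be written to after it has been handed over; file names must not collide
TAILDIR	Can watch both files and directories	Supports resuming from the last read position; this is the one to prefer
netcat	Listens on a port
kafka	Consumes data from Kafka

(2) sink

Name	Meaning	Notes
kafka	Writes data to Kafka
HDFS	Writes data to HDFS
logger	Prints events to the console
avro	Sends data over the Avro protocol	Used together with the avro source

(3) channel:

Name	Meaning	Notes
memory	Stores events in memory
kafka	Stores events in Kafka
file	Stores events in files on the local disk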
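6. Flume startup parameters

(1) Commands

Parameter	Description
help	Print help information
agent	Run a Flume agent
avro-client	Run an Avro Flume client
version	Show the Flume version

(2) Global options

Parameter	Description
--conf,-c <conf>	Use the configuration files in the <conf> directory, i.e. the directory that holds the configuration
--classpath,-C <cp>	Append to the classpath
--dryrun,-d	Do not actually start the agent; only print information about the command
--plugins-path <dirs>	List of plugin directories. Default: $FLUME_HOME/plugins.d
-Dproperty=value	Set a Java system property
-Xproperty=value	Set a Java -X option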

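(3) Agent options

Parameter	Description
--conf-file,-f <file>	The agent configuration file to load (required); in this article it is kept in the directory given by --conf
--name,-n <name>	The name of the agent (required)
--help,-h	Show help

Logging:

-Dflume.root.logger=INFO,console

This parameter sends Flume's log output to the console. To write to a log file instead (by default under FLUME_HOME/logs), change console to LOGFILE. The details can be adjusted in $FLUME_HOME/conf/log4j.properties.

-Dflume.log.file=./wchatAgent.logs

This parameter sets the file that the log output is written to.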

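(4) Avro client options

Parameter	Description
--rpcProps,-P <file>	Properties file with the connection parameters
--host,-H <host>	Hostname to which events are sent
--port,-p <port>	Port of the Avro source
--dirname <dir>	Directory whose files are streamed to the Avro source
--filename,-F <file>	File to stream to the Avro source
--headerFile,-R <file>	File containing event headers as key/value pairs, one per line

To start the Avro client, specify either --rpcProps or both --host and --port.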

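III. Flume transfer process:

A source watches a file or data stream. When the data source produces new data, the source wraps it in an Event, puts the Event into the channel, and commits. The channel is a first-in, first-out queue. The sink pulls events from the channel queue and writes them to the destination, for example HDFS.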
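
The put, commit, and take steps can be sketched with a toy in-memory queue. This is a simplified model under stated assumptions, not Flume's implementation: the class `MemoryChannel` and its methods `put_batch` and `take_batch` are hypothetical names, and the two limits mirror the `capacity` and `transactionCapacity` settings used by the memory channel configurations in this article.

```python
from collections import deque


class MemoryChannel:
    """Toy FIFO channel with a total capacity and a per-transaction limit."""

    def __init__(self, capacity=1000, transaction_capacity=100):
        self.capacity = capacity
        self.transaction_capacity = transaction_capacity
        self._queue = deque()

    def put_batch(self, events):
        """Source side: commit a batch only if the whole batch fits."""
        if len(events) > self.transaction_capacity:
            raise ValueError("batch exceeds the transaction capacity")
        if len(self._queue) + len(events) > self.capacity:
            raise OverflowError("channel is full; nothing was committed")
        self._queue.extend(events)  # commit: the batch becomes visible to the sink

    def take_batch(self, max_events):
        """Sink side: remove up to max_events in arrival order."""
        count = min(max_events, self.transaction_capacity, len(self._queue))
        return [self._queue.popleft() for _ in range(count)]

    def __len__(self):
        return len(self._queue)


channel = MemoryChannel(capacity=5, transaction_capacity=3)
channel.put_batch(["11", "22", "33"])   # the source puts and commits
delivered = channel.take_batch(2)       # the sink pulls the oldest events first
print(delivered, len(channel))
```

Rejecting a batch as a whole, rather than accepting part of it, is what lets the source retry without duplicating or losing events in this model.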

IV. Flume installation and configuration

1. Download
2. Upload to Linux: /opt/software

3. Extract

cd /opt/software

tar -zxvf apache-flume-1.6.0-bin.tar.gz -C /opt/module

4. Rename

mv /opt/module/apache-flume-1.6.0-bin /opt/module/flume

cd /opt/module/flume/conf

mv flume-env.sh.template flume-env.sh

5. Edit the configuration

vi flume-env.sh

Change the following:

export JAVA_HOME=/opt/module/jdk1.8.0_144

V. Listening on a port with Flume

1. Create the configuration file flumejob_telnet.conf

# sample.conf: A single-node Flume configuration

# Name the components on this agent; the plural keys can list several components of each role
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# This source listens on a TCP port, so its type must be netcat
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink: write events to the agent's log
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory (file is the alternative)
# capacity: 1000 events in total; transactionCapacity: at most 100 events per transaction
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
# A source can be bound to several channels; a sink can be bound to only one channel
# (this is the model shown in figure 2)
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

2. Upload it to /opt/module/flume/conf

3. Start command:

bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/flumejob_telnet.conf -Dflume.root.logger=INFO,console

bin/flume-ng agent     //start an agent with flume-ng
--conf conf/           //directory that holds the configuration
--name a1              //name of the agent
--conf-file conf/flumejob_telnet.conf   //configuration file to load
-Dflume.root.logger=INFO,console       //log at INFO level to the console
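
When several agents are started from scripts, it can help to assemble the command from its parts. The following Python sketch only builds the argument list from the options described above; the helper name `build_agent_command` and its parameters are hypothetical and not part of Flume.

```python
import shlex


def build_agent_command(name, conf_file, conf_dir="conf/", log_to_console=True):
    """Assemble the argument list for `bin/flume-ng agent` from its parts."""
    args = [
        "bin/flume-ng", "agent",
        "--conf", conf_dir,        # directory that holds the configuration
        "--name", name,            # must match the agent prefix used in the file
        "--conf-file", conf_file,  # configuration file to load
    ]
    if log_to_console:
        args.append("-Dflume.root.logger=INFO,console")
    return args


command = build_agent_command("a1", "conf/flumejob_telnet.conf")
print(" ".join(shlex.quote(part) for part in command))
```

The returned list can be passed directly to `subprocess.run` from the Flume home directory, which avoids shell quoting problems.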

4. Test
(1) Install telnet to send data to the port (netcat also works)

yum install nc

yum search telnet

yum install telnet.x86_64

(2) Start telnet and type some input

telnet localhost 44444   //connect

11
22
33
are you ok       //input

(3) Check the agent's console: each line typed above should be logged as an Event
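
The same test can be scripted instead of typed into telnet. The sketch below assumes the agent's netcat source is listening on localhost:44444 as configured in this article, and sends newline-terminated lines over a plain TCP socket. The function name `send_lines` is illustrative.

```python
import socket


def send_lines(host, port, lines, timeout=5.0):
    """Open a TCP connection and send each line terminated by a newline.

    Returns the number of lines sent. The netcat source is assumed to turn
    every newline-terminated line into one event.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        for line in lines:
            conn.sendall(line.encode("utf-8") + b"\n")
    return len(lines)
```

With the agent running, calling `send_lines("localhost", 44444, ["11", "22", "33", "are you ok"])` should produce the same console output as the telnet session.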

VI. Collecting a local Linux file into HDFS with Flume

1. Create the configuration file flumejob_hdfs.conf and upload it (used to watch Hive's operation log)

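# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source: watch a local file
# exec runs a command to read the file; tail -F follows it in real time
a1.sources.r1.type = exec
# The command to run; tail prints the last 10 lines by default (see man tail)
a1.sources.r1.command = tail -F /tmp/root/hive.log
# The shell used to run the command; -c passes the command as a string
# whereis bash
# bash: /usr/bin/bash /usr/share/man/man1/bash.1.gz
a1.sources.r1.shell = /usr/bin/bash -c

# Describe the sink
# <namenode-host> stands for the host name of the NameNode
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://<namenode-host>:9000/flume/%Y%m%d/%H
# Prefix for the uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to round the timestamp down when building the directory path
a1.sinks.k1.hdfs.round = true
# Round down to a multiple of this many time units
a1.sinks.k1.hdfs.roundValue = 1
# The time unit for rounding: second, minute or hour
# (the path above only goes down to %H, so a new directory appears once per hour)
a1.sinks.k1.hdfs.roundUnit = minute
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
# (kept at or below the channel's transactionCapacity)
a1.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes plain output, CompressedStream supports compression
a1.sinks.k1.hdfs.fileType = DataStream
# How often to roll to a new file, in seconds
a1.sinks.k1.hdfs.rollInterval = 30
# File size that triggers a roll, in bytes (just under 128 MB)
a1.sinks.k1.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a1.sinks.k1.hdfs.rollCount = 0
# Minimum number of block replicas: 1, i.e. no redundancy required here
# (Hadoop handles replication itself; this setting affects when files roll)
a1.sinks.k1.hdfs.minBlockReplicas = 1

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1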
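
How the `%Y%m%d/%H` escapes and the round settings interact is easier to see with a small calculation. The Python sketch below is a simplified model, not Flume's code: it rounds a timestamp down to a multiple of the configured unit and then expands the escapes with `strftime`. The function names `round_down` and `expand_path` are hypothetical, and the model assumes the round value divides the unit evenly (for example 1, 10 or 30 minutes).

```python
from datetime import datetime, timedelta

_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}


def round_down(ts, round_value, round_unit):
    """Round a timestamp down to a multiple of round_value units, counted from midnight."""
    step = round_value * _UNIT_SECONDS[round_unit]
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds())
    return midnight + timedelta(seconds=elapsed - elapsed % step)


def expand_path(template, ts, round_value=1, round_unit="minute"):
    """Expand %Y, %m, %d, %H and %M escapes using the rounded timestamp."""
    return round_down(ts, round_value, round_unit).strftime(template)


ts = datetime(2019, 11, 23, 18, 47, 37)
hourly = expand_path("/flume/%Y%m%d/%H", ts)
ten_minute = expand_path("/flume/%Y%m%d/%H%M", ts, round_value=10)
print(hourly)
print(ten_minute)
```

With a path that stops at `%H`, rounding to the minute does not change the directory; the rounding only becomes visible once the path contains a finer escape such as `%M`.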

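2. Copy the Hadoop dependency jars into Flume's lib directory so that the HDFS sink can load them

/opt/module/flume/lib    copy them into this directory

commons-configuration-1.6.jar
commons-io-2.4.jar
hadoop-auth-2.8.4.jar
hadoop-common-2.8.4.jar
hadoop-hdfs-2.8.4.jar
htrace-core4-4.0.1-incubating.jar

3. Start

bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/flumejob_hdfs.conf

4. Verify

Run some Hive operations so that /tmp/root/hive.log grows, then check that files with the logs- prefix appear under /flume on HDFS.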

VII. Watching a directory

1. Create the configuration file flumejob_dir.conf

# Name the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
# The directory to watch
a1.sources.r1.spoolDir = /root/testdir
# Suffix appended to a file once it has been uploaded
a1.sources.r1.fileSuffix = .COMPLETED
# Whether to add a header with the file's absolute path (default: false)
a1.sources.r1.fileHeader = true
# Ignore (do not upload) files ending in .tmp, which are still being written
# [^ ]* matches any run of non-space characters, followed by a literal .tmp
a1.sources.r1.ignorePattern = ([^ ]*\.tmp)

# Describe the sink: write to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://bigdata121:9000/flume/testdir/%Y%m%d/%H
# Prefix for the uploaded files
a1.sinks.k1.hdfs.filePrefix = testdir-
# Whether to round the timestamp down when building the directory path
a1.sinks.k1.hdfs.round = true
# Round down to a multiple of this many time units
a1.sinks.k1.hdfs.roundValue = 1
# The time unit for rounding
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# File type; DataStream writes plain output, CompressedStream supports compression
a1.sinks.k1.hdfs.fileType = DataStream
# How often to roll to a new file, in seconds
a1.sinks.k1.hdfs.rollInterval = 600
# File size that triggers a roll, roughly 128 MB
a1.sinks.k1.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a1.sinks.k1.hdfs.rollCount = 0
# Minimum number of block replicas
a1.sinks.k1.hdfs.minBlockReplicas = 1

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
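
The ignore pattern in the configuration above can be checked against sample file names before the agent is deployed. This Python sketch assumes the pattern must match the whole file name, which is why it uses `re.fullmatch`; the helper name `is_ignored` and the sample names are illustrative.

```python
import re

# Same expression as the ignorePattern value in the configuration above.
IGNORE_PATTERN = re.compile(r"([^ ]*\.tmp)")


def is_ignored(file_name):
    """Return True when the whole file name matches the ignore pattern."""
    return IGNORE_PATTERN.fullmatch(file_name) is not None


for name in ["access.log", "access.log.tmp", "my file.tmp", "tmp"]:
    print(name, is_ignored(name))
```

Note that a name containing a space, such as `my file.tmp`, is not matched by this pattern, so such a file would still be picked up.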

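2. Start

cd /opt/module/flume

bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/flumejob_dir.conf

3. Add files to the watched directory. Once a file has been ingested, it is renamed with the .COMPLETED suffix. Do not modify a file after it has been placed in the directory.

Note: the watched directory must not contain subdirectories.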

VIII. Multiple channels and sinks: watch a file and collect it to both HDFS and the local file system

1. Create three configuration files: flumejob_1.conf, flumejob_2.conf, flumejob_3.conf

# flumejob_1.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Replicate the data flow to every channel
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /tmp/root/hive.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sinks
# Send the data to two ports; each must match the bind host and port of a downstream Avro source
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = bigdata121
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = bigdata121
a1.sinks.k2.port = 4142

# Describe the channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sinks to the channels
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

# flumejob_2.conf
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
# Host and port on which to receive data
a2.sources.r1.bind = bigdata121
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://bigdata121:9000/flume2/%Y%m%d/%H

# Prefix for the uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to round the timestamp down when building the directory path
a2.sinks.k1.hdfs.round = true
# Round down to a multiple of this many time units
a2.sinks.k1.hdfs.roundValue = 1
# The time unit for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100

# File type; DataStream writes plain output, CompressedStream supports compression
a2.sinks.k1.hdfs.fileType = DataStream
# How often to roll to a new file, in seconds
a2.sinks.k1.hdfs.rollInterval = 600
# File size that triggers a roll, roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# 0 means rolling does not depend on the number of events
a2.sinks.k1.hdfs.rollCount = 0
# Minimum number of block replicas
a2.sinks.k1.hdfs.minBlockReplicas = 1

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

# flumejob_3.conf
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = bigdata121
a3.sources.r1.port = 4142

# Describe the sink: write rolling files to a local directory
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /root/flume2

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
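
The effect of the replicating selector in flumejob_1.conf can be pictured as a fan-out: every event from the source is copied into each channel, and each sink drains only its own channel. The Python sketch below is a conceptual model; the function `replicate` and the plain queues are illustrative stand-ins, not Flume classes.

```python
from collections import deque


def replicate(events, channels):
    """Copy every event from the source into each channel, preserving order."""
    for event in events:
        for queue in channels.values():
            queue.append(event)


channels = {"c1": deque(), "c2": deque()}
replicate(["line 1", "line 2"], channels)

# Each sink drains only its own channel, so both destinations see all events.
to_hdfs = list(channels["c1"])    # k1 -> Avro -> agent a2 -> HDFS
to_local = list(channels["c2"])   # k2 -> Avro -> agent a3 -> local directory
print(to_hdfs, to_local)
```

Because the channels are independent, a slow destination fills only its own channel in this model and does not delay the other one.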

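2. Start: start agent 1 first, then agents 2 and 3. The --name value must match the agent prefix used in each file (a1, a2, a3).
Because flumejob_3.conf writes to the local file system, the directory /root/flume2 must already exist on the local Linux host.

bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/flumejob_1.conf

bin/flume-ng agent --conf conf/ --name a2 --conf-file conf/flumejob_2.conf

bin/flume-ng agent --conf conf/ --name a3 --conf-file conf/flumejob_3.conf

3. Verify: start Hive and run some operations, then check the local directory /root/flume2 and the /flume2 directory on HDFS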

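IX. Interceptors (transform more, compute less, keep them lightweight):

1. Commonly used interceptors
2. Custom interceptors
(1) Write the custom interceptor program against Flume's interceptor package
(2) Package it
(3) Upload it to Linux
(4) Edit the flume.conf file
(5) Run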
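
Real Flume interceptors are written in Java against Flume's interceptor package, so the following Python sketch only illustrates the idea: each interceptor receives one event and either returns it (possibly modified) or drops it. Events are modelled as plain dictionaries, and all names here (`timestamp_interceptor`, `drop_blank_interceptor`, `apply_chain`) are hypothetical.

```python
import time


def timestamp_interceptor(event):
    """Add a millisecond timestamp header if the event does not have one yet."""
    event["headers"].setdefault("timestamp", str(int(time.time() * 1000)))
    return event


def drop_blank_interceptor(event):
    """Drop events whose body is empty or only whitespace; None means 'drop'."""
    return event if event["body"].strip() else None


def apply_chain(events, interceptors):
    """Run each event through the interceptors in order, skipping dropped events."""
    kept = []
    for event in events:
        for intercept in interceptors:
            event = intercept(event)
            if event is None:
                break
        else:
            kept.append(event)
    return kept


events = [
    {"headers": {}, "body": b"select * from t"},
    {"headers": {}, "body": b"   "},
]
result = apply_chain(events, [drop_blank_interceptor, timestamp_interceptor])
print(len(result), result[0]["body"])
```

Each step does a small header edit or a cheap filter, in line with the "transform more, compute less" guideline; heavier processing belongs in the downstream compute layer.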
