Big Data: Using Flume to Monitor Files and Directories and Upload Them to HDFS

Author: 小飞牛_666 | Published: 2019-05-14 17:36

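Flume can monitor log files, directories, ports, and other sources. Installation is simple: set the Java path in flume-env.sh and it is ready to use. Running bin/flume-ng prints its usage help. The remaining setup is not covered here; below, Flume is used to monitor the Hive log file and a newly created directory, and to upload their contents to the corresponding directories on HDFS.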
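As a sketch of the one installation step mentioned above: conf/flume-env.sh (Flume ships a flume-env.sh.template to copy it from) only needs a JAVA_HOME line. The JDK path below is an assumption for illustration; point it at wherever the JDK is installed on your machine.

```shell
# conf/flume-env.sh (sketch; the JDK path is hypothetical, adjust it to your machine)
export JAVA_HOME=/opt/module/jdk1.8.0_144

# Afterwards, from the Flume install directory, this should print the version:
#   bin/flume-ng version
```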

1. Monitoring the Hive log file

First, go into the apache-flume-1.5.0-cdh5.3.6-bin/conf directory and copy the template to a new file:

cp flume-conf.properties.template flume-hdfs.conf

The contents of flume-hdfs.conf are as follows:

# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a2'
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
# tail -F keeps following the file by name after the log is rotated
a2.sources.r2.command = tail -F /opt/module/apache-hive-1.2.1-bin/logs/hive.log
a2.sources.r2.shell = /bin/bash -c

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop101:9000/flume/%Y%m%d/%H
# Prefix for the names of the uploaded files
a2.sinks.k2.hdfs.filePrefix = events-hive-
# Whether to round the timestamp down (controls how often a new directory is used)
a2.sinks.k2.hdfs.round = true
# Round down to a multiple of this many time units
a2.sinks.k2.hdfs.roundValue = 1
# Time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour
# Use the agent's local time instead of a timestamp header on the event
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of events written before flushing to HDFS
# (kept at or below the channel's transactionCapacity)
a2.sinks.k2.hdfs.batchSize = 100
# File type; DataStream writes uncompressed data, CompressedStream enables compression
a2.sinks.k2.hdfs.fileType = DataStream
# Roll to a new file every 600 seconds
a2.sinks.k2.hdfs.rollInterval = 600
# Roll the file when it reaches this size in bytes (just under 128 MB)
a2.sinks.k2.hdfs.rollSize = 134217700
# 0 = never roll based on the number of events
a2.sinks.k2.hdfs.rollCount = 0
# Minimum number of block replicas
a2.sinks.k2.hdfs.minBlockReplicas = 1

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
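To get a feel for where the sink writes, the %Y%m%d/%H escapes in hdfs.path can be reproduced with date, since both use the same meaning for these four codes. This is only an illustration, assuming useLocalTimeStamp = true so that the agent machine's clock is what counts; flume_target_dir is a hypothetical helper name, not part of Flume.

```shell
# Print the HDFS directory the sink would use for events arriving right now.
# flume_target_dir is a hypothetical helper, not a Flume command.
flume_target_dir() {
    base="$1"    # e.g. hdfs://hadoop101:9000/flume
    printf '%s/%s\n' "$base" "$(date +%Y%m%d/%H)"
}

flume_target_dir hdfs://hadoop101:9000/flume
```

With roundUnit = hour and roundValue = 1 the timestamp is rounded down to the hour, so a new directory of this form appears once per hour in which events arrive.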

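Note that the NameNode address in a2.sinks.k2.hdfs.path (hdfs://hadoop101:9000 here) must match the file system address configured in Hadoop's core-site.xml (the fs.defaultFS property); otherwise Flume cannot write to HDFS.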
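One way to check this without opening both files by hand is sketched below. It pulls the scheme://host:port part out of the first hdfs.path entry in a Flume config so it can be compared with what Hadoop reports. sink_namenode_uri is a hypothetical helper name, and the comparison assumes the hdfs command is on the PATH.

```shell
# Extract scheme://host:port from the first hdfs.path entry in a Flume config file.
# Commented-out lines are skipped. sink_namenode_uri is a hypothetical helper name.
sink_namenode_uri() {
    sed -n 's|^[^#]*hdfs\.path *= *\(hdfs://[^/]*\).*|\1|p' "$1" | head -n 1
}

# Usage on the cluster:
#   flume_uri=$(sink_namenode_uri conf/flume-hdfs.conf)
#   hadoop_uri=$(hdfs getconf -confKey fs.defaultFS)
#   [ "$flume_uri" = "$hadoop_uri" ] && echo "match" || echo "mismatch"
```

If fs.defaultFS is written with a trailing slash, strip it before comparing.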

With the configuration in place, start the agent with the following command:

bin/flume-ng agent --conf conf/ --name a2 --conf-file conf/flume-hdfs.conf
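The value passed to --name has to be the agent name used as the prefix in the config file (a2 here); an agent started under a name the file does not define has no components to run. A minimal sketch of a pre-flight check follows; agent_defined is a hypothetical helper, and the -Dflume.root.logger=INFO,console option in the usage line sends Flume's own log to the terminal, which helps while testing.

```shell
# agent_defined NAME FILE: succeed if FILE declares sources for agent NAME.
# Hypothetical helper, not part of Flume.
agent_defined() {
    grep -Eq "^[[:space:]]*$1\.sources[[:space:]]*=" "$2"
}

# Usage, from the Flume install directory:
#   agent_defined a2 conf/flume-hdfs.conf && \
#       bin/flume-ng agent --conf conf/ --name a2 \
#           --conf-file conf/flume-hdfs.conf -Dflume.root.logger=INFO,console
```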

Start Hive and run a statement so that it writes to its log file:

[hw@hadoop101 apache-hive-1.2.1-bin]$ bin/hive
hive> create database luchangyin;

Next, open http://hadoop101:50070/ in a browser. The log data has been picked up and uploaded to HDFS.
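The same check can be done from the command line with hdfs dfs -ls -R /flume. While the sink is still writing a file, the file carries an in-use suffix (.tmp by default) that is dropped when the file is rolled. The sketch below filters a listing for such files; in_progress_files is a hypothetical helper, and it assumes the path is the last field of each listing line with no spaces in it.

```shell
# in_progress_files: read `hdfs dfs -ls -R` output on stdin and print the paths
# that still carry the sink's in-use suffix (.tmp by default).
# Hypothetical helper; assumes paths contain no spaces.
in_progress_files() {
    awk '$NF ~ /\.tmp$/ { print $NF }'
}

# Usage on the cluster:
#   hdfs dfs -ls -R /flume | in_progress_files
```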

2. Monitoring an entire directory and uploading it to HDFS

First, create the directory to be monitored:

[hw@hadoop101 apache-flume-1.5.0-cdh5.3.6-bin]$ mkdir upload

Create and edit a custom file named flume-dir.conf in the conf directory:

# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a3'

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/apache-flume-1.5.0-cdh5.3.6-bin/upload
a3.sources.r3.fileHeader = true
# Ignore all files ending in .tmp; they are not uploaded
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop101:9000/flume/upload/%Y%m%d/%H
# Prefix for the names of the uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Whether to round the timestamp down (controls how often a new directory is used)
a3.sinks.k3.hdfs.round = true
# Round down to a multiple of this many time units
a3.sinks.k3.hdfs.roundValue = 1
# Time unit used for rounding
a3.sinks.k3.hdfs.roundUnit = hour
# Use the agent's local time instead of a timestamp header on the event
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Number of events written before flushing to HDFS
# (kept at or below the channel's transactionCapacity)
a3.sinks.k3.hdfs.batchSize = 100
# File type; DataStream writes uncompressed data, CompressedStream enables compression
a3.sinks.k3.hdfs.fileType = DataStream
# Roll to a new file every 600 seconds
a3.sinks.k3.hdfs.rollInterval = 600
# Roll the file when it reaches this size in bytes (just under 128 MB)
a3.sinks.k3.hdfs.rollSize = 134217700
# 0 = never roll based on the number of events
a3.sinks.k3.hdfs.rollCount = 0
# Minimum number of block replicas
a3.sinks.k3.hdfs.minBlockReplicas = 1

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
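One relationship in configs like this is worth checking: a sink cannot take more events in a single transaction than the memory channel's transactionCapacity allows, so hdfs.batchSize is normally kept at or below it. The sketch below reads both values from a config file and compares them. prop and batch_fits are hypothetical helper names, and the parsing assumes simple key = value lines with numeric values.

```shell
# prop FILE KEY: print the value of KEY from a Flume properties file.
prop() {
    awk -F= -v key="$2" '
        { k = $1; gsub(/[[:space:]]/, "", k) }
        k == key { v = $2; gsub(/[[:space:]]/, "", v); print v; exit }
    ' "$1"
}

# batch_fits FILE SINK_KEY CHANNEL_KEY: succeed if the sink batch size
# is less than or equal to the channel transaction capacity.
batch_fits() {
    [ "$(prop "$1" "$2")" -le "$(prop "$1" "$3")" ]
}

# Usage:
#   batch_fits conf/flume-dir.conf a3.sinks.k3.hdfs.batchSize \
#       a3.channels.c3.transactionCapacity || echo "batchSize exceeds transactionCapacity"
```

A missing key also makes batch_fits fail, so a failure means "check the file", not necessarily "the batch is too large".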

Next, start the agent (the trailing & runs it in the background):

bin/flume-ng agent --conf conf/ --name a3 --conf-file conf/flume-dir.conf &

Then use the following command to copy files into the monitored directory:

[hw@hadoop101 upload]$ cp -a /opt/datas/* ./
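For large files, a plain cp can leave a half-written file visible to the source while the copy is still running. A common workaround, sketched here under the assumption that the .tmp ignorePattern from flume-dir.conf is in effect, is to copy under a .tmp name and rename once the copy has finished. spool_put is a hypothetical helper name.

```shell
# spool_put FILE DIR: copy FILE into the spooling directory DIR under a .tmp
# name (skipped by ignorePattern), then rename it once the copy is complete.
# Hypothetical helper; handles regular files only.
spool_put() {
    name=$(basename "$1")
    cp "$1" "$2/$name.tmp" && mv "$2/$name.tmp" "$2/$name"
}

# Usage:
#   for f in /opt/datas/*; do
#       spool_put "$f" /opt/module/apache-flume-1.5.0-cdh5.3.6-bin/upload
#   done
```

The rename happens inside one directory, so the file appears under its final name only when it is complete.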

The result can now be checked in the HDFS web UI.

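Summary:
Notes on using the Spooling Directory Source:
1. Do not create a file in the monitored directory and then keep modifying it; a file must be complete before it is placed there.
2. Files that have been fully uploaded are renamed with a .COMPLETED suffix.
3. The monitored directory is scanned for changes every 500 milliseconds.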
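Note 2 gives a simple way to see what is still waiting: anything in the directory without the .COMPLETED suffix has not been fully consumed yet. A minimal sketch follows, assuming the default suffix and the .tmp ignore rule used in flume-dir.conf; spool_pending is a hypothetical helper name.

```shell
# spool_pending DIR: list regular files in DIR that Flume has not yet renamed
# to *.COMPLETED, skipping *.tmp files that the source ignores.
spool_pending() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        case "$f" in
            *.COMPLETED|*.tmp) ;;
            *) basename "$f" ;;
        esac
    done
}

# Usage:
#   spool_pending /opt/module/apache-flume-1.5.0-cdh5.3.6-bin/upload
```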

Article link: https://www.haomeiwen.com/subject/elreaqtx.html
