flume

Author: 夙夜M | Published 2017-08-29 11:35

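Data (log) collection

Moving data from server A to server B

Simple approaches:

1) For small data volumes, use a command such as scp xxx

2) Write Java/Python code to collect the logs; you also have to write code for monitoring and robustness, which is tedious

Drawbacks: when the scenario changes, the code must be rewritten, and the monitoring code has to be maintained as well

3) Home-grown code generally fits only a narrow scenario.

What Flume offers:

Collecting data from server A onto server B requires nothing more than a configuration file.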

Flume versions:

Flume OG 0.9

Flume NG 1.x (the version used in production)

Version used here: flume-ng-1.5.0-cdh5.2.0.tar

What Flume is:

Sqoop, Azkaban, Kafka, and Flume are all small utility tools; how to use each one still has to be worked out for the specific scenario.

Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

It uses a simple extensible data model that allows for online analytic application.

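Flume components

Flume has only one role, the Agent, comparable to a broker in Kafka.

An Agent has three parts:

source: collects data (comparable to a producer in Kafka) and sends it to the channel

sink: takes data from the channel and writes it to a destination such as HDFS

channel: the conduit that connects the source and the sink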

Flume clusters

In order to flow the data across multiple agents or hops, the sink of the previous agent and source of the current hop need to be avro type with the sink pointing to the hostname (or IP address) and port of the source.

A very common scenario in log collection is a large number of log producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers sent to a dozen of agents that write to HDFS cluster.

Multiple Flume agents collecting data and consolidating it

This can be achieved in Flume by configuring a number of first tier agents with an avro sink, all pointing to an avro source of single agent (Again you could use the thrift sources/sinks/clients in such a scenario). This source on the second tier agent consolidates the received events into a single channel which is consumed by a sink to its final destination.
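The two-tier layout described above could be configured roughly as follows. This is only a sketch: the agent names tier1 and tier2, the host name collector-host, port 4545, and the log path are all assumptions made for illustration, and a logger sink stands in for the real final destination. In practice each agent would normally have its own file on its own machine.

```properties
# First-tier agent, one per web server (agent name "tier1" is hypothetical)
tier1.sources = r1
tier1.channels = c1
tier1.sinks = k1

# Hypothetical local log file to follow
tier1.sources.r1.type = exec
tier1.sources.r1.command = tail -F /var/log/app.log
tier1.sources.r1.channels = c1

tier1.channels.c1.type = memory

# The avro sink points at the host and port of the second-tier avro source
tier1.sinks.k1.type = avro
tier1.sinks.k1.hostname = collector-host
tier1.sinks.k1.port = 4545
tier1.sinks.k1.channel = c1

# Second-tier agent that consolidates events (agent name "tier2" is hypothetical)
tier2.sources = r1
tier2.channels = c1
tier2.sinks = k1

# The avro source listens on the port the first-tier sinks send to
tier2.sources.r1.type = avro
tier2.sources.r1.bind = 0.0.0.0
tier2.sources.r1.port = 4545
tier2.sources.r1.channels = c1

tier2.channels.c1.type = memory

# Stand-in destination; a real deployment would use e.g. an hdfs sink here
tier2.sinks.k1.type = logger
tier2.sinks.k1.channel = c1
```

Every first-tier agent uses the same sink settings, so all of them deliver into the single channel of the second-tier agent.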

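Case 1:

Configuring and using Flume in a business scenario

A source that can watch a directory: spooling directory

A sink that can display data on the screen: the logger sink

The configuration file uses the Java properties format; by convention it is given a .properties suffix

1. Define the agent a1 and name its components: a1.sources, a1.channels, a1.sinks

2. Configure a source of type spooldir

Fixed format: a1.sources.r1.type=spooldir

3. Configure the channel

4. Configure the sink

Start with a logger sink, with the log output directed to the console

5. Wire the three parts together

Binding keys: the source uses channels (plural), the sink uses channel (singular)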
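Put together, the five steps might give a conf/spooldir.properties like the one below. It is a minimal sketch: the component names r1, c1, k1, the watched directory, and the channel capacities are assumptions, not values from the original notes.

```properties
# 1. Define the agent a1 and name its components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# 2. spooldir source watching a directory (the path is hypothetical)
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/flume1705/spool

# 3. In-memory channel (the capacity values are illustrative)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# 4. logger sink, which writes each event to the Flume log
a1.sinks.k1.type = logger

# 5. Wire the parts together: "channels" on the source, "channel" on the sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Files dropped into the watched directory are read once; they should be complete before they are moved in and must not be modified afterwards.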

Run Flume

bin/flume-ng agent --conf conf --conf-file conf/spooldir.properties --name a1 -Dflume.root.logger=INFO,console

--conf-file: specifies our properties configuration file

--name a1: the agent name used in the configuration file

-Dflume.root.logger=INFO,console: logs at INFO level to the console

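Question: why is an agent split into three parts?

Because with three separate parts, they can be combined freely.

For example: a source can watch various directories

a sink can write data to various platforms

a channel can be memory-backed or disk-backed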
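As an illustration of this mix-and-match design, a memory channel can be swapped for a disk-backed one by changing only the channel lines; the source and sink settings stay as they are. The channel name c1 and the two directories below are assumptions.

```properties
# Disk-backed channel in place of "type = memory"
a1.channels.c1.type = file
# Hypothetical directories for the checkpoint and the data files
a1.channels.c1.checkpointDir = /home/hadoop/flume1705/channel/checkpoint
a1.channels.c1.dataDirs = /home/hadoop/flume1705/channel/data
```

A file channel keeps buffered events across an agent restart, at the cost of lower throughput than a memory channel.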

Requirement 2: follow the latest entries in tomcat.log

tail -f tomcat.log

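This is what the exec source is for

Three parameters are required

type: exec

command: tail -F /home/hadoop/flume1705/tomcat.log

channels: set last

sink

type: hdfs

hdfs.path=/bigdata/%y-%m-%d/%H%M

hdfs.filePrefix=aura-

Whether to round the directory timestamp: the settings below round it down to 10-minute boundaries

Rounding means the event timestamp is rounded down to a multiple of the configured interval before it is substituted into the path, so all events from the same 10-minute window land in the same directory.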

hdfs.round=true

hdfs.roundValue=10

hdfs.roundUnit=minute

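A new directory, named by its time suffix, is created every 10 minutes

Notes on the file parameters

hdfs.useLocalTimeStamp=true uses the local time of the agent instead of a timestamp from the event header

hdfs.fileType=DataStream writes the events as a plain data stream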
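Collected into one file, the settings above might form a conf/tailcat.properties like this sketch. The source and sink values come from the notes; the component names r1, c1, k1 and the choice of a memory channel are assumptions.

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# exec source: follow the Tomcat log by file name
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/flume1705/tomcat.log

# Channel type is an assumption; the notes do not say which one was used
a1.channels.c1.type = memory

# hdfs sink: time-based directories, rounded down to 10-minute boundaries
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /bigdata/%y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = aura-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream

# Bind the source and the sink to the channel last
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

With these values, an event handled at 11:37 would be written under a directory ending in 1130.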

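Using the exec source

tail -F: follows the file name

tail -f: follows the file's unique id

When a log is rotated, the old file keeps its id and is only renamed with a time suffix, while a new file is created under the original name; tail -F must therefore be used so that the new file is followed

Run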

bin/flume-ng agent --conf conf --conf-file conf/tailcat.properties --name a1 -Dflume.root.logger=INFO,console

Data warehouse layers

ODS DM DW

Technologies used in a project and where each applies

Article link: https://www.haomeiwen.com/subject/twandxtx.html
