Deploying Ganglia to Monitor a Hadoop Cluster and Sending Alerts via Nagios

Author: 俺是亮哥 | Published: 2017-07-13 13:33

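    Overview
    Ganglia is an open-source cluster monitoring project started at UC Berkeley and designed to scale to thousands of nodes. Its core consists of gmond, gmetad, and a web frontend. It mainly monitors system performance: CPU, memory, disk utilization, I/O load, network traffic, system load, and so on. The graphs make the working state of each node easy to see, which helps a great deal when tuning and allocating system resources and improving overall performance.
    More importantly, HDFS, YARN, HBase, and similar components can already send the resource metrics of their daemons to Ganglia for monitoring.

    Nagios is an open-source system and network monitoring tool. It can monitor the state of Windows, Linux, and Unix hosts, network devices such as switches and routers, printers, and more. Especially useful is that it sends email or SMS alerts to the operations staff as soon as a system or service becomes abnormal, and sends a recovery notification once the state returns to normal.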

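    The reasoning behind the architecture used here:

    • 1. Ganglia's strengths are real-time monitoring data and a rich graphical interface, and it also supports mobile devices well, but its alerting when problems occur is comparatively weak.
    • 2. Nagios's strengths are powerful alerting when a problem occurs and when it recovers, but it is weak at real-time monitoring and graphical display, and it handles large clusters poorly.
    • 3. We need to monitor the resource usage of the Hadoop cluster (HDFS, YARN) that backs the data platform.

    So we combine the three. The architecture is as follows: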


    Versions used: Ubuntu 16.04 LTS, Ganglia 3.6.1, Nagios 4.1.1, Hadoop 2.7.3

    1. Deploy Ganglia:
    Install on the node that will serve the web UI:

    sudo apt-get update
    sudo apt install apache2 php libapache2-mod-php 
    sudo apt-get install rrdtool
    sudo apt-get install gmetad ganglia-webfrontend
    #If a dialog about restarting apache2 appears during installation, choose yes
    

    Install on each node to be monitored:

    sudo apt-get update
    sudo apt install php libapache2-mod-php 
    sudo apt-get install ganglia-monitor
    #If a dialog about restarting apache2 appears during installation, choose yes
    

    Perform the following steps on the master node:

    #Copy the Ganglia webfrontend Apache configuration:
    sudo cp /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled/ganglia.conf
    
    #Edit the gmetad configuration file
    sudo vi /etc/ganglia/gmetad.conf
    #Change the data source line, data_source "my cluster" localhost, to:
    data_source "bigdata cluster" 10  wl1:8649 wl2:8649 wl3:8649
    setuid_username "nobody"
    gridname "bigdata cluster"
    case_sensitive_hostnames 1
    all_trusted on
    
    #Run on the master node:
    sudo ln -s /usr/share/ganglia-webfrontend/ /var/www/ganglia
    

    Perform the following steps on all monitored nodes:

    #Edit the gmond configuration file
    sudo vi /etc/ganglia/gmond.conf
    globals {
      daemonize = yes
      setuid = yes
      user = ganglia
      debug_level = 0
      max_udp_msg_len = 1472
      mute = no
      deaf = no
      host_dmax = 0 /*secs */
      cleanup_threshold = 300 /*secs */
      gexec = no
      send_metadata_interval = 10
    }
    /* If a cluster attribute is specified, then all gmond hosts are wrapped inside
     * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
     * NOT be wrapped inside of a <CLUSTER> tag. */
    cluster {
      name = "bigdata cluster"
      owner = "ganglia"
      latlong = "unspecified"
      url = "unspecified"
    }
    
    /* The host section describes attributes of the host, like the location */
    host {
      location = "wl1"  #each node uses its own hostname
    }
    
    /* Feel free to specify as many udp_send_channels as you like.  Gmond
       used to only support having a single channel */
    udp_send_channel {
      #mcast_join = 239.2.11.71
      host = wl1  #every node points to the gmetad host
      port = 8649
      ttl = 1
    }
    
    /* You can specify as many udp_recv_channels as you like as well. */
    udp_recv_channel {
      #mcast_join = 239.2.11.71
      port = 8649
      #bind = 239.2.11.71
    }
    
    
    /* You can specify as many tcp_accept_channels as you like to share
       an xml description of the state of the cluster */
    tcp_accept_channel {
      port = 8649
    }
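
    Once gmond is running, its tcp_accept_channel answers any TCP client on port 8649 with an XML description of the cluster state, so the configuration above can be checked without the web UI. The following is a minimal Python 3 sketch of such a check and is not part of the original setup. The function names are illustrative, and it assumes the XML uses HOST and METRIC elements with NAME attributes, which is also what the check_ganglia.py plugin later in this article relies on.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host, port=8649, timeout=5.0):
    """Read the complete XML dump that gmond writes to a TCP client."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the peer closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_and_metrics(xml_bytes):
    """Map each HOST name to the sorted list of its METRIC names."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for host in root.iter("HOST"):
        names = sorted(m.get("NAME") for m in host.iter("METRIC"))
        result[host.get("NAME")] = names
    return result
```

    Calling hosts_and_metrics(fetch_xml("wl1")) on a machine that can reach wl1 should return one entry per node that is reporting to it.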
    

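    2. Collect the HDFS and YARN metrics of the Hadoop cluster:
    Perform the following steps on all Hadoop cluster nodes:

    #Edit hadoop-metrics2.properties
    vi hadoop-2.7.3/etc/hadoop/hadoop-metrics2.properties
    #Comment out all of the existing content and add the following: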
    *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
    *.sink.ganglia.period=10
    
    *.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
    *.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
    
    namenode.sink.ganglia.servers=wl1:8649
    resourcemanager.sink.ganglia.servers=wl1:8649
    
    datanode.sink.ganglia.servers=wl1:8649
    nodemanager.sink.ganglia.servers=wl1:8649
    
    jobhistoryserver.sink.ganglia.servers=wl1:8649
    
    maptask.sink.ganglia.servers=wl1:8649
    reducetask.sink.ganglia.servers=wl1:8649
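
    Because the per-daemon lines differ only in their prefix, the file can also be generated instead of edited by hand on every node. The Python 3 sketch below is one hypothetical way to do that. The function name and the DAEMONS list are assumptions made for illustration, and the optional slope and dmax lines from the listing above are left out for brevity.

```python
# Daemon prefixes taken from the properties listing above.
DAEMONS = [
    "namenode", "resourcemanager", "datanode", "nodemanager",
    "jobhistoryserver", "maptask", "reducetask",
]


def ganglia_sink_properties(collector, period=10, daemons=DAEMONS):
    """Return hadoop-metrics2.properties text that sends every daemon to one collector."""
    lines = [
        "*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31",
        "*.sink.ganglia.period=%d" % period,
    ]
    for daemon in daemons:
        lines.append("%s.sink.ganglia.servers=%s" % (daemon, collector))
    return "\n".join(lines) + "\n"
```

    For this article's cluster the call would be ganglia_sink_properties("wl1:8649"), and the result can be written to hadoop-metrics2.properties on each node.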
    
    

    Restart the Hadoop cluster, then restart gmond, gmetad, and gweb:

    hadoop-2.7.3/sbin/stop-all.sh
    hadoop-2.7.3/sbin/start-all.sh
    sudo /etc/init.d/ganglia-monitor restart  #all nodes: gmond service
    sudo /etc/init.d/gmetad restart           #gmetad node: gmetad service
    sudo /etc/init.d/apache2 restart          #gweb node: web service (includes gweb)
    

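    Then open http://<host IP>/ganglia, using the IP of the node where gweb is installed, to reach the web UI:

    Select a specific node to inspect it; you can see that the HDFS and YARN metrics are already being collected:


    Select the Mobile tab; the display works well on mobile devices: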


    3. Deploy Nagios:

    1. For Nagios to send alert emails, first install sendmail:

    sudo apt-get install sendmail  
    sudo apt-get install sendmail-cf
    sudo apt-get install mailutils  
    sudo apt-get install sharutils
    #Run in a terminal:
    ps aux |grep sendmail
    #Output like the following means sendmail is installed and running
    root     20978  0.0  0.3   8300  1940 ?        Ss   06:34   0:00 sendmail: MTA: accepting connections          
    root     21711  0.0  0.1   3008   776 pts/0    S+   06:51   0:00 grep sendmail
    

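    Configure sendmail:

    #Open the sendmail configuration file /etc/mail/sendmail.mc
    vi  /etc/mail/sendmail.mc
    #Find the following line:
    DAEMON_OPTIONS(`Family=inet,  Name=MTA-v4, Port=smtp, Addr=127.0.0.1')dnl
    #Change Addr=127.0.0.1 to Addr=0.0.0.0 so that sendmail accepts connections on all interfaces:
    DAEMON_OPTIONS(`Family=inet,  Name=MTA-v4, Port=smtp, Addr=0.0.0.0')dnl
    
    #Generate the new configuration file:
    cd /etc/mail  
    mv sendmail.cf sendmail.cf~      #make a backup
    m4 sendmail.mc > sendmail.cf  #note the spaces on both sides of >
    #Edit sendmail.cf
    vi /etc/mail/sendmail.cf
    #Add the following line (note the trailing dot):
    Dj$w.
    
    #Edit /etc/hosts; otherwise sending mail is very slow, because sendmail
    #appends wl1 as the domain to the hostname wl1, forming the full name wl1.wl1,
    #and then reports that the domain cannot be found
    vi /etc/hosts
    x.x.x.x       wl1 wl1.localdomain wl1.wl1
    #Restart the sendmail service:
    service sendmail restart
    #Send a test mail and check whether it arrives:
    echo "test" | mail -s test xxx@xxx.com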
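
    The same delivery test can be scripted, which is handy when it has to be repeated after each configuration change. This is a minimal Python 3 sketch using only the standard library. The function names are illustrative, and it assumes the local sendmail accepts SMTP on localhost port 25 as configured above.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient, subject="test", body="test"):
    """Build a plain-text message equivalent to the mail command above."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_local_mta(msg, host="localhost", port=25):
    """Hand the message to the local MTA; raises an smtplib exception on failure."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.send_message(msg)
```

    A call such as send_via_local_mta(build_test_message("nagios@wl1.localdomain", "xxx@xxx.com")) would then stand in for the mail command; the sender address here is only a placeholder.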
    

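    2. Install Nagios:
    Follow the guide "Installing Nagios Core on Ubuntu 16.04".
    At the step that downloads the Nagios plugins, nagios-plugins-2.1.1 downloads very slowly from the official site, so download it first from the link below, then compile and install it:
    nagios-plugins-2.1.1 download

    Log in at http://<host IP>/nagios/ with the username nagiosadmin that was set during installation and its password, and the home page appears:

    3. The main point: how to have Nagios monitor Ganglia data and raise alerts based on thresholds:

    #Create a new plugin for monitoring Ganglia, check_ganglia.py
    cd /usr/local/nagios/libexec
    vi check_ganglia.py #with the following content: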
    #!/usr/bin/env python
    # -*- coding: UTF-8 -*-
    import sys
    import getopt
    import socket
    import xml.parsers.expat
    
    class GParser:
      def __init__(self, host, metric):
        self.inhost =0
        self.inmetric = 0
        self.value = None
        self.host = host
        self.metric = metric
    
      def parse(self, file):
        p = xml.parsers.expat.ParserCreate()
        # Bind the handlers to this instance instead of relying on a global named "parser"
        p.StartElementHandler = self.start_element
        p.EndElementHandler = self.end_element
        p.ParseFile(file)
        if self.value is None:
          raise Exception('Host/value not found')
        return float(self.value)
    
      def start_element(self, name, attrs):
        if name == "HOST":
          if attrs["NAME"]==self.host:
            self.inhost=1
        elif self.inhost==1 and name == "METRIC" and attrs["NAME"]==self.metric:
          self.value=attrs["VAL"]
    
      def end_element(self, name):
        if name == "HOST" and self.inhost==1:
          self.inhost=0
    
    def usage():
     print """Usage: check_ganglia \
    -h|--host= -m|--metric= -w|--warning= \
    -c|--critical= [-o|--opposite=] [-s|--server=] [-p|--port=] """
     sys.exit(3)
    
    if __name__ == "__main__":
    ##############################################################
     ganglia_host = 'x.x.x.x'  #change to the IP of your gmetad host
     ganglia_port = 8651
     host = None
     metric = None
     warning = None
     critical = None
     opposite = 0  ##extra parameter: inverts the comparison, i.e. alert when the actual value is less than or equal to the threshold
    
     try:
       options, args = getopt.getopt(sys.argv[1:],
         "h:m:w:c:o:s:p:",
         ["host=", "metric=", "warning=","critical=","opposite=", "server=","port="],
         )
     except getopt.GetoptError, err:
       print "check_gmond:", str(err)
       usage()
       sys.exit(3)
    
     for o, a in options:
       if o in ("-h", "--host"):
          host = a
       elif o in ("-m", "--metric"):
          metric = a
       elif o in ("-w", "--warning"):
          warning = float(a)
       elif o in ("-c", "--critical"):
          critical = float(a)
       elif o in ("-o", "--opposite"):
          opposite = int(a)
       elif o in ("-p", "--port"):
          ganglia_port = int(a)
       elif o in ("-s", "--server"):
          ganglia_host = a
    
    
     if critical == None or warning == None or metric == None or host ==None:
       usage()
       sys.exit(3)
    
     try:
       s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
       s.connect((ganglia_host,ganglia_port))
       parser = GParser(host, metric)
       value = parser.parse(s.makefile("r"))
       s.close()
     except Exception, err:
       #import pdb
       #pdb.set_trace()
       print "CHECKGANGLIA UNKNOWN: Error while getting value\"%s\"" % (err)
       sys.exit(3)
    
     if opposite == 1: ###1 inverts the comparison (alert when value <= threshold); 0 leaves it unchanged
          if value <= critical:
            print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
            sys.exit(2)
          elif value <= warning:
           print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
           sys.exit(1)
          else:
           print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
           sys.exit(0)
     else:
          if value >= critical:
            print "CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value)
            sys.exit(2)
          elif value >= warning:
              print "CHECKGANGLIA WARNING: %s is %.2f" % (metric, value)
              sys.exit(1)
          else:
            print "CHECKGANGLIA OK: %s is %.2f" % (metric, value)
            sys.exit(0)
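
    The plugin above is written in Python 2 syntax (print statements, "except X, err"). The part that matters for alerting is the threshold comparison at the end, so here it is isolated as a small Python 3 sketch that can be tested without a running gmetad. The function name evaluate and the constant names are illustrative; the return values are the exit codes the plugin uses (0 OK, 1 WARNING, 2 CRITICAL).

```python
OK, WARNING, CRITICAL = 0, 1, 2  # exit codes used by the plugin above


def evaluate(value, warning, critical, opposite=False):
    """Return the plugin exit code for a metric value.

    With opposite=False a high value is bad (for example load_one).
    With opposite=True a low value is bad (for example mem_free).
    """
    if opposite:
        if value <= critical:
            return CRITICAL
        if value <= warning:
            return WARNING
    else:
        if value >= critical:
            return CRITICAL
        if value >= warning:
            return WARNING
    return OK
```

    Note the ordering this implies for the arguments: in the normal direction the warning threshold is below the critical one, while with opposite enabled the warning threshold must be above the critical one.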
    
    

    Make the script executable:

    chmod 755 check_ganglia.py
    

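    Create a new file in the following directory. (Note: it is best not to put any comments in it. They may break things, and I have not had time to find out why.)

    #Create services.cfg under /usr/local/nagios/etc/objects/
    cd /usr/local/nagios/etc/objects/
    vi services.cfg #with the following content: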
    define host {
        use linux-server
        host_name wl1
        address   x.x.x.1
    }
    
    define host {
        use linux-server
        host_name wl2
        address   x.x.x.2
    }
    
    define host {
        use linux-server
        host_name wl3
        address   x.x.x.3
    }
    
    define hostgroup {
        hostgroup_name ganglia-servers
        alias   nagios server
        members *
    }
    
    define servicegroup {
      servicegroup_name ganglia-metrics
      alias Ganglia Metrics
    }
    
    define command {
      command_name check_ganglia
      command_line $USER1$/check_ganglia.py -h $HOSTNAME$ -m $ARG1$ -w $ARG2$ -c $ARG3$ -o $ARG4$
    }
    
    define service {
        use generic-service
        name ganglia-service
        hostgroup_name ganglia-servers
        service_groups ganglia-metrics
        notifications_enabled 1
        notification_interval 10
        register  0
    }
    
    define service{
            use                             ganglia-service
            service_description             Free memory
            check_command                   check_ganglia!mem_free!200!50!1
            contact_groups admins
    }
    
    define service{
            use                             ganglia-service
            service_description             load_one
            check_command                   check_ganglia!load_one!4!5!0
            contact_groups admins
    }
    define service{
            use                             ganglia-service
            service_description             disk_free
            check_command                   check_ganglia!disk_free!40!50!0
            contact_groups admins
    }
    define service{
            use                             ganglia-service
            service_description             yarn.NodeManagerMetrics.AvailableGB
            check_command                   check_ganglia!yarn.NodeManagerMetrics.AvailableGB!8!4!1
            contact_groups admins
    }
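
    Since every service definition follows the same pattern, adding many metrics by hand gets repetitive. The following Python 3 sketch renders the definitions from a list of tuples. The names SERVICE_TEMPLATE, render_service, and render_all are hypothetical helpers, not part of Nagios, and the output deliberately contains no comments, in line with the warning above.

```python
SERVICE_TEMPLATE = """define service {{
    use                 ganglia-service
    service_description {description}
    check_command       check_ganglia!{metric}!{warning}!{critical}!{opposite}
    contact_groups      admins
}}
"""


def render_service(metric, warning, critical, opposite=0, description=None):
    """Render one service definition for a Ganglia metric."""
    return SERVICE_TEMPLATE.format(
        description=description or metric,
        metric=metric,
        warning=warning,
        critical=critical,
        opposite=opposite,
    )


def render_all(services):
    """Render a list of (metric, warning, critical, opposite) tuples."""
    return "\n".join(render_service(*svc) for svc in services)
```

    For example, render_all([("mem_free", 200, 50, 1), ("load_one", 4, 5, 0)]) produces two blocks equivalent to the first two services above, which can be appended to services.cfg.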
    

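    Note that this services.cfg file is what makes Nagios pull data from Ganglia automatically. The more Ganglia metrics you define in it, the more Nagios displays. What I show here is only a template with a few simple examples; add more as needed.

    Set the ownership and permissions of the configuration file: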

    chown nagios:nagios services.cfg
    chmod 664 services.cfg
    

    Edit the main Nagios configuration file:

    vi /usr/local/nagios/etc/nagios.cfg
    #comment out the default localhost definition:
    #cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
    
    #added by wangliang for ganglia
    cfg_file=/usr/local/nagios/etc/objects/services.cfg
    

    Edit the configuration for sending alert emails:

    vi /usr/local/nagios/etc/objects/commands.cfg
    #Replace /bin/mail with mail in the following commands
    # 'notify-host-by-email' command definition
    define command{
            command_name    notify-host-by-email
            command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
            }
    
    # 'notify-service-by-email' command definition
    define command{
            command_name    notify-service-by-email
            command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
            }
    
    #Set the email addresses of the recipients:
    vi /usr/local/nagios/etc/objects/contacts.cfg
    ###############################################################################
    # CONTACTS.CFG - SAMPLE CONTACT/CONTACTGROUP DEFINITIONS
    #
    #
    # NOTES: This config file provides you with some example contact and contact
    #        group definitions that you can reference in host and service
    #        definitions.
    #
    #        You don't need to keep these definitions in a separate file from your
    #        other object definitions.  This has been done just to make things
    #        easier to understand.
    #
    ###############################################################################
    
    
    
    ###############################################################################
    ###############################################################################
    #
    # CONTACTS
    #
    ###############################################################################
    ###############################################################################
    
    # Just one contact defined by default - the Nagios admin (that's you)
    # This contact definition inherits a lot of default values from the 'generic-contact'
    # template which is defined elsewhere.
    
    define contact{
            contact_name                    nagiosadmin     ; Short name of user
            use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
            alias                           Nagios Admin        ; Full name of user
    
            email                           xxx1@xxx.com    ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
            }
    
    define contact{
            contact_name                    nagiosadmin2           ; Short name of user
            use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
            alias                           Nagios Admin2            ; Full name of user
    
            email                           xxx2@xxx.com     ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
            }
    
    define contact{
            contact_name                    nagiosadmin3             ; Short name of user
            use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
            alias                           Nagios Admin3            ; Full name of user
    
            email                          xxx3@xxx.com     ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
            }
    ###############################################################################
    ###############################################################################
    #
    # CONTACT GROUPS
    #
    ###############################################################################
    ###############################################################################
    
    # We only have one contact in this simple configuration file, so there is
    # no need to create more than one contact group.
    
    define contactgroup{
            contactgroup_name       admins
            alias                   Nagios Administrators
            members            *
            }
    

    Use the following command to verify the changes:

    /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
    Total Warnings: 0
    Total Errors:   0
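
    If the restart is scripted, the verification can act as a gate in front of it. The Python 3 sketch below extracts the two totals from the verification output shown above. The function names are illustrative, and the regular expression assumes the "Total Warnings:" and "Total Errors:" lines appear exactly as in that output.

```python
import re
import subprocess

NAGIOS_BIN = "/usr/local/nagios/bin/nagios"
NAGIOS_CFG = "/usr/local/nagios/etc/nagios.cfg"


def parse_verify_output(text):
    """Extract the warning and error totals; a missing line yields None."""
    counts = {}
    for key in ("Warnings", "Errors"):
        match = re.search(r"Total %s:\s*(\d+)" % key, text)
        counts[key.lower()] = int(match.group(1)) if match else None
    return counts


def config_is_clean():
    """Run nagios -v and report whether it found zero errors."""
    proc = subprocess.run([NAGIOS_BIN, "-v", NAGIOS_CFG],
                          capture_output=True, text=True)
    return parse_verify_output(proc.stdout)["errors"] == 0
```

    A deployment script could call config_is_clean() and only restart Nagios when it returns True.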
    

    Restart the services in this order:

    sudo /etc/init.d/ganglia-monitor restart  #all nodes
    sudo /etc/init.d/gmetad restart           #gmetad node
    sudo /etc/init.d/apache2 restart          #gweb node
    service nagios restart                    #nagios node
    service sendmail restart                  #nagios node
    

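    The final result looks like this:
    (Nagios is somewhat slow to collect data, so the service status may briefly show unknown or pending. It clears up after a moment; there is no need to worry.)

    Screenshot of a received alert email:

    The next step is to package these components as Docker images and schedule them with k8s. I will not go through that again here; my earlier articles cover the process.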

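    Things to note:

    • 1. If you want Ganglia Web to display the hostname of each node, configure the IP-to-hostname mappings in /etc/hosts on the gmetad node in advance. When Ganglia receives data from a node, it first looks up the hostname for that IP in hosts. If none is found, the data is stored in the RRDs under the IP; if one is found, it is stored under that name. The web UI displays whichever name or IP is recorded in the RRDs.
    • 2. If you previously displayed nodes by IP and later want to switch to hostnames, clear the RRD contents first, and vice versa.
    • 3. Remember that the rrds directory must have these permissions:
      drwxr-xr-x nobody nogroup rrds/
      Otherwise the web page reports that the connection was refused.
    • 4. sendmail uses the SMTP protocol, but some companies run ESMTP servers, so alert emails sent with the sendmail setup described here may never arrive. I later switched to the sendEmail tool, which can use ESMTP. A test in the following form:
      sendEmail -f xxx@xxx.com -t xxx@xxx.com -s smtp.exmail.qq.com -xu xxx@xxxx.com -xp xxx -m "test"
      sent the mail successfully.
      Installation is simple; see the reference here.
      The commands in Nagios's /usr/local/nagios/etc/objects/commands.cfg need to be changed to match: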
    define command{
            command_name    notify-host-by-email
            command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/local/bin/sendEmail -f xxx@xxx.com -t $CONTACTEMAIL$ -s smtp.exmail.qq.com -u "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" -xu xxxx@xxx.com -xp xxxx
            }
    
    # 'notify-service-by-email' command definition
    define command{
            command_name    notify-service-by-email
            command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /usr/local/bin/sendEmail -f xxx@xxx.com -t $CONTACTEMAIL$ -s smtp.exmail.qq.com -u "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" -xu xxx@xxx.com -xp xxx
            }
    


        Article link: https://www.haomeiwen.com/subject/ipqwhxtx.html