Monitoring a Hadoop Cluster with Ganglia and Sending Alert Emails with Nagios

Introduction

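Ganglia: an open-source cluster monitoring project started at UC Berkeley and designed to measure thousands of nodes. Its core consists of gmond, gmetad, and a web frontend. It is mainly used to monitor system performance, such as CPU, memory, and disk utilization, I/O load, network traffic, and system load. The graphs make it easy to see the working state of each node, which plays an important role in tuning and allocating system resources and improving overall system performance.
More importantly, HDFS, YARN, HBase, and others already support sending the resource metrics of their daemons to Ganglia for monitoring.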

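Nagios: an open-source system and network monitoring tool. It can effectively monitor the state of Windows, Linux, and Unix hosts, network devices such as switches and routers, printers, and so on. Most usefully, when a system or service enters an abnormal state it notifies the operations staff right away by email or SMS, and it sends a recovery notice by email or SMS once the state returns to normal.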

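The reasoning behind our architecture:

  1. Ganglia's strengths are real-time monitoring data and a rich graphical interface, and it supports mobile devices well, but its alerting when something goes wrong is relatively weak.

  2. Nagios's strength is powerful alerting when a problem occurs and when it is resolved, but it is weaker at real-time monitoring and graphical display, and it handles large clusters poorly.

  3. We need to monitor the resource usage of the Hadoop clusters (HDFS, YARN) that our data platform supports.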

So we combine the three. The architecture is as follows:

[Figure: monitoring architecture diagram]

Versions used: Ubuntu 16.04 LTS, Ganglia 3.6.1, Nagios 4.1.1, Hadoop 2.7.3

Deploying Ganglia:

On the node that will serve the web UI, install:

sudo apt-get update
sudo apt install apache2 php libapache2-mod-php 
sudo apt-get install rrdtool
sudo apt-get install gmetad ganglia-webfrontend
# If a dialog about restarting apache2 appears during installation, choose yes

On every node to be monitored, install:

sudo apt-get update
sudo apt install php libapache2-mod-php 
sudo apt-get install ganglia-monitor
# If a dialog about restarting apache2 appears during installation, choose yes

Perform the following steps on the master node:

# Copy the Ganglia webfrontend Apache configuration:
sudo cp /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled/ganglia.conf

# Edit the gmetad configuration file
sudo vi /etc/ganglia/gmetad.conf
# Change the data source from data_source "my cluster" localhost to:
data_source "bigdata cluster" 10  wl1:8649 wl2:8649 wl3:8649
setuid_username "nobody"
gridname "bigdata cluster"
case_sensitive_hostnames 1
all_trusted on

# On the master node, run:
sudo ln -s /usr/share/ganglia-webfrontend/ /var/www/ganglia

Perform the following steps on every monitored node:

# Edit the gmond configuration file
sudo vi /etc/ganglia/gmond.conf
globals {
  daemonize = yes
  setuid = yes
  user = ganglia
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  host_dmax = 0 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
  send_metadata_interval = 10
}
/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
 * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
 * NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "bigdata cluster"
  owner = "ganglia"
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "wl1"  # each node uses its own hostname
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  #mcast_join = 239.2.11.71
  host = wl1  # every node points at the gmetad host
  port = 8649
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  #mcast_join = 239.2.11.71
  port = 8649
  #bind = 239.2.11.71
}


/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
}
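
To check that a gmond really is serving cluster state on the tcp_accept_channel configured above, one option is to read the XML it emits on port 8649 and list the hosts it knows about. The following is a minimal Python 3 sketch, not part of the original setup. It assumes the dump contains HOST elements with a NAME attribute that wrap METRIC elements, which is the same layout the check_ganglia.py plugin later in this article relies on. The helper names fetch_xml and summarize, and the IP addresses in the sample, are made up for illustration.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host, port=8649, timeout=5.0):
    """Read the whole XML dump that gmond writes when a client connects."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the peer closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def summarize(xml_bytes):
    """Return {host name: number of metrics} for every HOST element."""
    root = ET.fromstring(xml_bytes)
    return {h.get("NAME"): len(h.findall("METRIC")) for h in root.iter("HOST")}


# Hand-written sample in the assumed layout (addresses are invented),
# so the parser can be tried without a running cluster.
SAMPLE = b"""<GANGLIA_XML VERSION="3.6.1" SOURCE="gmond">
<CLUSTER NAME="bigdata cluster">
<HOST NAME="wl1" IP="10.0.0.1"><METRIC NAME="load_one" VAL="0.10"/></HOST>
<HOST NAME="wl2" IP="10.0.0.2"><METRIC NAME="load_one" VAL="0.20"/>
<METRIC NAME="mem_free" VAL="123456"/></HOST>
</CLUSTER>
</GANGLIA_XML>"""

print(summarize(SAMPLE))
```

Against a live node, summarize(fetch_xml("wl1")) should list every host that reports to that gmond; a host missing from the result usually points at its udp_send_channel settings.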

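Collecting HDFS and YARN metrics from the Hadoop cluster:

Perform the following steps on every node of the Hadoop cluster:

# Edit hadoop-metrics2.properties
vi hadoop-2.7.3/etc/hadoop/hadoop-metrics2.properties
# Comment out all of the existing content and add the following: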
*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
*.sink.ganglia.period=10

*.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
*.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40

namenode.sink.ganglia.servers=wl1:8649
resourcemanager.sink.ganglia.servers=wl1:8649

datanode.sink.ganglia.servers=wl1:8649
nodemanager.sink.ganglia.servers=wl1:8649

jobhistoryserver.sink.ganglia.servers=wl1:8649

maptask.sink.ganglia.servers=wl1:8649
reducetask.sink.ganglia.servers=wl1:8649
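
The per-daemon lines above all follow the same prefix.sink.ganglia.servers pattern, so they can be generated rather than typed by hand. Here is a small Python 3 sketch of that idea; the function name ganglia_sink_lines is hypothetical, and the slope and dmax lines are left out for brevity.

```python
def ganglia_sink_lines(server, daemons, period=10):
    """Build hadoop-metrics2.properties lines that route metrics to Ganglia."""
    lines = [
        "*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31",
        "*.sink.ganglia.period=%d" % period,
    ]
    # One "servers" line per daemon prefix, all pointing at the same gmond.
    for daemon in daemons:
        lines.append("%s.sink.ganglia.servers=%s" % (daemon, server))
    return lines


# The daemon prefixes used in this article.
DAEMONS = ["namenode", "resourcemanager", "datanode", "nodemanager",
           "jobhistoryserver", "maptask", "reducetask"]

lines = ganglia_sink_lines("wl1:8649", DAEMONS)
print("\n".join(lines))
```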

Restart the Hadoop cluster, then restart gmond, gmetad, and gweb:

hadoop-2.7.3/sbin/stop-all.sh
hadoop-2.7.3/sbin/start-all.sh
sudo /etc/init.d/ganglia-monitor restart   # all nodes: gmond service
sudo /etc/init.d/gmetad restart            # gmetad node: gmetad service
sudo /etc/init.d/apache2 restart           # gweb node: web service (includes gweb)

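You can then open the web UI at http://host-IP/ganglia, where host-IP is the address of the node running gweb:

[Screenshot: Ganglia web UI]

Select a specific node to look at it in detail; you can see that the HDFS and YARN metrics are already being collected:

[Screenshots: node view with HDFS and YARN metrics]

Select the Mobile tab to see that mobile devices are well supported:

[Screenshot: Mobile tab]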

Deploying Nagios:

  • For Nagios to send alert emails, first install sendmail:
sudo apt-get install sendmail  
sudo apt-get install sendmail-cf
sudo apt-get install mailutils  
sudo apt-get install sharutils
# In a terminal, run:
ps aux |grep sendmail
# Output like the following means sendmail is installed and running
root     20978  0.0  0.3   8300  1940 ?        Ss   06:34   0:00 sendmail: MTA: accepting connections          
root     21711  0.0  0.1   3008   776 pts/0    S+   06:51   0:00 grep sendmail
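  • Configure sendmail:
# Open the sendmail configuration file /etc/mail/sendmail.mc
vi  /etc/mail/sendmail.mc
# Find the following line:
DAEMON_OPTIONS(`Family=inet,  Name=MTA-v4, Port=smtp, Addr=127.0.0.1')dnl
# Change Addr=127.0.0.1 to Addr=0.0.0.0 so that sendmail accepts connections on all interfaces rather than only on loopback.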
DAEMON_OPTIONS(`Family=inet,  Name=MTA-v4, Port=smtp, Addr=0.0.0.0')dnl

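# Generate the new configuration file:
cd /etc/mail
mv sendmail.cf sendmail.cf~      # keep a backup
m4 sendmail.mc > sendmail.cf     # note the spaces on both sides of >
# Edit sendmail.cf
vi /etc/mail/sendmail.cf
# Add:
Dj$w. # note the trailing dot

# Edit /etc/hosts; otherwise sending mail is very slow, because sendmail
# appends wl1 as the domain to the hostname wl1 and tries to reach the
# resulting full name wl1.wl1, then reports that the domain cannot be found
vi /etc/hosts
x.x.x.x       wl1 wl1.localdomain wl1.wl1
# Restart the sendmail service:
service sendmail restart
# Send a test email and check that it arrives:
echo "test" | mail -s test xxx@xxx.com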
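
The same test can be run from Python, which is handy if you want to script it later. This is only a sketch in Python 3 using the standard smtplib and email modules. It assumes the local sendmail is listening on localhost port 25, and the sender address nagios@wl1.localdomain is a placeholder I made up; the helper names are hypothetical as well.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender, recipient, subject="test", body="test"):
    """Create a plain-text message like the one sent by the mail command."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_via_local_mta(msg, host="localhost", port=25):
    """Hand the message to the MTA on this machine (assumed to be sendmail)."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.send_message(msg)


# Building the message needs no network access; sending is a separate call.
msg = build_test_message("nagios@wl1.localdomain", "xxx@xxx.com")
print(msg["Subject"], "->", msg["To"])
```

Calling send_via_local_mta(msg) on the Nagios node then performs the actual delivery attempt.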
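  • Install Nagios:

Follow the guide "Installing Nagios Core on Ubuntu 16.04".

In the step that downloads the Nagios plugins, nagios-plugins-2.1.1 downloads very slowly from the official site, so download it from the link below first, then compile and install it.

nagios-plugins-2.1.1 download:
http://download.csdn.net/download/u014722463/9288011

Log in at http://host-IP/nagios/ with the user name nagiosadmin and the password set during installation, and you will see the home page:

[Screenshot: Nagios home page]

  • The main point is how to have Nagios monitor Ganglia data and raise alerts based on thresholds:
# Create a new plugin for monitoring Ganglia, check_ganglia.py
cd /usr/local/nagios/libexec
vi check_ganglia.py # with the following content: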
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import sys
import getopt
import socket
import xml.parsers.expat


class GParser:
    """Scans the gmetad XML dump for one metric of one host."""

    def __init__(self, host, metric):
        self.inhost = 0
        self.inmetric = 0
        self.value = None
        self.host = host
        self.metric = metric

    def parse(self, file):
        p = xml.parsers.expat.ParserCreate()
        # Use this instance's handlers, not the global "parser" object.
        p.StartElementHandler = self.start_element
        p.EndElementHandler = self.end_element
        p.ParseFile(file)
        if self.value is None:
            raise Exception('Host/value not found')
        return float(self.value)

    def start_element(self, name, attrs):
        if name == "HOST":
            if attrs["NAME"] == self.host:
                self.inhost = 1
        elif self.inhost == 1 and name == "METRIC" and attrs["NAME"] == self.metric:
            self.value = attrs["VAL"]

    def end_element(self, name):
        if name == "HOST" and self.inhost == 1:
            self.inhost = 0


def usage():
    print("""Usage: check_ganglia
-h|--host= -m|--metric= -w|--warning=
-c|--critical= [-o|--opposite=] [-s|--server=] [-p|--port=] """)
    sys.exit(3)


if __name__ == "__main__":
    ganglia_host = 'x.x.x.x'  # change to the IP of your gmetad host
    ganglia_port = 8651
    host = None
    metric = None
    warning = None
    critical = None
    # Extra flag: 1 inverts the comparison, so the check alerts when the
    # actual value is less than or equal to the threshold.
    opposite = 0

    try:
        options, args = getopt.getopt(
            sys.argv[1:],
            "h:m:w:c:o:s:p:",
            ["host=", "metric=", "warning=", "critical=", "opposite=", "server=", "port="],
        )
    except getopt.GetoptError as err:
        print("check_gmond: %s" % err)
        usage()

    for o, a in options:
        if o in ("-h", "--host"):
            host = a
        elif o in ("-m", "--metric"):
            metric = a
        elif o in ("-w", "--warning"):
            warning = float(a)
        elif o in ("-c", "--critical"):
            critical = float(a)
        elif o in ("-o", "--opposite"):
            opposite = int(a)
        elif o in ("-p", "--port"):
            ganglia_port = int(a)
        elif o in ("-s", "--server"):
            ganglia_host = a

    if critical is None or warning is None or metric is None or host is None:
        usage()

    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((ganglia_host, ganglia_port))
        parser = GParser(host, metric)
        # Binary mode, because expat's ParseFile reads bytes.
        value = parser.parse(s.makefile("rb"))
        s.close()
    except Exception as err:
        print('CHECKGANGLIA UNKNOWN: Error while getting value "%s"' % err)
        sys.exit(3)

    # opposite == 1: inverted, low values are bad; opposite == 0: high values are bad.
    if opposite == 1:
        if value <= critical:
            print("CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value))
            sys.exit(2)
        elif value <= warning:
            print("CHECKGANGLIA WARNING: %s is %.2f" % (metric, value))
            sys.exit(1)
        else:
            print("CHECKGANGLIA OK: %s is %.2f" % (metric, value))
            sys.exit(0)
    else:
        if value >= critical:
            print("CHECKGANGLIA CRITICAL: %s is %.2f" % (metric, value))
            sys.exit(2)
        elif value >= warning:
            print("CHECKGANGLIA WARNING: %s is %.2f" % (metric, value))
            sys.exit(1)
        else:
            print("CHECKGANGLIA OK: %s is %.2f" % (metric, value))
            sys.exit(0)
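
The threshold logic is the part of the plugin that is easiest to get wrong, so it helps to look at it in isolation. The sketch below (Python 3) assumes the intended behavior is the one described by the opposite flag: with opposite set to 1 a value at or below the threshold is bad, otherwise a value at or above it is bad. The function name classify is mine, and the return values are the exit codes the plugin uses (0 OK, 1 WARNING, 2 CRITICAL).

```python
OK, WARNING, CRITICAL = 0, 1, 2


def classify(value, warning, critical, opposite=0):
    """Map a metric value to an exit code using the plugin's thresholds."""
    if opposite == 1:
        # Inverted check: smaller values are worse (for example free memory).
        if value <= critical:
            return CRITICAL
        if value <= warning:
            return WARNING
        return OK
    # Normal check: larger values are worse (for example load average).
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


# Thresholds taken from the service definitions used in this article.
print(classify(30, 200, 50, opposite=1))    # mem_free far below both thresholds
print(classify(4.5, 4, 5, opposite=0))      # load_one between warning and critical
```

Note that for an inverted check the warning threshold must be larger than the critical one (200 and 50 above), while for a normal check it is the other way round (4 and 5).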

Make the script readable and executable:

 chmod 755 check_ganglia.py

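Create a new file in the following directory. (Note: it is best not to put comments in it, since they may break things; I have not had time to find out why.)

# Create services.cfg under /usr/local/nagios/etc/objects/
cd /usr/local/nagios/etc/objects/
vi services.cfg # with the following content: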
define host {
    use linux-server
    host_name wl1
    address   x.x.x.1
}

define host {
    use linux-server
    host_name wl2
    address   x.x.x.2
}

define host {
    use linux-server
    host_name wl3
    address   x.x.x.3
}

define hostgroup {
    hostgroup_name ganglia-servers
    alias   nagios server
    members *
}

define servicegroup {
  servicegroup_name ganglia-metrics
  alias Ganglia Metrics
}

define command {
  command_name check_ganglia
  command_line $USER1$/check_ganglia.py -h $HOSTNAME$ -m $ARG1$ -w $ARG2$ -c $ARG3$ -o $ARG4$
}

define service {
    use generic-service
    name ganglia-service
    hostgroup_name ganglia-servers
    service_groups ganglia-metrics
    notifications_enabled 1
    notification_interval 10
    register  0
}

define service{
        use                             ganglia-service
        service_description             mem_free
        check_command                   check_ganglia!mem_free!200!50!1
        contact_groups admins
}

define service{
        use                             ganglia-service
        service_description             load_one
        check_command                   check_ganglia!load_one!4!5!0
        contact_groups admins
}
define service{
        use                             ganglia-service
        service_description             disc_free
        check_command                   check_ganglia!disk_free!40!50!0
        contact_groups admins
}
define service{
        use                             ganglia-service
        service_description             yarn.NodeManagerMetrics.AvailableGB
        check_command                   check_ganglia!yarn.NodeManagerMetrics.AvailableGB!8!4!1
        contact_groups admins
}

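Note that this services.cfg file is what makes Nagios fetch data from Ganglia automatically. The more Ganglia metrics you define in it, the more Nagios displays. This is only a template with a few simple metrics as examples; add more as needed.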
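
If you end up watching many metrics, the service blocks become repetitive, and generating them is one way to keep them consistent. The following Python 3 sketch prints blocks in the same shape as the ones above. The function name service_block is hypothetical, and the output still has to be pasted into services.cfg and verified by Nagios.

```python
def service_block(metric, warning, critical, opposite, description=None):
    """Render one 'define service' block that calls check_ganglia."""
    return "\n".join([
        "define service {",
        "    use ganglia-service",
        "    service_description %s" % (description or metric),
        "    check_command check_ganglia!%s!%s!%s!%s"
        % (metric, warning, critical, opposite),
        "    contact_groups admins",
        "}",
    ])


# (metric, warning, critical, opposite) tuples taken from this article.
CHECKS = [
    ("mem_free", 200, 50, 1),
    ("load_one", 4, 5, 0),
    ("yarn.NodeManagerMetrics.AvailableGB", 8, 4, 1),
]

print("\n\n".join(service_block(*check) for check in CHECKS))
```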

Set the owner and permissions of the configuration file:

chown nagios:nagios services.cfg
chmod 664 services.cfg

Edit the main Nagios configuration file:

vi /usr/local/nagios/etc/nagios.cfg
#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg

#add by wangliang for ganglia
cfg_file=/usr/local/nagios/etc/objects/services.cfg

Edit the configuration related to sending alert emails:

vi /usr/local/nagios/etc/objects/commands.cfg
# Replace /bin/mail with mail
# 'notify-host-by-email' command definition
define command{
        command_name    notify-host-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
        }

# 'notify-service-by-email' command definition
define command{
        command_name    notify-service-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
        }
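
To get a feel for what the notification body looks like, it can help to mimic what happens to the template: Nagios fills in the $NAME$ macros, and printf "%b" then expands backslash escapes such as \n into real line breaks. The Python 3 sketch below imitates just those two steps on a shortened template. The function name render and the macro values are invented for the example, and it does not claim to reproduce Nagios's own macro handling.

```python
def render(template, macros):
    """Fill in $NAME$ macros, then turn backslash-n escapes into newlines."""
    for name, value in macros.items():
        template = template.replace("$%s$" % name, value)
    return template.replace("\\n", "\n")


# Shortened version of the host notification template (raw string keeps "\n" literal).
TEMPLATE = r"Notification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\n"

text = render(TEMPLATE, {
    "NOTIFICATIONTYPE": "PROBLEM",  # example values, not real alert data
    "HOSTNAME": "wl2",
    "HOSTSTATE": "DOWN",
})
print(text)
```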

# Set the contacts and the email addresses that alerts are sent to:
vi /usr/local/nagios/etc/objects/contacts.cfg
###############################################################################
# CONTACTS.CFG - SAMPLE CONTACT/CONTACTGROUP DEFINITIONS
#
#
# NOTES: This config file provides you with some example contact and contact
#        group definitions that you can reference in host and service
#        definitions.
#
#        You don't need to keep these definitions in a separate file from your
#        other object definitions.  This has been done just to make things
#        easier to understand.
#
###############################################################################



###############################################################################
###############################################################################
#
# CONTACTS
#
###############################################################################
###############################################################################

# Just one contact defined by default - the Nagios admin (that's you)
# This contact definition inherits a lot of default values from the 'generic-contact'
# template which is defined elsewhere.

define contact{
        contact_name                    nagiosadmin             ; Short name of user
        use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin            ; Full name of user

        email                           xxx1@xxx.com    ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }

define contact{
        contact_name                    nagiosadmin2            ; Short name of user
        use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin2           ; Full name of user

        email                           xxx2@xxx.com    ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }

define contact{
        contact_name                    nagiosadmin3            ; Short name of user
        use                             generic-contact         ; Inherit default values from generic-contact template (defined above)
        alias                           Nagios Admin3           ; Full name of user

        email                           xxx3@xxx.com    ; <<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
        }
###############################################################################
###############################################################################
#
# CONTACT GROUPS
#
###############################################################################
###############################################################################

# We only have one contact in this simple configuration file, so there is
# no need to create more than one contact group.

define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members            *
        }

Use the following command to verify that the changes are valid:

/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Total Warnings: 0
Total Errors:   0
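
If you want to run this verification from a script, for example before every restart, the two totals are easy to pick out of the output. This is a minimal Python 3 sketch under the assumption that the summary lines look exactly like the two shown above; parse_totals and run_check are hypothetical helper names, and run_check is only defined here, not executed.

```python
import re
import subprocess

NAGIOS_BIN = "/usr/local/nagios/bin/nagios"
NAGIOS_CFG = "/usr/local/nagios/etc/nagios.cfg"


def parse_totals(output):
    """Extract the 'Total Warnings' and 'Total Errors' counts from the output."""
    totals = {}
    for label in ("Warnings", "Errors"):
        match = re.search(r"Total %s:\s*(\d+)" % label, output)
        if match:
            totals[label.lower()] = int(match.group(1))
    return totals


def run_check():
    """Run the verification command and return its parsed totals."""
    result = subprocess.run([NAGIOS_BIN, "-v", NAGIOS_CFG],
                            capture_output=True, text=True)
    return parse_totals(result.stdout)


# Offline demonstration on the two summary lines quoted in this article.
sample = "Total Warnings: 0\nTotal Errors:   0\n"
print(parse_totals(sample))
```

A wrapper could then refuse to restart Nagios unless the errors count is zero.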

Restart the services in this order:

sudo /etc/init.d/ganglia-monitor restart  # all nodes
sudo /etc/init.d/gmetad restart           # gmetad node
sudo /etc/init.d/apache2 restart          # gweb node
service nagios restart                    # Nagios node
service sendmail restart                  # Nagios node

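The final result looks like this:
(Nagios collects data somewhat slowly, so the service status may briefly show as unknown or pending. It clears up after a little while, so there is no need to worry.)

[Screenshot: Nagios service status page]

An example of a received alert email:

[Screenshot: alert email]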

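The next step is to package these components as Docker images and schedule them with k8s. I will not go into the details again; my earlier articles cover what is needed.

Things to watch out for:

  1. If you want the Ganglia web UI to show each node's hostname, configure the IP-to-hostname mappings in /etc/hosts on the gmetad node beforehand. When Ganglia receives data from a node, it first looks up the hostname for the IP in hosts. If there is none, the data is stored in rrd under the IP; if there is one, it is stored under that name. The web UI displays whatever name or IP is recorded in rrd.

  2. If you previously displayed by IP and later want to switch to hostname, you must first clear the rrd contents, and vice versa.

  3. Remember that the rrd directory's permissions must be:
    drwxr-xr-x nobody nogroup rrds/
    otherwise the web page reports that the connection was refused.

  4. sendmail uses the SMTP protocol, while some companies run servers that use ESMTP, so alert emails sent with sendmail as described here may never reach the mailbox. I later switched to the sendEmail tool, which can use ESMTP. Testing with a command of the following form succeeded:
    sendEmail -f xxx@xxx.com -t xxx@xxx.com -s smtp.exmail.qq.com -xu xxx@knownsec.com -xp xxx -m "test"
    Installation is simple; see http://www.linuxidc.com/Linux/2011-12/49699.htm
    The commands in Nagios's /usr/local/nagios/etc/objects/commands.cfg must be changed to match: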

define command{
        command_name    notify-host-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/local/bin/sendEmail -f xxx@xxx.com -t $CONTACTEMAIL$ -s smtp.exmail.qq.com -u "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" -xu xxxx@xxx.com -xp xxxx
        }

# 'notify-service-by-email' command definition
define command{
        command_name    notify-service-by-email
        command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /usr/local/bin/sendEmail -f xxx@xxx.com -t $CONTACTEMAIL$ -s smtp.exmail.qq.com -u "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" -xu xxx@xxx.com -xp xxx
        }