
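OpenResty (Nginx Lua): collecting website access statistics

Background

An earlier post, "OpenResty (Nginx Lua): counting status codes, average response time and traffic per domain", implemented statistics for a domain's status codes, average response time and traffic. That method, however, could not report the specific URLs once the number of 404, 500 or other status codes for a domain passed a threshold, which is what lets us locate a problem quickly. Until now we provided that feature by analyzing the site logs. To remove the dependency on the logs and to improve performance, we have tried implementing this feature in nginx Lua as well. Usage is the same as in the earlier post; only the two Lua scripts have been updated.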

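Usage

1. Get the number of 404 responses for the domain devops.webres.wang

curl -s "localhost/domain_status?count=status&host=devops.webres.wang&status=404"

Output:
10 688
The first column is the number of responses with that status code; the second is the total number of requests for the domain.

2. When devops.webres.wang has more than 50 responses with status 404, print the top 10 URLs

curl -s "localhost/domain_status?count=statusUrl&host=devops.webres.wang&status=404&exceed=50&output=10"

Output:
/hello-world 90
/centos 10
The first column is the URL; the second is the number of requests for that URL.

3. Get the average upstream time of devops.webres.wang over the last minute

curl -s "localhost/domain_status?count=upT&host=devops.webres.wang"

Output:
0.02 452
The first column is the average upstream time; the second is the total number of requests for the domain.

4. When the average upstream time of devops.webres.wang exceeds 0.5 seconds, print the URLs involved

curl -s "localhost/domain_status?count=upTUrl&host=devops.webres.wang&exceed=0.5"

Output:
/hello.php 0.82 52
The first column is the URL, the second is the average time for that URL, and the third is the number of requests for that URL. Monitoring this interface makes it quick to find out exactly which URLs have become slow.

5. Get the average request time of devops.webres.wang

curl -s "localhost/domain_status?count=reqT&host=devops.webres.wang"

Output:
1.82 52
The first column is the average time; the second is the number of requests for the domain. Request time is the time needed to complete the whole request, including sending the data to the user's browser. For PHP requests, upstream time is the time from nginx passing the request to FastCGI until it has finished receiving the response. For a request served through an upstream, request time is therefore normally at least as large as upstream time.

6. Get the bandwidth used by devops.webres.wang (in bytes per second)

curl -s "localhost/domain_status?count=flow&host=devops.webres.wang"

Output:
1024 52
The first column is the domain's average transfer rate over the last minute, in bytes per second; the second is the total number of requests for the domain.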
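The updated Lua scripts are not reproduced in this post, so the following is only a rough Python sketch of what the exceed and output parameters of the statusUrl query appear to mean from the description above. The function name top_urls and the sample data are hypothetical, and whether the real interface treats the threshold as "greater than" or "at least" is an assumption.

```python
from collections import Counter


def top_urls(requests, status, exceed, output):
    """Return the `output` most requested URLs for `status`.

    `requests` is an iterable of (url, status_code) pairs. The list is only
    returned when the number of matching requests is greater than `exceed`;
    otherwise an empty list is returned.
    """
    counts = Counter(url for url, code in requests if code == status)
    if sum(counts.values()) <= exceed:
        return []
    return counts.most_common(output)


# Hypothetical sample: 100 requests answered with 404, plus one 200.
sample = [("/hello-world", 404)] * 90 + [("/centos", 404)] * 10 + [("/", 200)]

for url, hits in top_urls(sample, status=404, exceed=50, output=10):
    print(url, hits)
```

With this sample the loop prints the same two lines as the example output of item 2.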

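OpenResty (Nginx Lua): counting status codes, average response time and traffic per domain

Background

Previously we collected each domain's status codes, average response time and traffic as follows: a scheduled script on every machine extracted each domain's access log for the last minute into a temporary file, and Zabbix then computed the statistics from that one-minute file. This ran well until the load on one server suddenly rose. iotop showed that the script extracting the last minute of logs was generating very heavy I/O, and the load returned to normal once the scheduled job was stopped. We therefore decided to replace that approach with nginx Lua. The new method uses few resources while collecting statistics and works in real time.

Method

Collecting site statistics with nginx Lua works as follows (using the count of 404 responses for devops.webres.wang as the example).
Recording:

1. Define a shared dictionary named access, then get the current timestamp and the current domain, e.g. devops.webres.wang;
2. Build the dictionary key that stores the status code count: devops.webres.wang-404-<current timestamp>;
3. Increment the value of that key (devops.webres.wang-404-<current timestamp>) by 1;
4. Repeat steps 2 and 3 for every request.

Querying:
An interface sums the values of the keys devops.webres.wang-404-<timestamp>, for every timestamp from 60 seconds ago up to now, and returns the result.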
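The recording and querying steps above can be sketched outside nginx. The following Python class is only an in-memory stand-in for the shared dictionary, written to illustrate the one-key-per-second idea; the class name SecondBuckets and its methods are hypothetical and are not part of OpenResty.

```python
import time


class SecondBuckets:
    """In-memory stand-in for the shared dict: one counter per name and second."""

    def __init__(self, ttl=70):
        self.ttl = ttl      # seconds a bucket stays valid (the post uses 70)
        self.data = {}      # key -> (value, expiry timestamp)

    def incr(self, name, now=None):
        """Recording step: add 1 to the bucket for the current second."""
        now = int(time.time()) if now is None else now
        key = f"{name}-{now}"
        value, _ = self.data.get(key, (0, 0))
        self.data[key] = (value + 1, now + self.ttl)

    def sum_last(self, name, seconds=60, now=None):
        """Query step: sum the buckets from `seconds` ago up to now."""
        now = int(time.time()) if now is None else now
        total = 0
        for second in range(now - seconds, now + 1):
            value, expiry = self.data.get(f"{name}-{second}", (0, 0))
            if expiry > now:    # ignore buckets that would have expired
                total += value
        return total


buckets = SecondBuckets()
buckets.incr("devops.webres.wang-404")
print(buckets.sum_last("devops.webres.wang-404"))
```

Because every second has its own key, a query only has to read about sixty small counters, and old buckets simply expire instead of being cleaned up by hand.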

Implementation

nginx.conf settings

http {
[…]
lua_shared_dict access 10m;
log_by_lua_file conf/log_access.lua;
server {
[…]
location /domain_status {
default_type text/plain;
content_by_lua_file "conf/domain_status.lua";
}
[…]
}
[…]
}

log_access.lua

local access = ngx.shared.access
local host = ngx.var.host
local status = ngx.var.status
local body_bytes_sent = ngx.var.body_bytes_sent
local request_time = ngx.var.request_time
local timestamp = os.date("%s")
local expire_time = 70
local status_key = table.concat({host,"-",status,"-",timestamp})
local flow_key = table.concat({host,"-flow-",timestamp})
local req_time_key = table.concat({host,"-reqt-",timestamp})
local total_req_key = table.concat({host,"-total_req-",timestamp})
-- count total req
local total_req_sum = access:get(total_req_key) or 0
total_req_sum = total_req_sum + 1
access:set(total_req_key, total_req_sum, expire_time)
-- count status
local status_sum = access:get(status_key) or 0
status_sum = status_sum + 1
access:set(status_key, status_sum, expire_time)
-- count flow
local flow_sum = access:get(flow_key) or 0
flow_sum = flow_sum + body_bytes_sent
access:set(flow_key, flow_sum, expire_time)
-- count request time
local req_sum = access:get(req_time_key) or 0
req_sum = req_sum + request_time
access:set(req_time_key, req_sum, expire_time)

domain_status.lua

local access = ngx.shared.access
local args = ngx.req.get_uri_args()
local count = args["count"]
local host = args["host"]
local status = args["status"]
local one_minute_ago = tonumber(os.date("%s")) - 60
local now = tonumber(os.date("%s"))
local status_total = 0
local flow_total = 0
local reqt_total = 0
local req_total = 0
if not host then
ngx.print("host arg not found.")
ngx.exit(ngx.HTTP_OK)
end
if count == "status" and not status then
ngx.print("status arg not found.")
ngx.exit(ngx.HTTP_OK)
end
if not (count == "status" or count == "flow" or count == "reqt") then
ngx.print("count arg invalid.")
ngx.exit(ngx.HTTP_OK)
end
for second_num=one_minute_ago,now do
local flow_key = table.concat({host,"-flow-",second_num})
local req_time_key = table.concat({host,"-reqt-",second_num})
local total_req_key = table.concat({host,"-total_req-",second_num})
if count == "status" then
local status_key = table.concat({host,"-",status,"-",second_num})
local status_sum = access:get(status_key) or 0
status_total = status_total + status_sum
elseif count == "flow" then
local flow_sum = access:get(flow_key) or 0
flow_total = flow_total + flow_sum
elseif count == "reqt" then
local req_sum = access:get(total_req_key) or 0
local req_time_sum = access:get(req_time_key) or 0
reqt_total = reqt_total + req_time_sum
req_total = req_total + req_sum
end
end
if count == "status" then
ngx.print(status_total)
elseif count == "flow" then
ngx.print(flow_total)
elseif count == "reqt" then
-- average request time; stays 0 when there were no requests
local reqt_avg = 0
if req_total ~= 0 then
reqt_avg = reqt_total/req_total
end
ngx.print(reqt_avg)
end

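Usage notes

1. Get a domain's status code count
For example, the number of 404 responses for devops.webres.wang within the last minute:
request http://$host/domain_status?count=status&host=devops.webres.wang&status=404
2. Get a domain's traffic
request http://$host/domain_status?count=flow&host=devops.webres.wang
3. Get a domain's average response time within the last minute
request http://$host/domain_status?count=reqt&host=devops.webres.wang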
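A monitoring client has to build these URLs with the same rules that domain_status.lua enforces (count must be status, flow or reqt, and status is required when count=status). The helper below is a small sketch of that; the function name build_status_url is hypothetical, and the base address is whatever host serves the /domain_status location.

```python
from urllib.parse import urlencode


def build_status_url(base, count, host, status=None):
    """Build a /domain_status query URL, mirroring the script's argument checks."""
    if count not in ("status", "flow", "reqt"):
        raise ValueError("count must be status, flow or reqt")
    if count == "status" and status is None:
        raise ValueError("status is required when count=status")
    params = {"count": count, "host": host}
    if status is not None:
        params["status"] = status
    return f"{base}/domain_status?{urlencode(params)}"


print(build_status_url("http://localhost", "status", "devops.webres.wang", 404))
```

Validating on the client side avoids polling an interface that would only answer "count arg invalid." or "status arg not found.".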

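Using Zabbix to monitor multi-line logs by time

We currently want Zabbix to check an error log file every five minutes and send an alert email when errors appear. A standard access log, such as the nginx access log, has one entry per line and is easy to parse. Logs from Tomcat or GlassFish, however, can span several lines per entry, as in this example: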
[2015-07-17T14:24:04.552+0800] [glassfish 4.0] [SEVERE] [AS-WEB-CORE-00037] [javax.enterprise.web.core] [tid: _ThreadID=26 _ThreadName=http-listener-1(3)] [timeMillis: 1437114244552] [levelValue: 1000] [[
An exception or error occurred in the container during the request processing
java.lang.IllegalArgumentException
at org.glassfish.grizzly.http.util.CookieParserUtils.parseClientCookies(CookieParserUtils.java:353)
at org.glassfish.grizzly.http.util.CookieParserUtils.parseClientCookies(CookieParserUtils.java:336)
at org.glassfish.grizzly.http.Cookies.processClientCookies(Cookies.java:220)
at org.glassfish.grizzly.http.Cookies.get(Cookies.java:131)
at org.glassfish.grizzly.http.server.Request.parseCookies(Request.java:1911)
at org.glassfish.grizzly.http.server.Request.getCookies(Request.java:1505)
at org.apache.catalina.connector.Request.parseSessionCookiesId(Request.java:4077)
at org.apache.catalina.connector.CoyoteAdapter.postParseRequest(CoyoteAdapter.java:649)
at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:297)
]]
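Parsing this is more involved. The following script retrieves the log entries from the last five minutes:

#!/bin/bash
# Time five minutes ago, as HHMMSS
LAST_MINUTE=$(date -d '-5 minute' +%H%M%S)
# Number of log entries collected so far
LOG_NUM=0
# Maximum number of log entries to collect
MAX_LOG=3
# Accumulated matching log entries
LOG_CONTENT=""
# Whether the timestamp line of the current entry matched
LOG_DATE_MATCH=false
# Entry currently being assembled
log_entry=""
# Log file path
LOG_PATH="/data/log/glassfish/domain1/server.log"
while IFS= read -r line; do
    # A line starting with "[20" carries the timestamp and starts a new entry
    if echo "$line" | grep -q '^\[20'; then
        # Store the previous entry, if it matched
        if $LOG_DATE_MATCH; then
            LOG_CONTENT="${LOG_CONTENT}${log_entry}\n"
            ((LOG_NUM++))
        fi
        # Stop once the maximum number of entries has been collected
        if [[ "$LOG_NUM" -ge "$MAX_LOG" ]]; then
            LOG_DATE_MATCH=false
            break
        fi
        # Extract the time in HHMMSS form, e.g. 181320
        date_time=$(echo "$line" | grep -E -o "[0-9]{2}:[0-9]{2}:[0-9]{2}" | head -n 1 | tr -d ':')
        # Is this entry newer than five minutes ago? 10# forces base 10,
        # so values with a leading zero are not read as octal
        if [[ "10#$date_time" -gt "10#$LAST_MINUTE" ]]; then
            LOG_DATE_MATCH=true
            log_entry="$line"
        else
            LOG_DATE_MATCH=false
        fi
    elif $LOG_DATE_MATCH; then
        # Continuation lines are kept only when the entry's timestamp matched
        log_entry="${log_entry}\n${line}"
    fi
done < "$LOG_PATH"
# Store the last entry
if $LOG_DATE_MATCH; then
    LOG_CONTENT="${LOG_CONTENT}${log_entry}\n"
fi
# Print all collected entries
echo -n -e "$LOG_CONTENT"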

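The script above reads the file from the beginning, which becomes very slow when the log file is large. The following approach, which reads the log backwards, is more efficient and is the recommended one.

#!/bin/bash
# Time five minutes ago, as HHMMSS
LAST_MINUTE=$(date -d '-5 minute' +%H%M%S)
# Number of log entries collected so far
LOG_NUM=0
# Maximum number of log entries to collect
MAX_LOG=3
# Accumulated matching log entries
LOG_CONTENT=""
# Lines of the entry currently being assembled
log_entry=""
# Log file path
LOG_PATH="/data/log/glassfish/domain1/server.log"
while IFS= read -r line; do
    # A line starting with "[20" carries the timestamp and is the first line of an entry
    if echo "$line" | grep -q '^\[20'; then
        # Extract the time in HHMMSS form, e.g. 181320
        date_time=$(echo "$line" | grep -E -o "[0-9]{2}:[0-9]{2}:[0-9]{2}" | head -n 1 | tr -d ':')
        # Is this entry newer than five minutes ago?
        # Fixed-width HHMMSS strings compare correctly as strings
        if [[ "$date_time" > "$LAST_MINUTE" ]]; then
            ((LOG_NUM++))
            log_entry="$line\n$log_entry"
            LOG_CONTENT="$LOG_CONTENT\n$log_entry"
        else
            break
        fi
        log_entry=""
    else
        # The file is read backwards, so each continuation line is prepended
        log_entry="$line\n$log_entry"
    fi
    # Stop once the maximum number of entries has been collected
    if [[ "$LOG_NUM" -ge "$MAX_LOG" ]]; then
        break
    fi
done < <(tac "$LOG_PATH")
# Print all collected entries
echo -n -e "$LOG_CONTENT"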

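You can then add a Zabbix item that retrieves the log content, with a trigger that uses the expression {itemName.strlen(0)}#0 to check whether the retrieved content is non-empty. itemName is the name of the item.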
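The backwards-reading idea (collect continuation lines until the header line with the timestamp is reached, then decide whether to keep the entry) is easier to see without the shell quoting. The following is a Python sketch of the same logic, not a replacement for the script; the function name recent_entries and the sample lines are hypothetical, and like the shell version it compares times of day only, so it does not handle a window that crosses midnight.

```python
import re

TIME_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2})")


def recent_entries(lines, cutoff, max_entries=3):
    """Return multi-line log entries newer than `cutoff` (HHMMSS), newest first."""
    entries, pending = [], []
    for line in reversed(lines):
        pending.insert(0, line)            # rebuild the entry in file order
        if line.startswith("[20"):         # header line: the entry is complete
            match = TIME_RE.search(line)
            stamp = "".join(match.groups()) if match else ""
            if stamp <= cutoff:            # older than the cutoff: stop reading
                break
            entries.append("\n".join(pending))
            pending = []
            if len(entries) >= max_entries:
                break
    return entries


sample_lines = [
    "[2015-07-17T14:00:00.000+0800] [glassfish 4.0] [SEVERE] old entry [[",
    "]]",
    "[2015-07-17T14:24:04.552+0800] [glassfish 4.0] [SEVERE] [[",
    "java.lang.IllegalArgumentException",
    "]]",
]
for entry in recent_entries(sample_lines, cutoff="141900"):
    print(entry)
```

Reading from the end means the loop can stop at the first entry that is too old, which is why this variant stays fast on large files.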

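Monitoring Memcached, PHP-FPM, Tomcat, Nginx, MySQL and website logs with Zabbix

Zabbix is a very flexible monitoring tool that supports many data types, such as numeric (unsigned), numeric (float), log and text. All we need to do is collect the data with scripts; Zabbix then gathers it, draws graphs, and lets us set alert thresholds. This post covers monitoring Memcached, PHP-FPM, Tomcat, Nginx, MySQL and website logs with Zabbix.

Memcached monitoring

Custom key

UserParameter=memcached.stat[*],/data/sh/memcached-status.sh "$1"

The memcached-status.sh script:

#!/bin/bash
item=$1
ip=127.0.0.1
port=11211
(echo "stats";sleep 0.5) | telnet $ip $port 2>/dev/null | grep "STAT $item\b" | awk '{print $3}'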
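The grep and awk step relies on the "STAT <name> <value>" lines that the memcached stats command returns. As an illustration of that parsing step only, here is a Python sketch; the function name memcached_stat and the sample values are hypothetical, and fetching the text from the server is left out.

```python
def memcached_stat(stats_text, item):
    """Return the value of one STAT line from `stats` output, or None if absent."""
    for line in stats_text.splitlines():
        parts = line.split()
        # A stats line has the form: STAT <name> <value>
        if len(parts) == 3 and parts[0] == "STAT" and parts[1] == item:
            return parts[2]
    return None


# Hypothetical excerpt of the text returned by the `stats` command.
sample_stats = (
    "STAT curr_connections 10\r\n"
    "STAT total_connections 215\r\n"
    "STAT get_hits 42\r\n"
    "END\r\n"
)
print(memcached_stat(sample_stats, "get_hits"))
```

Matching the whole name field, rather than a substring, serves the same purpose as the \b word boundary in the grep pattern.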

Import the template

Download the memcached Zabbix template

PHP-FPM monitoring

Configure the php-fpm status page

Open the php-fpm.conf configuration file, add the following setting, then restart PHP:

pm.status_path = /fpm_status

Custom key

UserParameter=php-fpm[*],/data/sh/php-fpm-status.sh "$1"

The php-fpm-status.sh script:

#!/bin/bash
##################################
# Zabbix monitoring script
#
# php-fpm:
# - anything available via FPM status page
#
##################################
# Contact:
# [email protected]
##################################
# ChangeLog:
# 20100922 VV initial creation
##################################
# Zabbix requested parameter
ZBX_REQ_DATA="$1"
# FPM defaults
URL="http://localhost/fpm_status"
WGET_BIN="/usr/bin/wget"
#
# Error handling:
# - need to be displayable in Zabbix (avoid NOT_SUPPORTED)
# - items need to be of type "float" (allow negative + float)
#
ERROR_NO_ACCESS_FILE="-0.9900"
ERROR_NO_ACCESS="-0.9901"
ERROR_WRONG_PARAM="-0.9902"
ERROR_DATA="-0.9903" # either can not connect / bad host / bad port
# save the FPM stats in a variable for future parsing
FPM_STATS=$($WGET_BIN -q $URL -O - 2> /dev/null)
# error during retrieve
if [ $? -ne 0 -o -z "$FPM_STATS" ]; then
echo $ERROR_DATA
exit 1
fi
#
# Extract data from FPM stats
#
RESULT=$(echo "$FPM_STATS" | sed -n -r "s/^$ZBX_REQ_DATA: +([0-9]+)/\1/p")
if [ $? -ne 0 -o -z "$RESULT" ]; then
echo $ERROR_WRONG_PARAM
exit 1
fi
echo $RESULT
exit 0
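The sed expression in the script picks one numeric value out of the "name: value" lines of the FPM status page. The same extraction can be sketched in Python as follows; the function name fpm_value is hypothetical, and the sample text is an abbreviated, made-up status page.

```python
import re


def fpm_value(status_text, name):
    """Return the integer value of `name` from an FPM status page, or None."""
    pattern = re.compile(
        r"^" + re.escape(name) + r":[ \t]+(\d+)[ \t]*$", re.MULTILINE
    )
    match = pattern.search(status_text)
    return int(match.group(1)) if match else None


# Abbreviated, hypothetical status page content.
sample_status = (
    "pool:                 www\n"
    "accepted conn:        1024\n"
    "listen queue:         0\n"
    "max listen queue:     3\n"
)
print(fpm_value(sample_status, "accepted conn"))
```

Anchoring the pattern at the start of the line keeps "listen queue" from also matching "max listen queue", which is what the ^ in the sed expression does as well.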

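Import the template

Download the php-fpm Zabbix template

Tomcat monitoring

When we first decided to monitor Tomcat we used JMX, but it is too complicated to set up and demanding on the firewall, since several ports have to be opened. So we fell back on Tomcat's built-in status page.

Custom key

UserParameter=tomcat.status[*],/data/sh/tomcat-status.py $1

Because the XML has to be parsed, Python is the more convenient choice.
The /data/sh/tomcat-status.py script: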

#!/usr/bin/env python3
# Rewritten for Python 3; the original post used Python 2 (urllib2).
import sys
import urllib.request
import xml.dom.minidom

url = 'http://127.0.0.1:8080/manager/status?XML=true'
username = 'username'
password = 'password'

# Authenticate against the Tomcat manager status page with HTTP basic auth
passman = urllib.request.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, username, password)
authhandler = urllib.request.HTTPBasicAuthHandler(passman)
opener = urllib.request.build_opener(authhandler)
xml_data = opener.open(url).read()
doc = xml.dom.minidom.parseString(xml_data)

# Supported items, written as "<element>.<attribute>"
SUPPORTED = {
    "memory": ("free", "total", "max"),
    "threadInfo": ("maxThreads", "currentThreadCount", "currentThreadsBusy"),
    "requestInfo": ("maxTime", "processingTime", "requestCount",
                    "errorCount", "bytesReceived", "bytesSent"),
}

item = sys.argv[1]
tag, _, attr = item.partition(".")
if attr in SUPPORTED.get(tag, ()):
    # Only the first matching element is read
    print(doc.getElementsByTagName(tag)[0].getAttribute(attr))
else:
    print("unsupported item.")

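This script monitors Tomcat 7. It has not been tried with Tomcat 6; the differences are probably the status page URL and how the manager users and passwords are configured. For the script to run, a user must be added in tomcat-users.xml with at least the manager-status role.

Import the template

Download the Tomcat Zabbix template

Nginx monitoring

Configure the Nginx status page

Add the following inside server{} in the nginx configuration file:

location /nginx_status {
stub_status on;
access_log off;
}

Custom key

UserParameter=nginx[*],/data/sh/nginx-status.sh "$1"

The nginx-status.sh script: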

#!/bin/bash
##################################
# Zabbix monitoring script
#
# nginx:
# - anything available via nginx stub-status module
#
##################################
# Contact:
# [email protected]
##################################
# ChangeLog:
# 20100922 VV initial creation
##################################
# Zabbix requested parameter
ZBX_REQ_DATA="$1"
ZBX_REQ_DATA_URL="$2"
# Nginx defaults
URL="http://127.0.0.1/nginx_status"
WGET_BIN="/usr/bin/wget"
#
# Error handling:
# - need to be displayable in Zabbix (avoid NOT_SUPPORTED)
# - items need to be of type "float" (allow negative + float)
#
ERROR_NO_ACCESS_FILE="-0.9900"
ERROR_NO_ACCESS="-0.9901"
ERROR_WRONG_PARAM="-0.9902"
ERROR_DATA="-0.9903" # either can not connect / bad host / bad port
# save the nginx stats in a variable for future parsing
NGINX_STATS=$($WGET_BIN -q $URL -O - 2> /dev/null)
# error during retrieve
if [ $? -ne 0 -o -z "$NGINX_STATS" ]; then
echo $ERROR_DATA
exit 1
fi
#
# Extract data from nginx stats
#
case $ZBX_REQ_DATA in
active_connections) echo "$NGINX_STATS" | head -1 | cut -f3 -d' ';;
accepted_connections) echo "$NGINX_STATS" | grep -Ev '[a-zA-Z]' | cut -f2 -d' ';;
handled_connections) echo "$NGINX_STATS" | grep -Ev '[a-zA-Z]' | cut -f3 -d' ';;
handled_requests) echo "$NGINX_STATS" | grep -Ev '[a-zA-Z]' | cut -f4 -d' ';;
reading) echo "$NGINX_STATS" | tail -1 | cut -f2 -d' ';;
writing) echo "$NGINX_STATS" | tail -1 | cut -f4 -d' ';;
waiting) echo "$NGINX_STATS" | tail -1 | cut -f6 -d' ';;
*) echo $ERROR_WRONG_PARAM; exit 1;;
esac
exit 0
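The cut and grep calls in the script depend on the four-line layout of the stub_status page. As a cross-check of which field is which, here is a Python sketch that parses the same text into the item names used by the script; the function name parse_stub_status and the sample numbers are hypothetical.

```python
import re


def parse_stub_status(text):
    """Parse nginx stub_status output into the items used by nginx-status.sh."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    # The third line holds three counters: accepts, handled, requests
    counters = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE)
    rww = re.search(r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text)
    accepted, handled, requests = map(int, counters.groups())
    reading, writing, waiting = map(int, rww.groups())
    return {
        "active_connections": int(active.group(1)),
        "accepted_connections": accepted,
        "handled_connections": handled,
        "handled_requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


# Hypothetical stub_status page content.
sample_page = (
    "Active connections: 291 \n"
    "server accepts handled requests\n"
    " 16630948 16630948 31070465 \n"
    "Reading: 6 Writing: 179 Waiting: 106 \n"
)
print(parse_stub_status(sample_page)["active_connections"])
```

Each dictionary key corresponds to one branch of the case statement in the shell script.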

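Import the template

Download the nginx Zabbix template

MySQL monitoring

Zabbix supports MySQL monitoring out of the box, with a ready-made template and ready-made keys. All we need to do is create a .my.cnf file in /var/lib/zabbix with the following content: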

[client]
host=127.0.0.1
port=1036
user=root
password=root

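Website log monitoring

Configure the log format

Assuming your web server is Nginx, add a log format like this:

log_format withHost '$remote_addr\t$remote_user\t$time_local\t$host\t$request\t'
'$status\t$body_bytes_sent\t$http_referer\t'
'$http_user_agent';

We use tabs as the separator so that awk can identify the columns reliably.
Then set a global access log, so that the individual server blocks do not need their own log settings:

access_log /data/home/logs/nginx/$host.log withHost;

Extract the last minute of logs on a schedule

Set up a scheduled task:

* * * * * /data/sh/get_nginx_access.sh

The script: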

#!/bin/bash
logDir=/data/home/logs/nginx/
logNames=$(ls ${logDir}/*.*.log | awk -F"/" '{print $NF}')
for logName in $logNames;
do
    # Set variables
    split_log="/tmp/split_$logName"
    access_log="${logDir}/$logName"
    status_log="/tmp/$logName"
    # Extract the log lines from the last minute, reading the file backwards
    tac $access_log | awk '
    BEGIN{
        FS="\t"
        OFS="\t"
        cmd="date -d \"1 minute ago\" +%H%M%S"
        cmd|getline oneMinuteAgo
    }
    {
        # Reduce $time_local to HHMMSS
        $3 = substr($3,13,8)
        gsub(":","",$3)
        if ($3>=oneMinuteAgo){
            print
        } else {
            exit;
        }
    }' > $split_log
    # Count the occurrences of each status code per host
    awk -F'\t' '{
        status[$4" "$6]++
    }
    END{
        for (i in status)
        {
            print i,status[i]
        }
    }
    ' $split_log > $status_log
done

The task runs every minute because our monitoring interval is one minute. It extracts each domain's log for the last minute and counts every status code per domain, so that Zabbix can pick up the data it needs.

Custom keys

UserParameter=nginx.detect,/data/sh/nginx-detect.sh
UserParameter=nginx.access[*],awk -v sum=0 -v domain=$1 -v code=$2 '{if($$1 == domain && $$2 == code ){sum+=$$3} }END{print sum}' /tmp/$1.log
UserParameter=nginx.log[*],awk -F'\t' -v domain=$1 -v code=$2 -v number=$3 -v sum=0 -v line="" '{if ($$4 == domain && $$6 == code ){sum++;line=line$$5"\n" }}END{if (sum > number) print line}' /tmp/split_$1.log | sort | uniq -c | sort -nr | head -10 | sed -e 's/^/<p>/' -e 's/$/<\/p>/'

The nginx-detect.sh script:

#!/bin/bash
# Print the opening of the JSON document
function json_head {
    printf "{"
    printf "\"data\":["
}
# Print the closing of the JSON document
function json_end {
    printf "]"
    printf "}"
}
# Print a comma before every element except the first
function check_first_element {
    if [[ $FIRST_ELEMENT -ne 1 ]]; then
        printf ","
    fi
    FIRST_ELEMENT=0
}
FIRST_ELEMENT=1
json_head
logNames=$(ls /data/home/logs/nginx/*.*.log | awk -F"/" '{print $NF}')
for logName in $logNames;
do
    # Each line of /tmp/<log name> holds: domain, status code, count
    while read domain code count;do
        check_first_element
        printf "{"
        printf '"{#DOMAIN}":"%s","{#CODE}":"%s"' "$domain" "$code"
        printf "}"
    done < /tmp/$logName
done
json_end

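Here we define three keys: nginx.detect discovers all domains and all of their status codes, nginx.access[*] counts the occurrences of a given status code for a given domain, and nginx.log[*] outputs the top ten URLs when the count of a given status code for a given domain exceeds the specified value. Monitoring the nginx access logs this way relies on Zabbix's automatic discovery feature: when a domain is added, the scripts do not need to change, because Zabbix discovers the new domain and starts monitoring it.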
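The discovery key has to return JSON in which each element maps the macros {#DOMAIN} and {#CODE} to concrete values. The structure that nginx-detect.sh prints can be sketched in Python with the standard json module, which also takes care of escaping; the function name discovery_json and the sample rows are hypothetical.

```python
import json


def discovery_json(rows):
    """Build the discovery document from (domain, status_code) pairs."""
    data = [{"{#DOMAIN}": domain, "{#CODE}": code} for domain, code in rows]
    # Compact separators give the same shape as the output of the shell script
    return json.dumps({"data": data}, separators=(",", ":"))


print(discovery_json([("devops.webres.wang", "404"), ("devops.webres.wang", "500")]))
```

Every element produced here becomes one set of item and trigger prototypes once the discovery rule runs.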

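Configure the discovery rule

Add a discovery rule that discovers the domains and their status codes.

Configure the item prototypes

Monitor all domains and status codes:

Monitor domains whose 404 count exceeds 200:

Monitor domains whose 500 count exceeds 50:

Configure the triggers

Alert when the 404 count exceeds 200:

Alert when the 500 count exceeds 50: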

Monitoring php-fpm status with Jiankongbao (监控宝)

Last time we covered how to enable the php-fpm status page, which is a valuable reference when tuning php-fpm parameters. We can use Jiankongbao's custom monitoring to record php-fpm's status and see how PHP requests behave over time. Before starting, make sure the php-fpm status page is enabled.
I. Create the data collection script
Create the script /home/sh/monitor_fpm.sh and add it to cron so that it runs every five minutes. The script:
fpm_status=$(curl -s http://devops.webres.wang/fpm_status)
start_since_now=$(echo "$fpm_status" | awk -F':' '/start since/{gsub(/ /,"",$2);print $2}')
listen_queue=$(echo "$fpm_status" | awk -F':' '/^listen queue:/{gsub(/ /,"",$2);print $2}')
idle_processes=$(echo "$fpm_status" | awk -F':' '/idle processes/{gsub(/ /,"",$2);print $2}')
active_processes=$(echo "$fpm_status" | awk -F':' '/^active processes:/{gsub(/ /,"",$2);print $2}')
total_processes=$(echo "$fpm_status" | awk -F':' '/total processes/{gsub(/ /,"",$2);print $2}')
accepted_conn_now=$(echo "$fpm_status" | awk -F':' '/accepted conn/{gsub(/ /,"",$2);print $2}')
max_listen_queue=$(echo "$fpm_status" | awk -F':' '/max listen queue/{gsub(/ /,"",$2);print $2}')
max_active_processes=$(echo "$fpm_status" | awk -F':' '/max active processes/{gsub(/ /,"",$2);print $2}')
max_children_reached=$(echo "$fpm_status" | awk -F':' '/max children reached/{gsub(/ /,"",$2);print $2}')
# Connections accepted since the previous run
if [ -f "/tmp/accepted_conn78" ];then
    accepted_conn_pre=$(cat /tmp/accepted_conn78)
    ((accepted_conn_inc=$accepted_conn_now - $accepted_conn_pre))
    [[ $accepted_conn_inc -lt 0 ]] && accepted_conn_inc=0
else
    accepted_conn_inc=0
fi
echo $accepted_conn_now > /tmp/accepted_conn78

# Requests per second since the previous run
if [ -f "/tmp/start_since78" ];then
    start_since_pre=$(cat /tmp/start_since78)
    ((start_since_inc=$start_since_now - $start_since_pre))
    # -le also covers an interval of zero, which would otherwise divide by zero
    [[ $start_since_inc -le 0 ]] && per_request=0 || ((per_request=$accepted_conn_inc/$start_since_inc))
else
    per_request=0
fi
echo $start_since_now > /tmp/start_since78
echo "

accepted_conn:$accepted_conn_inc
listen_queue:$listen_queue
idle_processes:$idle_processes
active_processes:$active_processes
total_processes:$total_processes
per_request:$per_request
max_listen_queue:$max_listen_queue
max_active_processes:$max_active_processes
max_children_reached:$max_children_reached

" > /home/devops.webres.wang/web/php_status.html
II. Add the custom monitor in Jiankongbao
1. Click "Create monitoring item" in the site header, scroll to the bottom, choose to create a custom monitor, then click "Create custom monitoring rule".
2. Fill in the basic information.

3. Add the rule metrics.

4. Add the PHP requests graph.

5. Add the PHP processes graph.

6. Add the PHP maximum values graph.

7. Click Finish and enter the page that monitors fpm.

After completing these steps, the php-fpm status statistics become visible after a while.

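Tracking website PV and IP counts with rrdtool

Our web servers are already monitored over SNMP, covering CPU, memory, traffic and so on. One more thing seemed worth adding: the site's PV (page views) and IP (unique visitor addresses) counts, which make it quick to tell whether a rise in server load is caused by an increase in traffic. I have just started learning rrdtool over the last few days; it both stores data and draws graphs, which is very convenient.
[Figure: PV and IP statistics for the last day]

1. Install rrdtool

centos: yum install rrdtool
ubuntu: sudo apt-get install rrdtool

2. Create the rrdtool database

rrdtool create /var/www/test.rrd \
-s 300 \
DS:pv:GAUGE:600:U:U \
DS:ip:GAUGE:600:U:U \
RRA:AVERAGE:0.5:1:288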

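This creates a data file named test.rrd. The parameters are:
-s 300 store one data point every 300 seconds
DS:pv:GAUGE:600:U:U
DS:ip:GAUGE:600:U:U define two data sources (DS), named pv and ip
RRA:AVERAGE:0.5:1:288 define an RRA, comparable to a database table, that stores one day of data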
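The claim that this RRA holds one day of data follows from simple arithmetic: each row covers step × steps-per-row seconds, and there are 288 rows. A small Python check, with variable names chosen here for illustration:

```python
STEP_SECONDS = 300     # -s 300
STEPS_PER_ROW = 1      # RRA:AVERAGE:0.5:1:288 -> one step per stored row
ROWS = 288             # RRA:AVERAGE:0.5:1:288 -> number of rows kept
HEARTBEAT = 600        # DS:...:GAUGE:600 -> maximum gap between updates


def rra_coverage_seconds(step, steps_per_row, rows):
    """Return how many seconds of history a round-robin archive can hold."""
    return step * steps_per_row * rows


coverage = rra_coverage_seconds(STEP_SECONDS, STEPS_PER_ROW, ROWS)
print(coverage, "seconds =", coverage / 3600, "hours")
# The heartbeat is twice the step, so one late update is tolerated before
# the value for that interval is recorded as unknown.
print(HEARTBEAT // STEP_SECONDS)
```

Keeping a longer history would mean adding further RRA lines with more steps per row, for example one consolidated row per hour.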

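3. Create the update script

#!/bin/bash
becur=$(date -d "5 minute ago" +%H%M%S)
# Log lines from the last five minutes, excluding static files
list=$(tac /var/log/apache2/access.log | awk -v a="$becur" -F '[ :]' '{t=$5$6$7;if (t>=a) {print;} else {exit;} }' | grep -E -v "\.(gif|jpg|jpeg|png|css|js)")
# PV within the last five minutes
pv=$(echo "$list" | wc -l)
# Unique IPs within the last five minutes
ip=$(echo "$list" | awk '{print $1}' | sort | uniq | wc -l)
# Update the database every five minutes
rrdtool update /var/www/test.rrd N:${pv}:${ip}
# Update the image every five minutes
rrdtool graph /var/www/1h-pv.png \
-t "PV and IP statistics in an hour" \
--start now-3600 \
--watermark "$(date)" \
--no-gridfit \
--slope-mode \
-l 0 \
-y 1000:5 \
-X 0 \
DEF:mypv=/var/www/test.rrd:pv:AVERAGE \
DEF:myip=/var/www/test.rrd:ip:AVERAGE \
AREA:mypv#9F35FF:"PV Num" \
AREA:myip#00DB00:"IP Num"

Add this script to cron so that it runs every five minutes.
The script both updates the data and generates the image. The parameters are:
-t "PV and IP statistics in an hour" sets the graph title
--start now-3600 uses the data from the last hour
-l 0 starts the y-axis at 0
-y 1000:5 draws a y-axis grid line every 1000 and a label every 5 grid lines
-X 0 shows the y-axis values unscaled

rrdtool documentation and tutorials: http://oss.oetiker.ch/rrdtool/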

Shell script for monitoring MySQL master-slave replication health

#!/bin/bash
#define mysql variable
mysql_user="root"
mysql_pass="123456"
email_addr="[email protected]"
mysql_status=`netstat -nl | awk 'NR>2{if ($4 ~ /.*:3306/) {print "Yes";exit 0}}'`
if [ "$mysql_status" == "Yes" ];then
slave_status=`mysql -u${mysql_user} -p${mysql_pass} -e"show slave status\G" | grep -E "Slave_(IO|SQL)_Running:" | awk '{if ($2 != "Yes") {print "No";exit 1}}'`
if [ "$slave_status" == "No" ];then
echo "slave is not working!"
[ ! -f "/tmp/slave" ] && echo "Slave is not working!" | mail -s "Warn!MySQL Slave is not working" ${email_addr}
touch /tmp/slave
else
echo "slave is working."
[ -f "/tmp/slave" ] && rm -f /tmp/slave
fi
[ -f "/tmp/mysql_down" ] && rm -f /tmp/mysql_down
else
[ ! -f "/tmp/mysql_down" ] && echo "Mysql Server is down!" | mail -s "Warn!MySQL server is down!" ${email_addr}
touch /tmp/mysql_down
fi

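The script first checks whether the MySQL server is running. If it is, the script goes on to check replication; otherwise it sends an alert email, once only.
Replication is considered healthy when both the IO thread and the SQL thread report Yes. If either does not, a notification email is sent, again once only.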
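The replication check boils down to reading two fields from the SHOW SLAVE STATUS\G output. The following Python sketch shows that decision on its own; the function name slave_is_healthy and the sample text are hypothetical, and running the mysql client and sending mail are left out.

```python
def slave_is_healthy(status_text):
    """Return True only if both the IO and SQL replication threads report Yes."""
    running = {}
    for line in status_text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Match the two thread fields exactly, not e.g. Slave_SQL_Running_State
        if sep and key in ("Slave_IO_Running", "Slave_SQL_Running"):
            running[key] = value.strip()
    return len(running) == 2 and all(v == "Yes" for v in running.values())


# Hypothetical excerpt of `show slave status\G` output.
sample_ok = (
    "             Slave_IO_Running: Yes\n"
    "            Slave_SQL_Running: Yes\n"
    "      Slave_SQL_Running_State: Slave has read all relay log\n"
)
print(slave_is_healthy(sample_ok))
```

Requiring both fields to be present means an empty result, for instance from a server that is not configured as a slave, is reported as unhealthy rather than silently passing.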

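Monitoring MySQL master-slave status with Nagios

There are two ways to monitor MySQL replication with Nagios. One uses the Nagios NRPE plugin to run a shell script on the remote host and send the data back to the monitoring server for analysis. The other uses the extend feature of SNMP to run the remote script. This post describes the second method.

I. Settings on the MySQL slave server

1. Add a user on the MySQL slave server
Run the following statement to add the user:

mysql> GRANT REPLICATION CLIENT ON *.* TO monitor@localhost IDENTIFIED BY 'PassWord';

2. Download the check-mysql-slave.pl script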

cd /usr/local/bin/
wget http://devops.webres.wang/wp-content/uploads/2012/10/check-mysql-slave.pl
chmod +x check-mysql-slave.pl

3. Configure extend mysql-slave on the MySQL slave server
Append the following line to /etc/snmp/snmpd.conf:

extend mysql-slave /usr/local/bin/check-mysql-slave.pl --user monitor --pass PassWord --sock /var/lib/mysql/mysql.sock

Remember to change the parameters to match your own setup.
Then reload snmpd:

service snmpd reload

II. Settings on the monitoring server

1. Download the check_snmp_extend.sh script

mkdir /usr/local/nagios/libexec.local
cd /usr/local/nagios/libexec.local
wget http://devops.webres.wang/wp-content/uploads/2012/10/check_snmp_extend.sh
chmod +x check_snmp_extend.sh

2. Define the USER10 variable
Add the following variable to /usr/local/nagios/etc/resource.cfg:

$USER10$=/usr/local/nagios/libexec.local

3. Define the check_snmp_extend command
Add the following to /usr/local/nagios/etc/objects/commands.cfg:

define command{
command_name check_snmp_extend
command_line $USER10$/check_snmp_extend.sh $HOSTADDRESS$ $ARG1$
}

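4. Define the service that monitors MySQL replication
Add the following service to the host's configuration file, e.g. /usr/local/nagios/etc/objects/devops.webres.wang.cfg (note that this .cfg file must already be included from the nagios.cfg configuration file):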

define host{
use linux-server
host_name devops.webres.wang
alias devops.webres.wang
address 142.4.33.74
}
……
……
define service{
## This is an example service configured as
## extend servicename /path/to/service-check.sh
## on remote.server in /etc/snmp/snmpd.conf
use generic-service
host_name devops.webres.wang
service_description mysql slave status
check_command check_snmp_extend!mysql-slave
}

Reference: http://www.logix.cz/michal/devel/nagios/

Common command definitions for monitoring services over SNMP with Nagios

# 'check_system' command definition
define command{
command_name check_system
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o sysDescr.0
}
# 'snmp_load' command definition
define command{
command_name snmp_load
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o .1.3.6.1.4.1.2021.10.1.5.1,.1.3.6.1.4.1.2021.10.1.5.2,.1.3.6.1.4.1.2021.10.1.5.3 -w :'$ARG2$',:'$ARG3$',:'$ARG4$' -c :'$ARG5$',:'$ARG6$',:'$ARG7$' -l load
}
# 'snmp_cpustats' command definition
define command{
command_name snmp_cpustats
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o .1.3.6.1.4.1.2021.11.9.0,.1.3.6.1.4.1.2021.11.10.0,.1.3.6.1.4.1.2021.11.11.0 -l 'CPU usage (user system idle)' -u '%'
}
# 'snmp_procname' command definition
define command{
command_name snmp_procname
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o .1.3.6.1.4.1.2021.2.1.5.'$ARG2$' -w '$ARG3$':'$ARG4$' -c '$ARG5$':'$ARG6$'
}
# 'snmp_disk' command definition
define command{
command_name snmp_disk
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o .1.3.6.1.4.1.2021.9.1.7.'$ARG2$',.1.3.6.1.4.1.2021.9.1.9.'$ARG2$' -w '$ARG3$':,:'$ARG4$' -c '$ARG5$':,:'$ARG6$' -u 'kB free (','% used)' -l 'disk space'
}
# 'snmp_mem' command definition
define command{
command_name snmp_mem
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o .1.3.6.1.4.1.2021.4.6.0,.1.3.6.1.4.1.2021.4.5.0 -w '$ARG2$': -c '$ARG3$':
}
# 'snmp_uptime' command definition
define command{
command_name snmp_uptime
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o sysUpTime.0
}
# 'snmp_swap' command definition
define command{
command_name snmp_swap
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o .1.3.6.1.4.1.2021.4.4.0,.1.3.6.1.4.1.2021.4.3.0 -w '$ARG2$': -c '$ARG3$':
}
# 'snmp_procs' command definition
define command{
command_name snmp_procs
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o host.hrSystem.hrSystemProcesses -w :'$ARG2$' -c :'$ARG3$' -l processes
}
# 'snmp_users' command definition
define command{
command_name snmp_users
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o host.hrSystem.hrSystemNumUsers -w :'$ARG2$' -c :'$ARG3$' -l users
}
# 'snmp_mem2' command definition
define command{
command_name snmp_mem2
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o host.hrStorage.hrStorageTable.hrStorageEntry.hrStorageUsed.'$ARG2$',host.hrStorage.hrStorageTable.hrStorageEntry.hrStorageSize.'$ARG2$' -w '$ARG3$' -c '$ARG4$'
}
# 'snmp_swap2' command definition
define command{
command_name snmp_swap2
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o host.hrStorage.hrStorageTable.hrStorageEntry.hrStorageUsed.'$ARG2$',host.hrStorage.hrStorageTable.hrStorageEntry.hrStorageSize.'$ARG2$' -w '$ARG3$' -c '$ARG4$'
}
# 'snmp_mem3' command definition
define command{
command_name snmp_mem3
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o host.hrStorage.hrStorageTable.hrStorageEntry.hrStorageUsed.'$ARG2$',host.hrStorage.hrStorageTable.hrStorageEntry.hrStorageSize.'$ARG2$' -w '$ARG3$' -c '$ARG4$'
}
# 'snmp_swap3' command definition
define command{
command_name snmp_swap3
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o host.hrStorage.hrStorageTable.hrStorageEntry.hrStorageUsed.'$ARG2$',host.hrStorage.hrStorageTable.hrStorageEntry.hrStorageSize.'$ARG2$' -w '$ARG3$' -c '$ARG4$'
}
# 'snmp_disk2' command definition
define command{
command_name snmp_disk2
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o host.hrStorage.hrStorageTable.hrStorageEntry.hrStorageUsed.'$ARG2$' -w '$ARG3$' -c '$ARG4$'
}
# 'snmp_tcpopen' command definition
define command{
command_name snmp_tcpopen
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o tcp.tcpCurrEstab.0 -w '$ARG2$' -c '$ARG3$'
}
# 'snmp_tcpstats' command definition
define command{
command_name snmp_tcpstats
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o tcp.tcpActiveOpens.0,tcp.tcpPassiveOpens.0,tcp.tcpInSegs.0,tcp.tcpOutSegs.0,tcp.tcpRetransSegs.0 -l 'TCP stats'
}
# 'check_snmp_bgpstate' command definition
define command{
command_name check_snmp_bgpstate
command_line /usr/lib/nagios/plugins/check_bgpstate '$HOSTADDRESS$' -c '$ARG1$'
}
# 'check_netapp_uptime' command definition
define command{
command_name check_netapp_uptime
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o .1.3.6.1.2.1.1.3.0 --delimiter=')' -l "Uptime is"
}
# 'check_netapp_cpuload' command definition
define command{
command_name check_netapp_cpuload
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o .1.3.6.1.4.1.789.1.2.1.3.0 -w 90 -c 95 -u '%' -l "CPU LOAD "
}
# 'check_netapp_numdisks' command definition
define command{
command_name check_netapp_numdisks
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o .1.3.6.1.4.1.789.1.6.4.1.0,.1.3.6.1.4.1.789.1.6.4.2.0,.1.3.6.1.4.1.789.1.6.4.8.0,.1.3.6.1.4.1.789.1.6.4.7.0 -u 'Total Disks','Active','Spare','Failed' -l ""
}
# 'check_compaq_thermalCondition' command definition
define command{
command_name check_compaq_thermalCondition
command_line /usr/lib/nagios/plugins/check_snmp -H '$HOSTADDRESS$' -C '$ARG1$' -o .1.3.6.1.4.1.232.6.2.1.0,.1.3.6.1.4.1.232.6.2.2.0,.1.3.6.1.4.1.232.6.2.3.0,.1.3.6.1.4.1.232.6.2.4.0 -u 'ThermalCondition','ThermalTemp','ThermalSystem','ThermalCPUFan' -w 2:2,2:2,2:2,2:2 -c 1:2,1:2,1:2,1:2 -l "Thermal status "
}

Reproduced from: https://nsrc.org/workshops/configs/2010/apricot/etc/nagios-plugins/config/snmp.cfg

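Monitoring disk I/O with Cacti

1. Check whether net-snmp supports I/O monitoring

snmpwalk -v 3 -u username -A password -a md5 -l authNoPriv localhost .1.3.6.1.4.1.2021.13.15.1.1.1

Run the command above. If it returns data like the following, disk I/O monitoring is supported; otherwise net-snmp has to be recompiled with the diskio module added.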

UCD-DISKIO-MIB::diskIOIndex.1 = INTEGER: 1
UCD-DISKIO-MIB::diskIOIndex.2 = INTEGER: 2
UCD-DISKIO-MIB::diskIOIndex.3 = INTEGER: 3
……

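2. Download Cacti_Net-SNMP_DevIO_v3.1.zip

Download Cacti_Net-SNMP_DevIO_v3.1.zip, unpack it, and upload net-snmp_devio.xml to the /resource/snmp_queries/ directory.

3. Import the templates

Import all the *_TMPL.xml files through "Import Templates" in the Cacti admin console, then import net-snmp_devIO-Data_query.xml last. Once that is done, "ucd/net - Get Device I/O" appears under "Data Queries".

4. Add disk I/O monitoring to an existing "ucd/net SNMP Host"

Switch to "devices" and click an existing "ucd/net snmp host" host, in my case devops.webres.wang-ucd/net-snmp. Near the bottom of the page, under "Associated Data Queries", choose "ucd/net - Get Device I/O" for "Add Data Query:", choose "Index Count Changed" for "Re-Index Method:", and click "Add" to add the data query.

5. Create the I/O graphs

Next, click "Create Graphs for this Host" at the top of the page, select the disks to monitor under "Data Query [ucd/net - Get Device I/O]", and click "create" to create the graphs.
This completes the Cacti disk I/O monitoring setup.
Reference: http://forums.cacti.net/about8777-0-asc-0.html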
