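heartbeat - Linux System Operations Log

Dual-node hot standby plus load balancing for production (Heartbeat + DRBD + NFS + Keepalived + LNMP)

This article builds the following architecture: heartbeat + DRBD + NFS keep the MySQL and website data in sync, keepalived provides high availability for nginx, and nginx together with DNS round robin provides load balancing.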

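Architecture overview

Directory layout

/usr/local/src/lnmp: source tarballs, tools and so on
/data: all data, plus the NFS and DRBD mount points
/data/shell: all management scripts
/data/mysql: mount point for the DRBD mysql resource, where MySQL stores its databases
/data/wwwnfs: mount point for the DRBD www resource; it is exported over NFS so that both nodes can mount it at /data/www for the forum and other application data
/data/www: mount point for the NFS export, holding the forum (website) application data

How the topology works

Internal network:
1. DRBD provides two replicated resources: mysql, used to replicate the MySQL data, and www, which holds the web (forum) data shared over NFS. Two virtual IPs are defined: 192.168.1.100 for database connections and 192.168.1.200 for the nodes to mount NFS.
Note: the www data passes through three layers: the DRBD device, the file system mounted on it, and the NFS mount on each client.
2. Heartbeat provides HA for DRBD, brings up the two internal virtual IPs, and starts and stops NFS and MySQL.

External network:
1. Both nodes run nginx as a load balancer that distributes requests to the two nodes over the internal network (internal balancing).
2. DNS maps one domain name to two IPs, giving DNS round robin (external balancing).
3. Keepalived runs in a dual-master configuration with two virtual IPs, 12.12.12.100 on node 1 and 12.12.12.200 on node 2, both reachable from the external network. Each node is master for one IP and backup for the other. When one node fails, the other becomes master for both resources and holds both virtual IPs, which completes the failover.

The known weaknesses of DNS round robin are that changes propagate slowly and that it distributes load blindly: in theory every request could land on the same node and overload it. Because nginx balances again internally, uneven DNS distribution matters much less here: whichever node receives a request, nginx schedules it across both backends according to its configured weights. DNS round robin and nginx therefore complement each other and the hardware is used sensibly, while keepalived keeps each entry point available. The result is close to ideal.
Topology diagram: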

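Implementation

LNMP setup

Three points need attention when setting up LNMP:
Note 1: do not initialize or start MySQL on either node yet; it is configured separately later.
Note 2: change every nginx listen port to 8080, because nginx will also be set up as the load balancer serving external traffic on the default port 80.
Note 3: run both nginx and php-fpm as the user www.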

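Installing and configuring NFS

1. Install NFS

yum install nfs-utils nfs4-acl-tools portmap

2. Configure /etc/exports

/data/wwwnfs 192.168.1.0/24(rw,no_root_squash,sync,anonuid=502,anongid=502)

Notes:
/data/wwwnfs: the directory both nodes mount. All site code lives here, so the forum and other application data is shared between the nodes.
anonuid=502,anongid=502: requests that the server squashes to the anonymous account are handled as uid=502 and gid=502, which is the www user on these machines. As written, only squashed requests are affected; to map every client user to www, all_squash would also be needed.
3. Start the services

service portmap start
service nfs start

Remember that portmap must be started first.

chkconfig nfs off
chkconfig portmap on

Note: portmap has to stay running at all times and is not managed by heartbeat, whereas nfs must be started and stopped by heartbeat, so its automatic start at boot is disabled here.

NFS locking may also need to be enabled. Two nodes work on the same data, so something has to arbitrate concurrent access. Locking is mandatory when NFS backs MySQL. For a forum or website it depends: enable it if the same file can be modified from both nodes at once; otherwise leave it off, because locking reduces NFS performance:

/sbin/rpc.lockd
echo "/sbin/rpc.lockd" >>/etc/rc.local

4. Mount automatically at boot

echo "sleep 20" >>/etc/rc.local
echo "/bin/mount -t nfs 192.168.1.200:/data/wwwnfs /data/www" >>/etc/rc.local

Why wait 20 seconds before mounting? Without the delay the mount fails, because heartbeat has not yet brought up the VIP it manages.
To mount right away:

mount -t nfs 192.168.1.200:/data/wwwnfs /data/www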
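
A fixed 20-second delay is a guess about how long heartbeat needs. As a hedged alternative, the sketch below retries a command until it succeeds or a timeout expires. The function name wait_until and the 60-second limit in the usage comment are illustrative assumptions, not part of the original setup.

```shell
#!/bin/bash
# wait_until TIMEOUT CMD...: retry CMD once per second until it succeeds
# or TIMEOUT seconds have passed. Returns 0 on success, 1 on timeout.
wait_until() {
    local timeout=$1
    shift
    local waited=0
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Hypothetical usage in /etc/rc.local, replacing the fixed "sleep 20":
# wait_until 60 ping -c 1 -W 1 192.168.1.200 >/dev/null 2>&1 &&
#     /bin/mount -t nfs 192.168.1.200:/data/wwwnfs /data/www
```

The mount then happens as soon as the VIP answers, and is skipped with a non-zero status if the VIP never appears.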

Installing and configuring DRBD

For installation see: http://devops.webres.wang/2012/02/drbd-compile-install-deploy/

Configuration files

DRBD uses three kinds of configuration files:
/usr/local/drbd/etc/drbd.conf
/usr/local/drbd/etc/drbd.d/global_common.conf
/usr/local/drbd/etc/drbd.d/*.res
1. drbd.conf

include "drbd.d/global_common.conf";
include "drbd.d/*.res";

2. global_common.conf

global {
    usage-count yes;
}
common {
    net {
        protocol C;
    }
}

3. mysql.res and www.res
mysql.res:

vi /usr/local/drbd/etc/drbd.d/mysql.res

# resource name
resource mysql {
    # node1, the preferred primary for this resource
    on node1 {
        # DRBD block device to create
        device /dev/drbd1;
        # backing partition to replicate
        disk /dev/sdb1;
        # address and port to listen on
        address 192.168.1.10:7788;
        # keep the metadata on the backing partition itself (internal)
        meta-disk internal;
    }
    # node2, the standby for this resource
    on node2 {
        device /dev/drbd1;
        disk /dev/sdb1;
        address 192.168.1.20:7788;
        meta-disk internal;
    }
}

www.res:

vi /usr/local/drbd/etc/drbd.d/www.res

# resource name
resource www {
    # node2, the preferred primary for this resource
    on node2 {
        # DRBD block device to create
        device /dev/drbd2;
        # backing partition to replicate
        disk /dev/sdb2;
        # address and port to listen on
        address 192.168.1.20:7789;
        # keep the metadata on the backing partition itself (internal)
        meta-disk internal;
    }
    # node1, the standby for this resource
    on node1 {
        device /dev/drbd2;
        disk /dev/sdb2;
        address 192.168.1.10:7789;
        meta-disk internal;
    }
}

Finally, copy these files to node2.

Initializing the DRBD resources

1) Bring up the mysql and www resources on both nodes

modprobe drbd
dd if=/dev/zero of=/dev/sdb1 bs=1M count=10
dd if=/dev/zero of=/dev/sdb2 bs=1M count=10
drbdadm create-md mysql
drbdadm create-md www
drbdadm up mysql
drbdadm up www

2) Promote each resource to primary on its preferred node
On node1:

drbdadm primary --force mysql

On node2:

drbdadm primary --force www

3) Create a file system on the DRBD devices
On node1:

mkfs.ext3 /dev/drbd1

On node2:

mkfs.ext3 /dev/drbd2

4) Mount the devices
On node1:

mount /dev/drbd1 /data/mysql

On node2:

mount /dev/drbd2 /data/wwwnfs
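
After these steps it helps to confirm each device's connection state and roles. The sketch below pulls them out of /proc/drbd-style text. It assumes the DRBD 8.3/8.4 line format with cs: and ro: fields (older releases used different field names), and the function name drbd_state is made up for this example.

```shell
# drbd_state MINOR: read /proc/drbd-style text on stdin and print
# "<connection state> <roles>" for the given device minor number.
drbd_state() {
    awk -v minor="$1" '
        $1 == (minor ":") {
            cs = ""; ro = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/) cs = substr($i, 4)
                if ($i ~ /^ro:/) ro = substr($i, 4)
            }
            print cs, ro
        }'
}
# Usage on a live node: drbd_state 1 < /proc/drbd
```

On node1 a healthy mysql resource would be expected to report a connected state with node1 in the Primary role.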

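Installing and configuring heartbeat

1. Install heartbeat

yum install heartbeat

The installation creates the user hacluster and the group haclient.
Make sure the hacluster user has the same UID and GID on both nodes.
2. Synchronize the clocks of the two nodes

rm -rf /etc/localtime
cp -f /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
yum install -y ntp
ntpdate cn.pool.ntp.org

3. Configure /etc/ha.d/ha.cf

debugfile /var/log/ha-debug # debug log file
keepalive 2 # send a heartbeat every 2 seconds
deadtime 10 # declare the peer dead after 10 seconds without a heartbeat
warntime 6 # log a "late heartbeat" warning after 6 seconds (ideally between 2 and 10)
initdead 120 # at startup, tolerate 120 seconds without contact; heartbeat
# waits this long after starting before it brings up any resource
udpport 694 # use UDP port 694
ucast eth0 192.168.1.20 # unicast heartbeat (each node lists the other node's IP)
node node1 # first node (the host name from uname -n, not a domain name)
node node2 # second node (the host name from uname -n, not a domain name)
auto_failback on # automatic failback to the preferred node after it recovers (the author's note advises against enabling this; use off to disable)
respawn hacluster /usr/lib/heartbeat/ipfail # run ipfail and restart it if it dies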
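
The four timing values depend on each other. The sketch below checks one rule taken from the stock ha.cf comments (initdead should be at least twice deadtime) plus two rule-of-thumb orderings that are assumptions of this example: keepalive below warntime, and warntime below deadtime. The function name check_ha_timing is hypothetical and the values are whole seconds.

```shell
# check_ha_timing KEEPALIVE WARNTIME DEADTIME INITDEAD (all in whole seconds)
# Prints one line per violated rule; returns 0 only if all rules hold.
check_ha_timing() {
    local keepalive=$1 warntime=$2 deadtime=$3 initdead=$4 rc=0
    if [ "$keepalive" -ge "$warntime" ]; then
        echo "warntime should be larger than keepalive"; rc=1
    fi
    if [ "$warntime" -ge "$deadtime" ]; then
        echo "deadtime should be larger than warntime"; rc=1
    fi
    if [ "$initdead" -lt $((deadtime * 2)) ]; then
        echo "initdead should be at least twice deadtime"; rc=1
    fi
    return $rc
}
# Usage with the values from this ha.cf: check_ha_timing 2 6 10 120
```

The values used in this ha.cf (2, 6, 10, 120) satisfy all three rules.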

4. /etc/ha.d/authkeys

auth 1
1 crc

5. /etc/ha.d/haresources

node1 IPaddr::192.168.1.100/24/eth0 drbddisk::mysql Filesystem::/dev/drbd1::/data/mysql::ext3 mysqld portmap
node2 IPaddr::192.168.1.200/24/eth0 drbddisk::www Filesystem::/dev/drbd2::/data/wwwnfs::ext3 portmap nfs

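6. Create the nfs resource script

vi /etc/ha.d/resource.d/nfs

Contents:

#!/bin/bash
# heartbeat resource script: start or stop the NFS server
NFSD=/etc/rc.d/init.d/nfs
case $1 in
start)
    $NFSD start
    ;;
stop)
    $NFSD stop
    # kill any nfsd processes that survived the normal stop
    NFSDPID=$(/sbin/pidof nfsd)
    if [ -n "$NFSDPID" ]; then
        for NFSPID in $NFSDPID; do
            /bin/kill -9 "$NFSPID"
        done
    fi
    ;;
*)
    echo "Syntax incorrect. You need one of {start|stop}"
    ;;
esac

Make the script executable:

chmod +x /etc/ha.d/resource.d/nfs

Start heartbeat on node1 first, then on node2.
Once both are running, check the following.
node1:
1. Run ip a and check that the virtual IP 192.168.1.100 is configured
2. Run cat /proc/drbd and check that the state is normal
3. Run df -h and check that /dev/drbd1 is mounted on /data/mysql
4. Run service mysqld status and check that MySQL is running
node2:
1. Run ip a and check that the virtual IP 192.168.1.200 is configured
2. Run cat /proc/drbd and check that the state is normal
3. Run df -h and check that /dev/drbd2 is mounted on /data/wwwnfs and that 192.168.1.200:/data/wwwnfs is mounted on /data/www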

nginx load balancer configuration

user www;
worker_processes 1;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
    worker_connections 1024;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;
    sendfile on;
    #tcp_nopush on;
    keepalive_timeout 65;
    #gzip on;
    # backend pool: both nodes, reached over the internal network
    upstream devops.webres.wang_server {
        server 192.168.1.10:8080 weight=3 max_fails=2 fail_timeout=30s;
        server 192.168.1.20:8080 weight=9 max_fails=2 fail_timeout=30s;
    }
    # public entry point on port 80: proxies to the pool
    server {
        listen 80;
        server_name devops.webres.wang;
        location / {
            root /data/www/devops.webres.wang;
            index index.php index.htm index.html;
            proxy_redirect off;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_pass http://devops.webres.wang_server;
        }
        access_log off;
    }
    # local backend on port 8080: serves static files and PHP
    server {
        listen 8080;
        server_name devops.webres.wang;
        index index.html index.htm index.php;
        root /data/www/devops.webres.wang;
        #limit_conn crawler 20;
        location ~ \.php$ {
            root /data/www/devops.webres.wang;
            fastcgi_pass 127.0.0.1:9000;
            fastcgi_index index.php;
            fastcgi_param SCRIPT_FILENAME /data/www/devops.webres.wang/$fastcgi_script_name;
            include fastcgi_params;
        }
        location ~ .*\.(gif|jpg|jpeg|png|bmp|swf)$ {
            expires 30d;
        }
        location ~ .*\.(js|css)?$ {
            expires 1h;
        }
        access_log off;
    }
}

Two backends are defined here, 192.168.1.10:8080 and 192.168.1.20:8080. proxy_pass http://devops.webres.wang_server forwards requests to them in weighted round-robin order, which balances the load.
To verify, create an index.php containing:

<?php
echo $_SERVER['SERVER_ADDR'];
?>

If refreshing the page several times shows different IPs, requests are being balanced across the servers.
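
Refreshing by hand gives only a rough impression. With weights 3 and 9, about one request in four should reach 192.168.1.10. A minimal sketch for counting which backend answered is shown below; the helper name tally_backends and the request count of 12 are illustrative assumptions.

```shell
# tally_backends: read one backend address per line on stdin and print
# "<address> <count>" for each distinct address, sorted by address.
tally_backends() {
    awk 'NF { count[$1]++ } END { for (a in count) print a, count[a] }' | sort
}
# Hypothetical usage against one VIP (12 requests to the index.php above):
# for i in $(seq 1 12); do curl -s http://12.12.12.100/index.php; echo; done | tally_backends
```

A split close to 3 and 9 over 12 requests would match the configured weights.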

HA for nginx and PHP with Keepalived

1. Install keepalived
For installation see: http://devops.webres.wang/2012/02/nginx-keepalived-high-availability/
2. Configuration
node1 configuration:

global_defs {
    notification_email {
        [email protected]
    }
    notification_email_from [email protected]
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id LVS_DEVEL
}
vrrp_instance VI_1 {
    state MASTER # BACKUP on the peer node
    interface eth0
    virtual_router_id 100
    mcast_src_ip 192.168.1.10 # this node's IP
    priority 102 # must be higher than on the backup
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        12.12.12.100
    }
}
vrrp_instance VI_2 {
    state BACKUP
    interface eth0
    virtual_router_id 200
    mcast_src_ip 192.168.1.10 # this node's IP
    priority 101 # must be lower than on the master
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        12.12.12.200
    }
}

node2 configuration:

global_defs {
    notification_email {
        [email protected]
    }
    notification_email_from [email protected]
    smtp_server 127.0.0.1
    smtp_connect_timeout 30
    router_id LVS_DEVEL
}
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 100
    mcast_src_ip 192.168.1.20 # this node's IP
    priority 101 # must be lower than on the master
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        12.12.12.100
    }
}
vrrp_instance VI_2 {
    state MASTER # BACKUP on the peer node
    interface eth0
    virtual_router_id 200
    mcast_src_ip 192.168.1.20 # this node's IP
    priority 102 # must be higher than on the backup
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        12.12.12.200
    }
}

3. Create the monitoring script
node1 monitoring script:

vi /opt/check.sh

#!/bin/bash
while :
do
    # if MySQL stops answering, stop heartbeat so that node2 takes over
    /usr/bin/mysqladmin -uroot ping >/dev/null 2>&1
    mysqlcode=$?
    heartbeat=$(ps -C heartbeat --no-headers | wc -l)
    if [ "$mysqlcode" -ne 0 ]; then
        if [ "$heartbeat" -ne 0 ]; then
            service heartbeat stop
        fi
    fi
    # if nginx or php-fpm is down, stop keepalived so that the VIP moves away;
    # once both are back, start keepalived again
    phpcheck=$(ps -C php-fpm --no-headers | wc -l)
    nginxcheck=$(ps -C nginx --no-headers | wc -l)
    keepalivedcheck=$(ps -C keepalived --no-headers | wc -l)
    if [ "$nginxcheck" -eq 0 ] || [ "$phpcheck" -eq 0 ]; then
        if [ "$keepalivedcheck" -ne 0 ]; then
            killall -TERM keepalived
        else
            echo "keepalived is stopped"
        fi
    else
        if [ "$keepalivedcheck" -eq 0 ]; then
            /etc/init.d/keepalived start
        else
            echo "keepalived is running"
        fi
    fi
    sleep 5
done

node2 monitoring script:

#!/bin/bash
while :
do
    # if nginx or php-fpm is down, stop keepalived so that the VIP moves away;
    # once both are back, start keepalived again
    phpcheck=$(ps -C php-fpm --no-headers | wc -l)
    nginxcheck=$(ps -C nginx --no-headers | wc -l)
    keepalivedcheck=$(ps -C keepalived --no-headers | wc -l)
    if [ "$nginxcheck" -eq 0 ] || [ "$phpcheck" -eq 0 ]; then
        if [ "$keepalivedcheck" -ne 0 ]; then
            killall -TERM keepalived
        else
            echo "keepalived is stopped"
        fi
    else
        if [ "$keepalivedcheck" -eq 0 ]; then
            /etc/init.d/keepalived start
        else
            echo "keepalived is running"
        fi
    fi
    sleep 5
done

These scripts provide HA for MySQL, nginx and php-fpm.
Make the script executable and run it:

chmod +x /opt/check.sh
nohup sh /opt/check.sh &

Start it at boot:
echo "nohup sh /opt/check.sh &" >> /etc/rc.local
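
The decision the monitoring loop makes every five seconds can be isolated into a small pure function, which makes the logic easy to check without touching any service. This is only a sketch of the same rules; the function name keepalived_action and its three count arguments are assumptions of this example.

```shell
# keepalived_action NGINX_COUNT PHP_COUNT KEEPALIVED_COUNT
# Prints "stop" when a web process is missing while keepalived runs,
# "start" when both are up but keepalived is not, otherwise "none".
keepalived_action() {
    local nginx=$1 php=$2 keepalived=$3
    if [ "$nginx" -eq 0 ] || [ "$php" -eq 0 ]; then
        if [ "$keepalived" -ne 0 ]; then echo stop; else echo none; fi
    else
        if [ "$keepalived" -eq 0 ]; then echo start; else echo none; fi
    fi
}
# Example: nginx is down (0 processes) while keepalived is still running
# keepalived_action 0 4 1
```

In the loop, the three counts would come from the same ps -C ... | wc -l pipelines used in the scripts.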

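4. Test keepalived
Start keepalived on both nodes:

service keepalived start

1) Run ip a on node1 and node2 and check that the VIPs 12.12.12.100 and 12.12.12.200 are present.
2) Test HA for nginx and php-fpm. On node1 run service nginx stop or service php-fpm stop. After a few seconds node2 has taken over the VIP 12.12.12.100, and browsing the site through either 12.12.12.100 or 12.12.12.200 always shows the IP 192.168.1.20. This shows that keepalived on node2 has taken over node1's VIP along with its nginx and php-fpm traffic.
3) Test MySQL HA. On node1 run service mysqld stop. A few seconds later node2 has taken over the VIP 192.168.1.100 and started MySQL.
Note: before bringing MySQL, nginx or php-fpm back up, stop the monitoring script first; otherwise it stops heartbeat or keepalived again before they have finished taking over.
Reference: http://bbs.ywlm.net/thread-965-1-1.html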

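LVS load balancing and high availability (heartbeat + ldirectord) cluster configuration

LVS is free, open-source load-balancing software that distributes load across multiple servers. Combined with heartbeat and ldirectord it can be built into a highly available cluster.

Server environment

The test environment used here is as follows.
System: CentOS 5, 32-bit, kernel 2.6.18-238.el5
Only two machines are available, so the LVS director and the backend servers share the same machines.
node1 192.168.79.130
node2 192.168.79.131
VIP 192.168.79.135
When node1 fails, the LVS director and the web service move to node2.
With enough machines, it is still better to keep the LVS director and the web servers separate.

Software installation

yum -y install heartbeat heartbeat-ldirectord ipvsadm

Configuration

The main configuration files are:
authkeys
ha.cf
ldirectord.cf
haresources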
authkeys

vi /etc/ha.d/authkeys

Contents:

auth 1
1 crc

ha.cf

vi /etc/ha.d/ha.cf

debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0
keepalive 8
deadtime 60
warntime 60
initdead 120
udpport 694
ucast eth0 192.168.79.131
auto_failback on
node node1
node node2
respawn hacluster /usr/lib/heartbeat/ipfail
apiauth ipfail gid=haclient uid=hacluster

On node2 the only difference is the line ucast eth0 192.168.79.131: change the IP to node1's IP (192.168.79.130).

haresources

vi /etc/ha.d/haresources

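Contents:

node1 lvs IPaddr::192.168.79.135/24/eth0:0 ldirectord

This line means that when heartbeat starts on both machines, node1 first runs the lvs script, then configures the VIP 192.168.79.135/24 on eth0:0, and then starts ldirectord, which sets node1 up as the LVS director and monitors port 80. If node1 fails, heartbeat on node1 stops the resources from right to left: ldirectord first, then the VIP, and so on. node2 then takes over all of node1's services, such as the VIP and the web service.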
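
The start and stop ordering can be made explicit with a small sketch that lists the resources of one haresources line in either order. The function name resource_order is hypothetical, and the sketch assumes a single-line entry (no backslash continuation).

```shell
# resource_order MODE: read one haresources line on stdin and print its
# resources one per line, in start order (MODE=start) or stop order (MODE=stop).
# The first field is the preferred node name and is skipped.
resource_order() {
    awk -v mode="$1" '{
        if (mode == "stop") { for (i = NF; i >= 2; i--) print $i }
        else                { for (i = 2; i <= NF; i++) print $i }
    }'
}
# Usage: resource_order stop < /etc/ha.d/haresources
```

For the line above, start order begins with lvs and stop order begins with ldirectord.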

ldirectord.cf

vi /etc/ha.d/ldirectord.cf

checktimeout=10
checkinterval=8
autoreload=yes
logfile="/var/log/ldirectord.log"
#logfile="local0"
quiescent=no
virtual=192.168.79.135:80
    real=192.168.79.130:80 gate
    real=192.168.79.131:80 gate
    service=http
    request="test.html"
    receive="Test Page"
    scheduler=wrr
    persistent=30
    protocol=tcp
    checktype=negotiate
    checkport=80

When configuring this file on node2, delete the line real=192.168.79.130:80 gate: once the director role has moved to node2, the failed node1 must not be added to the virtual server's pool.

test.html

Create test.html in the web root containing the string Test Page; ldirectord requests this page to monitor the health of the web servers. Assuming the web root is /var/www/html:

echo "Test Page" > /var/www/html/test.html
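
The negotiate check can be imitated by hand to confirm that a real server would pass before ldirectord is started. The sketch below is a simplification: it looks for the expected text as a fixed string, and the helper name body_matches and the curl options in the usage comment are assumptions of this example.

```shell
# body_matches EXPECTED: succeed if the response body on stdin contains
# EXPECTED as a fixed string (no regular-expression interpretation).
body_matches() {
    grep -qF -- "$1"
}
# Hypothetical usage against one real server:
# curl -s --max-time 10 http://192.168.79.130/test.html | body_matches "Test Page" && echo healthy
```

A real server that returns an error page or an empty body fails the check.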

LVS start script

vi /etc/init.d/lvs

LVS start script on node1:

#!/bin/bash
/sbin/ipvsadm --set 10 10 10

LVS start script on node2:

#!/bin/bash
VIP=192.168.79.135
. /etc/rc.d/init.d/functions
# TCP, TCP FIN and UDP connection timeouts, in seconds
/sbin/ipvsadm --set 10 10 10
case "$1" in
start)
    # acting as director: hold the VIP on eth0:0 instead of lo:0
    /sbin/ifconfig lo:0 down
    /sbin/ifconfig eth0:0 $VIP broadcast $VIP netmask 255.255.255.255 up
    /sbin/route add -host $VIP dev eth0:0
    ;;
stop)
    # acting as real server: hold the VIP on lo:0 instead of eth0:0
    /sbin/ifconfig eth0:0 down
    /sbin/ifconfig lo:0 $VIP broadcast $VIP netmask 255.255.255.255 up
    /sbin/route add -host $VIP dev lo:0
    ;;
*)
    echo "Usage: $0 {start|stop}"
    exit 1
esac

Finally, make it executable:

chmod +x /etc/init.d/lvs

Host names and hosts file

1. Set the host name on each machine
192.168.79.130 is node1
192.168.79.131 is node2
2. Add host name resolution

vi /etc/hosts

192.168.79.130 node1
192.168.79.131 node2

Solving the ARP problem

vi /etc/sysctl.conf

net.ipv4.ip_forward = 1
net.ipv4.conf.lo.arp_ignore = 1
net.ipv4.conf.lo.arp_announce = 2
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2

Apply the kernel parameters immediately:

sysctl -p
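
Because a missing ARP setting tends to show up only as intermittent VIP problems, it can be worth checking the file itself. The sketch below reports which of the four arp_ignore/arp_announce entries listed above are absent from sysctl.conf-style text; the function name missing_arp_settings is hypothetical.

```shell
# missing_arp_settings: read sysctl.conf-style text on stdin and print every
# expected "key = value" pair that is absent; succeeds when none is missing.
missing_arp_settings() {
    awk -F= '
        BEGIN {
            want["net.ipv4.conf.lo.arp_ignore"] = "1"
            want["net.ipv4.conf.lo.arp_announce"] = "2"
            want["net.ipv4.conf.all.arp_ignore"] = "1"
            want["net.ipv4.conf.all.arp_announce"] = "2"
        }
        /^[ \t]*#/ { next }
        NF == 2 {
            key = $1; val = $2
            gsub(/[ \t]/, "", key); gsub(/[ \t]/, "", val)
            if ((key in want) && val == want[key]) delete want[key]
        }
        END {
            n = 0
            for (k in want) { print k " = " want[k]; n++ }
            exit (n > 0)
        }'
}
# Usage: missing_arp_settings < /etc/sysctl.conf
```

An empty result with exit status 0 means all four entries are present with the expected values.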

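Testing LVS

Load balancing: put different home page content on the two machines and check from different clients that different content is returned (persistent=30 keeps a given client on the same real server for 30 seconds).
High availability: stop heartbeat on node1, then run ip a on node2 to check that it has taken over the VIP.
ldirectord: ldirectord continuously monitors the configured service; when a real server stops responding, it uses ipvsadm to remove that machine from the virtual server.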

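Automatic DRBD primary/secondary switchover with heartbeat

A brief introduction to heartbeat and DRBD first.
An outage of the primary server can be very costly. Keeping the service running without interruption requires server redundancy. Among the many redundancy solutions, heartbeat offers an inexpensive, scalable high-availability cluster. Here heartbeat + DRBD are used to build a high-availability (HA) cluster on Linux.

DRBD is a block device designed for high-availability setups. It works like RAID-1 over the network: whatever is written to the local device is also sent to a second host and recorded there in the same form, so the local (primary) and remote (secondary) nodes stay in sync in real time. If the local system fails, the remote host still holds an identical copy of the data and can carry on. In an HA setup DRBD can replace a shared disk array: because the data exists on both hosts, the remote host simply continues serving from its own copy after a switchover.

Now for the deployment. Install heartbeat first with yum install heartbeat; building it from source is not recommended because it takes very long and is error-prone. Then install DRBD as described at http://devops.webres.wang/2012/02/drbd-compile-install-deploy/, with one difference: add --with-heartbeat to the ./configure command. After installation, the files drbddisk and drbdupper appear in /usr/local/drbd/etc/ha.d/resource.d. Copy both into heartbeat's resource directory, /etc/ha.d/resource.d, with cp -R /usr/local/drbd/etc/ha.d/resource.d/* /etc/ha.d/resource.d
The primary host's IP is 192.168.79.130, the standby's is 192.168.79.131, the virtual IP is 192.168.79.135, the partition replicated by DRBD is /dev/sdb1, and the mount point is /data.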

DRBD configuration

1. Create the partition /dev/sdb1 on /dev/sdb and create the directory /data.
2. Configure the global section and the resource.
drbd.conf:

vi /usr/local/drbd/etc/drbd.conf

Contents:

include "drbd.d/global_common.conf";
include "drbd.d/*.res";

global_common.conf:

vi /usr/local/drbd/etc/drbd.d/global_common.conf

Contents:

global {
    usage-count yes;
}
common {
    net {
        protocol C;
    }
}

The r0 resource:

vi /usr/local/drbd/etc/drbd.d/r0.res

Contents:

resource r0 {
    on node1 {
        device /dev/drbd1;
        disk /dev/sdb1;
        address 192.168.79.130:7789;
        meta-disk internal;
    }
    on node2 {
        device /dev/drbd1;
        disk /dev/sdb1;
        address 192.168.79.131:7789;
        meta-disk internal;
    }
}

3. Set the host name.

vi /etc/sysconfig/network

Set HOSTNAME to node1.
Edit hosts:

vi /etc/hosts

Add:

192.168.79.130 node1
192.168.79.131 node2

Apply the host name on node1 right away, without a reboot:

hostname node1

Configure node2 in the same way.
4. Set up the resource
Run the following on both node1 and node2.

modprobe drbd # load the drbd module
dd if=/dev/zero of=/dev/sdb1 bs=1M count=100 # overwrite the start of sdb1 (otherwise create-md may fail)
drbdadm create-md r0 # create the metadata for resource r0
drbdadm up r0 # bring up resource r0

5. Set the primary node
Run the following on node1 only.
Make node1 the primary node:

drbdadm primary --force r0

6. Create the file system on the DRBD device
Run the following on node1 only.
/dev/drbd1 has been initialized above; now format it as ext3.

mkfs.ext3 /dev/drbd1

Then mount /dev/drbd1 on the /data directory created earlier.

mount /dev/drbd1 /data

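heartbeat configuration

Three files need to be configured:
ha.cf: heartbeat (monitoring) settings
haresources: resource management
authkeys: authentication for the heartbeat link
1. Synchronize the clocks of the two nodes

rm -rf /etc/localtime
cp -f /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
yum install -y ntp
ntpdate cn.pool.ntp.org

2. Configure ha.cf

vi /etc/ha.d/ha.cf

debugfile /var/log/ha-debug # debug log file
keepalive 2 # send a heartbeat every 2 seconds
deadtime 10 # declare the peer dead after 10 seconds without a heartbeat
warntime 6 # log a "late heartbeat" warning after 6 seconds (ideally between 2 and 10)
initdead 120 # at startup, tolerate 120 seconds without contact; heartbeat
# waits this long after starting before it brings up any resource
udpport 694 # use UDP port 694
ucast eth0 192.168.79.131 # unicast heartbeat (each node lists the other node's IP)
node node1 # first node (the host name from uname -n, not a domain name)
node node2 # second node (the host name from uname -n, not a domain name)
auto_failback on # automatic failback to the preferred node after it recovers (the author's note advises against enabling this; use off to disable)
respawn hacluster /usr/lib/heartbeat/ipfail # run ipfail and restart it if it dies

3. Configure authkeys

vi /etc/ha.d/authkeys

Contents:

auth 1
1 crc

4. Configure haresources

vi /etc/ha.d/haresources

Contents:

node1 IPaddr::192.168.79.135/24/eth0 drbddisk::r0 Filesystem::/dev/drbd1::/data::ext3

node1: host name of the preferred (master) node
IPaddr::192.168.79.135/24/eth0: configures the virtual IP
drbddisk::r0: manages the DRBD resource r0
Filesystem::/dev/drbd1::/data::ext3: performs the mount and umount
The configuration on node2 is almost identical; the only difference is that 192.168.79.131 in ha.cf becomes 192.168.79.130.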

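Testing automatic DRBD switchover

Start heartbeat on node1 first, then on node2. Once node2 is fully up, node1 configures the virtual IP, promotes DRBD to primary, and mounts /dev/drbd1 on /data, in that order. The start command is:

service heartbeat start

Running ip a now shows an additional IP, 192.168.79.135, which is the virtual IP; cat /proc/drbd shows the Primary/Secondary state; and df -h shows /dev/drbd1 mounted on /data.
To test automatic failover, stop heartbeat on node1 or disconnect its network, and check the state on node2 a few seconds later.
Then restore heartbeat or the network connection on node1 and check its state again.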
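
Checking which node currently holds the virtual IP is the core of this test, so it can be scripted. The sketch below scans ip a output for an exact address match; the function name has_vip is an assumption of this example.

```shell
# has_vip ADDRESS: read `ip a` output on stdin and succeed if ADDRESS is
# configured on some interface (matches "inet ADDRESS/<prefix>").
has_vip() {
    awk -v vip="$1" '
        $1 == "inet" { split($2, a, "/"); if (a[1] == vip) found = 1 }
        END { exit !found }'
}
# Usage on a live node: ip a | has_vip 192.168.79.135 && echo "VIP is here"
```

Running it on both nodes before and after stopping heartbeat on node1 shows the VIP moving from one node to the other.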

heartbeat configuration files, English with Chinese translation

ha.cf

#
# There are lots of options in this file. All you have to have is a set
# of nodes listed {“node …} one of {serial, bcast, mcast, or ucast},
# and a value for “auto_failback”.
# 这文件下面有很多的选项，你必须设置的有节点列表集{node …}，{serial,bcast,mcast,或ucast}中的一个，auto_failback的值
#
# ATTENTION: As the configuration file is read line by line,
# THE ORDER OF DIRECTIVE MATTERS!
# 注意：配置文件是逐行读取的，并且选项的顺序是会影响最终结果的。
#
# In particular, make sure that the udpport, serial baud rate
# etc. are set before the heartbeat media are defined!
# debug and log file directives go into effect when they
# are encountered.
# 特别注意，确保udpport,serial baud rate等配置在心跳检测媒体（heartbeat media）前！他们将影响debug和log file指令。
# 也就是是在定义网卡，串口等心跳检测接口前先要定义端口号。
#
# All will be fine if you keep them ordered as in this example.
# 如果你保持他们在此例子中的顺序的话一切都不会有问题。
#
# Note on logging:
# If all of debugfile, logfile and logfacility are not defined,
# logging is the same as use_logd yes. In other case, they are
# respectively effective. if detering the logging to syslog,
# logfacility must be “none”.
# 记录日志方面的注意事项：
# 如果debugfile,logfile和logfacility都没有定义，日志记录就相当于use_logd yes。否则，他们将分别生效。如果要阻止记录日志到syslog，那么logfacility必须设置为“none”
#
# File to write debug messages to
# 写入debug消息的文件
#debugfile /var/log/ha-debug
#
#
# File to write other messages to
# 写入其他消息的文件
#logfile /var/log/ha-log
#
#
# Facility to use for syslog()/logger
# 用于syslog()/logger的设备
logfacility local0
#
#
# A note on specifying “how long” times below…
# 在下面指定多长时间时应该注意
# The default time unit is seconds
# 缺省的时间单位是秒
# 10 means ten seconds
# 10就代表10秒
#
# You can also specify them in milliseconds
# 1500ms means 1.5 seconds
# 你也可以指定他们以毫秒为单位
# 1500ms表示 1.5秒
#
# keepalive: how long between heartbeats?
# keepalive: 在heartbeat之间连接保持多久
#keepalive 2
#
# deadtime: how long-to-declare-host-dead?
# deadtime：
# If you set this too low you will get the problematic
# split-brain (or cluster partition) problem.
# See the FAQ for how to use warntime to tune deadtime.
# 如果这个时间值设置得太低可能会导致出现很难判断的问题，如何使用warntime来调节deadtime请查看FAQ。
#
#deadtime 30
#
# warntime: how long before issuing “late heartbeat” warning?
# See the FAQ for how to use warntime to tune deadtime.
#
#warntime 10
#
#
# Very first dead time (initdead)
#
# On some machines/OSes, etc. the network takes a while to come up
# and start working right after you’ve been rebooted. As a result
# we have a separate dead time for when things first come up.
# It should be at least twice the normal dead time.
# 在某些机器/操作系统等中，网络在机器重启后需要花一定的时间启动并正常工作。因此我们必须分开他们初次起来的dead time，这个值应该最少设置为两倍的正常dead time。
#
#initdead 120
#
#
# What UDP port to use for bcast/ucast communication?
# 用于bacst/ucast通讯的UDP端口
#
#udpport 694
#
# Baud rate for serial ports…
# 串口的波特率
#baud 19200
#
# serial serialportname …
# serial 串口名称
#serial /dev/ttyS0 # Linux
#serial /dev/cuaa0 # FreeBSD
#serial /dev/cuad0 # FreeBSD 6.x
#serial /dev/cua/a # Solaris
#
#
# What interfaces to broadcast heartbeats over?
# 广播heartbeats的接口
#
#bcast eth0 # Linux
#bcast eth1 eth2 # Linux
#bcast le0 # Solaris
#bcast le1 le2 # Solaris
#
# Set up a multicast heartbeat medium
# 设置一个多播心跳介质
# mcast [dev] [mcast group] [port] [ttl] [loop]
#
# [dev] device to send/rcv heartbeats on 发送/接收heartbeats的设备
# [mcast group] multicast group to join (class D multicast address 224.0.0.0 – 239.255.255.255) 加入到的多播组（D类多播地址224.0.0.0 – 239.255.255.255）
# [port] udp port to sendto/rcvfrom udp(set this value to the same value as “udpport” above) 端口用于发送/接收udp（设置这个值跟上面的udpport为相同值）
# [ttl] the ttl value for outbound heartbeats. this effects how far the multicast packet will propagate. (0-255) Must be greater than zero.
# 外流的heartbeats的ttl值。这个影响多播包能传播多远。（0-255）必须要大于0 。
# [loop] toggles loopback for outbound multicast heartbeats.if enabled, an outbound packet will be looped back and received by the interface it was sent # on. (0 or 1) Set this value to zero.
# 为多播heartbeat开关loopback。如果enabled，一个外流的包将被回环到原处并由发送它的接口接收。（0或者1）设置这个值为0。
#
#mcast eth0 225.0.0.1 694 1 0
#
# Set up a unicast / udp heartbeat medium
# 配置一个unicast / udp heartbeat 介质
# ucast [dev] [peer-ip-addr]
#
# [dev] device to send/rcv heartbeats on 用于发送/接收heartbeat的设备
# [peer-ip-addr] IP address of peer to send packets to 包被发送到的对等的IP地址
#
#ucast eth0 192.168.1.2
#
#
# About boolean values…
# 关于boolean值
# Any of the following case-insensitive values will work for true:
# 下面的非大小写敏感的值将认为是true：
# true, on, yes, y, 1
# Any of the following case-insensitive values will work for false:
# 下面的非大小写敏感的值将认为是false：
# false, off, no, n, 0
#
#
#
# auto_failback: determines whether a resource will
# automatically fail back to its “primary” node, or remain
# on whatever node is serving it until that node fails, or
# an administrator intervenes.
# auto_failback: 决定一个resource是否自动恢复到它的primary节点，或者不管什么节点，都继续运行在上面直到节点出现故障或管# 理员进行干预。
#
#
# The possible values for auto_failback are:
# auto_failback 的可能值有：
# on – enable automatic failbacks
# on – 允许自动failbacks
# off – disable automatic failbacks
# off – 禁止自动failbacks
# legacy – enable automatic failbacks in systems where all nodes do not yet support the auto_failback option.
# legacy – 在所有节点都还不支持auto_failback的选项中允许自动failbacks
# auto_failback “on” and “off” are backwards compatible with the old “nice_failback on” setting.
# auto_failback “on”和”off”向后兼容旧的”nice_failback on”设置。
#
# See the FAQ for information on how to convert from “legacy” to “on” without a flash cut.
# (i.e., using a “rolling upgrade” process)
# 查看FAQ获取如何从”legacy”转为到”on”并不会闪断的信息。
#
#
# The default value for auto_failback is “legacy”, which
# will issue a warning at startup. So, make sure you put
# an auto_failback directive in your ha.cf file.
# (note: auto_failback can be any boolean or “legacy”)
# 缺省的auto_failback值是“legacy”，它在启动的时候会发送一个警告。因此，确保你在ha.cf文件中配置了auto_failback指令。
#
auto_failback on
#
#
# Basic STONITH support
# Using this directive assumes that there is one stonith
# device in the cluster. Parameters to this device are
# read from a configuration file. The format of this line is:
# 基本上STONITH支持
# 使用这个指令假设有一个stonith设备在集群中。这个设备的参数从一个配置文件中读取，这行的格式是：
#
# stonith <stonith_type> <configfile>
#
# NOTE: it is up to you to maintain this file on each node in the
# cluster!
# 注意：在集群中的每个节点上的这个文件都靠你去维护。
#
#stonith baytech /etc/ha.d/conf/stonith.baytech
#
# STONITH support
# You can configure multiple stonith devices using this directive.
# 你可以使用这个指令配置多个stonith设备：
# The format of the line is:
# 这行的格式是：
# stonith_host <hostfrom> <stonith_type> <params...>
#
# <hostfrom> is the machine the stonith device is attached to or * to mean it is accessible from any host.
# <hostfrom> 表示stonith设备联结到的机器或者用*来表示从任何主机都可以访问。
# <stonith_type> is the type of stonith device (a list of supported drives is in /usr/lib/stonith.)
# <stonith_type> 是stonith设备的类型（支持的设备的列表在/usr/lib/stonith中）
# <params...> are driver specific parameters. To see the format for a particular device, run:
# <params...> 是驱动指定的参数，要查看特定设备的格式，运行：
# stonith -l -t <stonith_type>
#
#
# Note that if you put your stonith device access information in
# here, and you make this file publically readable, you’re asking
# for a denial of service attack
# 需要注意如果你将你的stonith设备的访问信息放在这里，并且你让这个文件开放读权限，那么你是在召唤一个DoS攻击。
#
# To get a list of supported stonith devices, run
# 要得到支持的stonith设备的列表，运行
# stonith -L
#
# For detailed information on which stonith devices are supported
# and their detailed configuration options, run this command:
# 要哪个stonith设备是支持的详细信息和它们详细的配置选项，运行这个命令：
# stonith -h
#
#stonith_host * baytech 10.0.0.3 mylogin mysecretpassword
#stonith_host ken3 rps10 /dev/ttyS1 kathy 0
#stonith_host kathy rps10 /dev/ttyS1 ken3 0
#
# Watchdog is the watchdog timer. If our own heart doesn’t beat for
# a minute, then our machine will reboot.
# Watchdog是一个watchdog计时器，如果我们的心超过一分钟不跳，我们的机器将会reboot。
#
# NOTE: If you are using the software watchdog, you very likely
# wish to load the module with the parameter “nowayout=0″ or
# compile it without CONFIG_WATCHDOG_NOWAYOUT set. Otherwise even
# an orderly shutdown of heartbeat will trigger a reboot, which is
# very likely NOT what you want.
# 注意：如果你使用软件watchdog，你很可能希望用参数“nowayout=0”来加载这个模块或编译它的时候去掉
# CONFIG_WATCHDOG_NOWAYOUT设置。否则，即使一个有序的关闭heartbeat也会触发重启，这很可能不是你想要的。
#
#watchdog /dev/watchdog
#
# Tell what machines are in the cluster
# 说明说明机器在这个集群里面
# node nodename … — must match uname -n
# node nodename … –必须要匹配uname -n
#node ken3
#node kathy
#
# Less common options…
# 非常用的选项
# Treats 10.10.10.254 as a psuedo-cluster-member
# Used together with ipfail below…
# note: don’t use a cluster node as ping node
# 将10.10.10.254看成一个伪集群成员，与下面的ipfail一起使用。
# 注意：不要使用一个集群节点作为ping节点
#
#ping 10.10.10.254
#
# Treats 10.10.10.254 and 10.10.10.253 as a psuedo-cluster-member
# called group1. If either 10.10.10.254 or 10.10.10.253 are up
# then group1 is up
# Used together with ipfail below…
# 将10.10.10.254和10.10.10.253看成一个叫group1的伪集群成员。如果10.10.10.254或10.10.10.253是up的，那么group1为up
# 与下面的ipfail一起使用。
#
#ping_group group1 10.10.10.254 10.10.10.253
#
# HBA ping derective for Fiber Channel
# Treats fc-card-name as psudo-cluster-member
# used with ipfail below …
# 用于Fiber Channel的HBA ping指令，将fc-card-name看成是伪集群成员，与下面的ipfail一起使用。
#
# You can obtain HBAAPI from http://hbaapi.sourceforge.net. You need
# to get the library specific to your HBA directly from the vender
# To install HBAAPI stuff, all You need to do is to compile the common
# part you obtained from the sourceforge. This will produce libHBAAPI.so
# which you need to copy to /usr/lib. You need also copy hbaapi.h to
# /usr/include.
# 你可以从http://hbaapi.sourceforge.net获取HBAAPI，你需要从vender获得用于你的HBA指令的特定的库来安装HBAAPI。
# 你所需要做的是编译你从sourceforge获得的通用部分，它会生成libHBAAPI.so，然后你要将它拷贝到/usr/lib目录。同时
# 你也要吧hbaapi.h拷贝到/usr/include 。
#
# The fc-card-name is the name obtained from the hbaapitest program
# that is part of the hbaapi package. Running hbaapitest will produce
# a verbose output. One of the first line is similar to:
# Apapter number 0 is named: qlogic-qla2200-0
# Here fc-card-name is qlogic-qla2200-0.
# fc-card-name是从hbaapitest程序获取的名字，它是hbaapi包的一部分。运行hbaapitest将生成一个冗长的输出，其中第一行类似：
# Apapter number 0 is named: qlogic-qla2200-0
# 在这里fc-card-name是qlogic-qla2200-0
#
#hbaping fc-card-name
#
#
# Processes started and stopped with heartbeat. Restarted unless
# they exit with rc=100
# 与heartbeat一起启动和停止的进程。重启，除非它们的以rc=100退出。
#
#respawn userid /path/name/to/run
#respawn hacluster /usr/lib/heartbeat/ipfail
#
# Access control for client api
# default is no access
# 用于客户端api的访问控制，缺省为不可访问。
#
#apiauth client-name gid=gidlist uid=uidlist
#apiauth ipfail gid=haclient uid=hacluster
###########################
#
# Unusual options.
# 非常选项
###########################
#
# hopfudge maximum hop count minus number of nodes in config
#hopfudge 1
#
# deadping – dead time for ping nodes 上面设置的用来ping的节点的死亡时间
#deadping 30
#
# hbgenmethod – Heartbeat generation number creation method，Normally these are stored on disk and incremented as needed.
# hbgenmethod – Heartbeat产生数字的生产方法。通常执行存储在磁盘上并在需要时进行增量。
#
#hbgenmethod time
#
# realtime – enable/disable realtime execution (high priority, etc.) defaults to on
# realtime – 允许/禁止实时执行（高优先级）缺省为on
#realtime off
#
# debug – set debug level .defaults to zero
# debug – 设置debug等级，缺省为0
#debug 1
#
# API Authentication – replaces the fifo-permissions-based system of the past
# APT认证 – 代替以前的fifo-permission-base系统
#
# You can put a uid list and/or a gid list.If you put both, then a process is authorized if it qualifies under either the uid list, or under the gid list.
# 可以放上一个uid列表和/或gid列表。如果两个都放，那么符合uid列表或gid列表中的进程都将通过验证
#
#
# The groupname “default” has special meaning. If it is specified, then
# this will be used for authorizing groupless clients, and any client groups
# not otherwise specified.
# 组名“default”有特定的意思。如果它被指定，那么它将用于验证无组的客户端和任何没有另外指定的客户组
#
# There is a subtle exception to this. “default” will never be used in the
# following cases (actual default auth directives noted in brackets)
# 这是一个复杂的表达式，“default”将从不用于下面的情况（现实中缺省的验证指令记录在括号中）
# ipfail (uid=HA_CCMUSER)
# ccm (uid=HA_CCMUSER)
# ping (gid=HA_APIGROUP)
# cl_status (gid=HA_APIGROUP)
#
# This is done to avoid creating a gaping security hole and matches the most likely desired configuration.
# 它避免生成一个安全漏洞缺口并匹配到了可能很多人最渴望的配置。
#
#apiauth ipfail uid=hacluster
#apiauth ccm uid=hacluster
#apiauth cms uid=hacluster
#apiauth ping gid=haclient uid=alanr,root
#apiauth default gid=haclient
# message format in the wire, it can be classic or netstring,
# default: classic
# 网线中的信息格式，可以是classic或netstring
#
#msgfmt classic/netstring
#
# Do we use logging daemon?
# If logging daemon is used, logfile/debugfile/logfacility in this file
# are not meaningful any longer. You should check the config file for logging
# daemon (the default is /etc/logd.cf)
# more infomartion can be fould in http://www.linux-ha.org/ha_2ecf_2fUseLogdDirective
# Setting use_logd to “yes” is recommended
# 我们是否使用记录监控？
# 如果使用了记录监控，此文件里面的logfile/debugfile/logfacility将不再有意义。你应该检查在配置文件中是否有记录监控（缺省为/etc/logd.cf）
# 更多的信息可以在http://www.linux-ha.org/ha_2ecf_2fUseLogdDirective中找到。推荐配置use_logd为yes。
#
# use_logd yes/no
#
# the interval we reconnect to logging daemon if the previous connection failed
# default: 60 seconds
# 如果前一个连接失败了，我们再次连接到记录监控器的间隔。
#conn_logd_time 60
#
#
# Configure compression module
# It could be zlib or bz2, depending on whether u have the corresponding
# library in the system.
# 配置压缩模块
# 它可以为zlib或bz2，基于我们的系统中是否有相应的库。
#
#compression bz2
#
# Confiugre compression threshold
# This value determines the threshold to compress a message,
# e.g. if the threshold is 1, then any message with size greater than 1 KB
# will be compressed, the default is 2 (KB)
# 配置压缩的限度
# 这个值决定压缩一个信息的限度，例如：如果限度为1，那么任何大于1KB的消息都会被压缩，缺省为2（KB）
#compression_threshold 2

haresources

#
# This is a list of resources that move from machine to machine as
# nodes go down and come up in the cluster. Do not include
# "administrative" or fixed IP addresses in this file.
#
#
# The haresources files MUST BE IDENTICAL on all nodes of the cluster.
#
# The node name listed in front of the resource group information
# is the name of the preferred node to run the service. It is
# not necessarily the name of the current machine. If you are running
# auto_failback ON (or legacy), then these services will be started
# up on the preferred nodes - any time they're up.
#
# If you are running with auto_failback OFF, then the node information
# will be used in the case of a simultaneous start-up, or when using
# the hb_standby {foreign,local} command.
#
# BUT FOR ALL OF THESE CASES, the haresources files MUST BE IDENTICAL.
# If your files are different then almost certainly something
# won't work right.
#
#
#
# We refer to this file when we're coming up, and when a machine is being
# taken over after going down.
#
# You need to make this right for your installation, then install it in
# /etc/ha.d
#
# Each logical line in the file constitutes a "resource group".
# A resource group is a list of resources which move together from
# one node to another - in the order listed. It is assumed that there
# is no relationship between different resource groups. The
# resources in a resource group are started left-to-right, and stopped
# right-to-left. Long lists of resources can be continued from line
# to line by ending the lines with backslashes ("\").
#
# These resources in this file are either IP addresses, or the name
# of scripts to run to "start" or "stop" the given resource.
#
# The format is like this:
#
#node-name resource1 resource2 ... resourceN
#
#
# If the resource name contains an :: in the middle of it, the
# part after the :: is passed to the resource script as an argument.
# Multiple arguments are separated by the :: delimiter
#
# In the case of IP addresses, the resource script name IPaddr is implied.
#
# For example, the IP address 135.9.8.7 could also be represented
# as IPaddr::135.9.8.7
#
# THIS IS IMPORTANT!! vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
#
# The given IP address is directed to an interface which has a route
# to the given address. This means you have to have a net route
# set up outside of the High-Availability structure. We don't set it
# up here -- we key off of it.
#
# The broadcast address for the IP alias that is created to support
# an IP address defaults to the highest address on the subnet.
#
# The netmask for the IP alias that is created defaults to the same
# netmask as the route that it selected in the step above.
#
# The base interface for the IP alias that is created defaults to the
# same interface as the route that it selected in the step above.
#
# If you want to specify that this IP address is to be brought up
# on a subnet with a netmask of 255.255.255.0, you would specify
# this as IPaddr::135.9.8.7/24 .
#
# If you wished to tell it that the broadcast address for this subnet
# was 135.9.8.210, then you would specify that this way:
# IPaddr::135.9.8.7/24/135.9.8.210
#
# If you wished to tell it that the interface to add the address to
# is eth0, then you would need to specify it this way:
# IPaddr::135.9.8.7/24/eth0
#
# And this way to specify both the broadcast address and the
# interface:
# IPaddr::135.9.8.7/24/eth0/135.9.8.210
#
# The IP addresses you list in this file are called "service" addresses,
# since they're the publicly advertised addresses that clients
# use to get at highly available services.
#
# For a hot/standby (non load-sharing) 2-node system with only a single service address,
# you will probably only put one system name and one IP address in here.
# The name you give the address to is the name of the default "hot"
# system.
#
# Where the nodename is the name of the node which "normally" owns the
# resource. If this machine is up, it will always have the resource
# it is shown as owning.
#
# The string you put in for nodename must match the uname -n name
# of your machine. Depending on how you have it administered, it could
# be a short name or a FQDN.
#
#-------------------------------------------------------------------
#
# Simple case: One service address, default subnet and netmask
# No servers that go up and down with the IP address
#
#just.linux-ha.org 135.9.216.110
#
#-------------------------------------------------------------------
#
# Assuming the administrative addresses are on the same subnet...
# A little more complex case: One service address, default subnet
# and netmask, and you want to start and stop http when you get
# the IP address...
#
#just.linux-ha.org 135.9.216.110 http
#-------------------------------------------------------------------
#
# A little more complex case: Three service addresses, default subnet
# and netmask, and you want to start and stop http when you get
# the IP address...
#
#just.linux-ha.org 135.9.216.110 135.9.215.111 135.9.216.112 httpd
#-------------------------------------------------------------------
#
# One service address, with the subnet, interface and bcast addr
# explicitly defined.
#
#just.linux-ha.org 135.9.216.3/28/eth0/135.9.216.12 httpd
#
#-------------------------------------------------------------------
#
# An example where a shared filesystem is to be used.
# Note that multiple arguments are passed to this script using
# the delimiter '::' to separate each argument.
#
#node1 10.0.0.170 Filesystem::/dev/sda1::/data1::ext2
#
# Regarding the node-names in this file:
#
# They must match the names of the nodes listed in ha.cf, which in turn
# must match the `uname -n` of some node in the cluster. So they aren't
# virtual in any sense of the word.
#
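
The comments above describe how a resource entry is read: text before the first `::` names the script, the remaining `::`-separated parts are its arguments, and a bare IP address implies the IPaddr script. As a rough illustration only, the shell sketch below mimics that reading. `parse_resource` is a hypothetical helper written for this article, not something Heartbeat ships, and its IP detection is deliberately simplistic.

```shell
#!/bin/sh
# Sketch only: mimics how a haresources resource token is read.
# parse_resource is a hypothetical helper, not part of Heartbeat.
parse_resource() {
    token=$1
    case $token in
        *::*)
            # Text before the first "::" is the script; the rest are arguments.
            script=${token%%::*}
            args=${token#*::}
            ;;
        [0-9]*.[0-9]*.[0-9]*.[0-9]*)
            # A bare IP address (optionally with /mask/iface/bcast)
            # implies the IPaddr script.
            script=IPaddr
            args=$token
            ;;
        *)
            # Anything else is a script name without arguments.
            script=$token
            args=
            ;;
    esac
    # Print the script name followed by its arguments, space separated.
    printf '%s' "$script"
    if [ -n "$args" ]; then
        printf ' %s' "$(printf '%s' "$args" | sed 's/::/ /g')"
    fi
    printf '\n'
}

parse_resource 135.9.8.7
parse_resource IPaddr::135.9.8.7/24/eth0/135.9.8.210
parse_resource Filesystem::/dev/sda1::/data1::ext2
parse_resource httpd
```

Each call prints the script name followed by the arguments it would receive, which makes it easy to see that `135.9.8.7` and `IPaddr::135.9.8.7` describe the same resource.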

authkeys

#
# Authentication file. Must be mode 600
#
# Must have exactly one auth directive at the front.
# auth <method-id>    send authentication using this method-id
#
# Then, list the method and key that go with that method-id
#
# Available methods: crc, sha1, md5. Crc doesn't need/want a key.
#
# You normally only have one authentication method-id listed in this file
#
# Put more than one to make a smooth transition when changing auth
# methods and/or keys.
#
#
# sha1 is believed to be the "best", md5 next best.
#
# crc adds no security, except from packet corruption.
# Use only on physically secure networks.
#
#auth 1
#1 crc
#2 sha1 HI!
#3 md5 Hello!
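Reposted from: "HA configuration files in Chinese and English: ha.cf"
"HA configuration files in Chinese and English: haresources"
"HA configuration files in Chinese and English: authkeys"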

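Heartbeat configuration files ha.cf, haresources, and authkeys explained

After installing Heartbeat and before starting it, you must configure three files (create them by hand if they are missing): ha.cf, haresources, and authkeys. All three belong in the /etc/ha.d directory, but none of them exists there by default. You can download them from the official site, or copy them from the doc subdirectory of the source tree.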

1 Configuring ha.cf

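The first file is ha.cf, located in the /etc/ha.d directory created during installation. It tells Heartbeat which media paths to use and how to configure them. The ha.cf in the source tree lists every available option; the main ones are described below:
serial /dev/ttyS0
Use a serial heartbeat. If you do not use a serial heartbeat, you must use another medium, such as a bcast (Ethernet) heartbeat. Replace /dev/ttyS0 with the appropriate device file.
watchdog /dev/watchdog
Optional. The watchdog function lets a system that is still minimally functional but no longer providing a heartbeat reboot itself after about one minute in that abnormal state. This helps avoid the case where a machine recovers its heartbeat after it has already been declared dead. If that happened and a disk mount had failed over, two nodes could end up mounting the same disk at once. To use this feature you need, besides this line, to load the "softdog" kernel module and create the matching device file. Load the module with "insmod softdog". Then run "grep misc /proc/devices" and note the number it prints (it should be 10). Next run "cat /proc/misc | grep watchdog" and note that number (it should be 130). With these two numbers you can create the device file: "mknod /dev/watchdog c 10 130".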
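bcast eth1
Use a broadcast heartbeat on the eth1 interface (replace eth1 with eth0, eth2, or whichever interface you use).
keepalive 2
Sets the interval between heartbeats to 2 seconds.
warntime 10
Time in seconds to wait before logging a "late heartbeat" warning.
deadtime 30
Declares a node dead after 30 seconds.
initdead 120
With some configurations the network needs some time after a reboot before it works properly. This separate "deadtime" handles that case. It should be at least twice the normal deadtime.
baud 19200
Baud rate: the speed of serial communication.
udpport 694
Use port 694 for bcast and ucast communication. This is the default, and it is the port number officially registered with IANA.
auto_failback on
Required. For those familiar with Tru64 Unix, Heartbeat behaves much like "favored member" mode. Before a failover, the primary node listed in haresources holds all resources; afterwards the secondary node takes them over. With auto_failback set to on, the primary takes all resources back from the secondary as soon as it comes back online. With auto_failback set to off, the primary does not reclaim the resources. This option is similar to the obsolete nice_failback option. If you upgrade from a cluster that had nice_failback set to off to this or a later version, take special care to avoid a flash cut. See the FAQ section on how to handle this case.
node primary.mydomain.com
Required. The host name of a machine in the cluster, identical to the output of "uname -n".
node backup.mydomain.com
Required. Same as above.
respawn
Optional. Lists a command to run and monitor. For example, to run the ccm daemon, add the following:
respawn hacluster /usr/lib/heartbeat/ccm
This makes Heartbeat run the process as the given userid (hacluster in this example) and monitor it, restarting it if it dies. For ipfail, the line would be:
respawn hacluster /usr/lib/heartbeat/ipfail
Note: if the process exits with exit code 100, it is not restarted.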
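
Two of the rules stated above lend themselves to a mechanical check: initdead should be at least twice deadtime, and the auto_failback and node options are required. The sketch below writes a sample ha.cf (using the values from this section) to a temporary file and checks those rules. `get_opt` and `check_ha_cf` are hypothetical helpers made up for this illustration; Heartbeat does not provide them, and the check covers only the rules mentioned in the text.

```shell
#!/bin/sh
# Sketch: write a sample ha.cf and check the rules stated in the text.
# The temp file, get_opt and check_ha_cf are illustrative, not Heartbeat tools.
conf=$(mktemp)
cat > "$conf" <<'EOF'
bcast eth1
keepalive 2
warntime 10
deadtime 30
initdead 120
udpport 694
auto_failback on
node primary.mydomain.com
node backup.mydomain.com
EOF

# Print the value of the first occurrence of a directive.
get_opt() {
    awk -v key="$1" '$1 == key { print $2; exit }' "$2"
}

check_ha_cf() {
    file=$1
    deadtime=$(get_opt deadtime "$file")
    initdead=$(get_opt initdead "$file")
    nodes=$(grep -c '^node[[:space:]]' "$file" || true)
    if [ -z "$(get_opt auto_failback "$file")" ]; then
        echo "missing auto_failback"; return 1
    fi
    if [ "$nodes" -lt 2 ]; then
        echo "need two node lines"; return 1
    fi
    if [ -z "$deadtime" ] || [ -z "$initdead" ]; then
        echo "missing deadtime or initdead"; return 1
    fi
    # initdead should be at least twice the normal deadtime.
    if [ "$initdead" -lt $((deadtime * 2)) ]; then
        echo "initdead should be at least 2 x deadtime"; return 1
    fi
    echo "ok"
}

check_ha_cf "$conf"
```

Pointing `check_ha_cf` at a real /etc/ha.d/ha.cf should work the same way, provided the directives are written one per line as shown here.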

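2 Configuring haresources

Once ha.cf is configured, the next file is haresources. It lists the services the cluster provides and the default owner of each service. Note: this file must be identical on both cluster nodes. The cluster IP address is mandatory and must not be configured anywhere outside the haresources file. The haresources file specifies the primary node of the two-node system, the cluster IP, the netmask, the broadcast address, and the services to start. An entry has the following format:
node-name network-config resource-group
node-name names the primary node of the two-node system. It must match one of the host names set by the node options in ha.cf; the other host name set there becomes the secondary node. network-config holds the network settings: cluster IP, netmask, broadcast address, and so on. resource-group lists the services Heartbeat starts, which the two-node system ultimately serves through the cluster IP. In this article we assume the HA services to configure are Apache and Samba.

The haresources file then needs the following line:

primary.mydomain.com 192.168.85.3 httpd smb

This line specifies that at startup the node primary.mydomain.com takes the IP address 192.168.85.3 and then starts Apache and Samba. On shutdown, Heartbeat first stops smb, then Apache, and finally releases the IP address 192.168.85.3. This assumes that "uname -n" prints "primary.mydomain.com"; if it prints "primary", use "primary" instead.

Once haresources is correct, copy ha.cf and haresources to the /etc/ha.d directory.
Note: every command named in the resource file must be available in /etc/ha.d/resource.d/.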
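
To make the start and stop order described for the sample haresources line concrete, here is a small shell sketch that prints both orders for a given line. `resource_order` is a hypothetical helper for illustration only; it simply drops the node name and lists the remaining resources left-to-right for start and right-to-left for stop.

```shell
#!/bin/sh
# Sketch: show the order in which the resources on one haresources line
# are started and stopped. resource_order is a hypothetical helper.
resource_order() {
    action=$1
    shift            # drop the action
    shift            # drop the node name, leaving only the resources
    if [ "$action" = start ]; then
        # start: left-to-right, exactly as written
        echo "$*"
        return
    fi
    # stop: walk the list and prepend each entry, giving right-to-left
    reversed=
    for res in "$@"; do
        reversed="$res${reversed:+ $reversed}"
    done
    echo "$reversed"
}

line="primary.mydomain.com 192.168.85.3 httpd smb"
# Word splitting of $line is intentional here.
echo "start: $(resource_order start $line)"
echo "stop:  $(resource_order stop $line)"
```

The stop order shows why the IP address is released last: clients keep a reachable address until the services behind it have shut down.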

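3 Configuring authkeys

The third file, authkeys, determines your authentication key. There are three authentication methods: crc, md5, and sha1. You may ask which one to use. In short: if Heartbeat runs over a secure network, such as the crossover cable in this example, you can use crc, which is the cheapest method in terms of resources. If the network is not secure but you still want to keep CPU usage down, use md5. Finally, if you want the strongest authentication regardless of CPU usage, use sha1, which is the hardest of the three to crack.

The file format is:

auth <number>
<number> <authmethod> [<authkey>]

So for sha1, a sample /etc/ha.d/authkeys might be:

auth 1
1 sha1 key-for-sha1-any-text-you-want

For md5, just replace sha1 with md5 in the example above. For crc, configure it as follows:

auth 2
2 crc

Whatever index you put after the auth keyword must appear again as the key of a later line. If you specify "auth 4", there must be a later line of the form "4 <authmethod> [<authkey>]".

Make sure the file's permissions are safe, such as 600.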
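
As a minimal sketch of the rules above, the following shell snippet generates an authkeys file with a random sha1 key, sets mode 600, and checks that the index after auth reappears at the start of a key line. It writes to a temporary directory rather than /etc/ha.d, and `make_authkeys` and `check_authkeys` are hypothetical helper names chosen for this example. Taking the key from /dev/urandom is one convenient choice; any text works as a key.

```shell
#!/bin/sh
# Sketch: generate an authkeys file with a random sha1 key and mode 600.
# It writes to a temp directory; the real file lives in /etc/ha.d/authkeys.
make_authkeys() {
    dest=$1
    index=$2
    # 20 random bytes printed as 40 hex characters.
    key=$(od -An -N20 -tx1 /dev/urandom | tr -d ' \n')
    # Create the file with a restrictive umask, then pin the mode to 600.
    ( umask 077; printf 'auth %s\n%s sha1 %s\n' "$index" "$index" "$key" > "$dest" )
    chmod 600 "$dest"
}

# The index after "auth" must reappear at the start of a key line.
check_authkeys() {
    file=$1
    index=$(awk '$1 == "auth" { print $2; exit }' "$file")
    awk -v idx="$index" '$1 == idx && NF >= 2 { found = 1 } END { exit !found }' "$file"
}

dir=$(mktemp -d)
make_authkeys "$dir/authkeys" 1
check_authkeys "$dir/authkeys" && echo "authkeys is consistent"
```

The same file must be copied to both nodes, since the nodes can only authenticate each other's heartbeats with a shared key.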
Reposted from: http://blog.csdn.net/ndcs_dhf2008/article/details/5570219

Compiling and installing Heartbeat on CentOS

Installing Cluster Glue

Cluster Glue must be installed before Heartbeat.

yum install autoconf automake libtool glib2-devel libxml2-devel bzip2-devel e2fsprogs-devel libxslt-devel
groupadd haclient
useradd -g haclient hacluster
cd /tmp
wget http://hg.linux-ha.org/glue/archive/glue-1.0.9.tar.bz2
tar xjf glue-1.0.9.tar.bz2
cd Reusable-Cluster-Components-glue--glue-1.0.9
./autogen.sh
./configure --prefix=/usr/local/heartbeat
make && make install

Installing Resource Agents

cd /tmp
wget --no-check-certificate https://github.com/ClusterLabs/resource-agents/tarball/v3.9.2
tar xzf v3.9.2
cd ClusterLabs-resource-agents-b735277/
./autogen.sh
export CFLAGS="$CFLAGS -I/usr/local/heartbeat/include -L/usr/local/heartbeat/lib"
./configure --prefix=/usr/local/heartbeat
ln -s /usr/local/heartbeat/lib/* /lib/
make && make install

Installing Heartbeat

cd /tmp
wget http://hg.linux-ha.org/heartbeat-STABLE_3_0/archive/7e3a82377fa8.tar.bz2
tar xjf 7e3a82377fa8.tar.bz2
cd Heartbeat-3-0-7e3a82377fa8/
./bootstrap
export CFLAGS="$CFLAGS -I/usr/local/heartbeat/include -L/usr/local/heartbeat/lib"
./configure --prefix=/usr/local/heartbeat
make && make install
cp doc/ha.cf /usr/local/heartbeat/etc/ha.d/
cp doc/haresources /usr/local/heartbeat/etc/ha.d/
cp doc/authkeys /usr/local/heartbeat/etc/ha.d/
cp heartbeat/init.d/heartbeat /etc/rc.d/init.d/
chkconfig --add heartbeat
chkconfig heartbeat on
chmod 600 /usr/local/heartbeat/etc/ha.d/authkeys
sed -i 's#/usr/lib/ocf#/usr/local/heartbeat/usr/lib/ocf#g' /usr/local/heartbeat/etc/ha.d/shellfuncs
sed -i 's#/usr/lib/ocf#/usr/local/heartbeat/usr/lib/ocf#g' /usr/local/heartbeat/usr/lib/ocf/lib//heartbeat/ocf-shellfuncs
sed -i 's#/usr/lib/ocf#/usr/local/heartbeat/usr/lib/ocf#g' /usr/local/heartbeat/etc/ha.d/resource.d//hto-mapfuncs
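
One thing to watch with the sed rewrites above: the replacement path still contains /usr/lib/ocf, so running the same command a second time would prefix the path again. The sketch below shows one possible guard, applied to a throwaway sample file. `rewrite_ocf_path` is a hypothetical helper, and its "already rewritten" test is a simplification that assumes a file is either fully rewritten or not at all.

```shell
#!/bin/sh
# Sketch: apply the OCF path rewrite only once. Because the new path
# still contains "/usr/lib/ocf", a second plain sed run would prefix
# the path again. rewrite_ocf_path is a hypothetical helper.
new_root=/usr/local/heartbeat/usr/lib/ocf

rewrite_ocf_path() {
    file=$1
    # Simplification: treat any occurrence of the new root as "done".
    if grep -qF "$new_root" "$file"; then
        echo "already rewritten"
        return 0
    fi
    # The article uses GNU "sed -i"; a temp file keeps this portable.
    sed "s#/usr/lib/ocf#$new_root#g" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    echo "rewritten"
}

sample=$(mktemp)
echo 'OCF_ROOT=/usr/lib/ocf' > "$sample"
rewrite_ocf_path "$sample"
rewrite_ocf_path "$sample"
cat "$sample"
```

The second call leaves the sample file untouched, so the path is prefixed exactly once however often the step is repeated.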

Troubleshooting

1. Error: glue_config.h:99:1: error: "HA_HBCONF_DIR" redefined
Fix: http://devops.webres.wang/2012/02/glue_config-h991-error-ha_hbconf_dir-redefined/
2. Error: configure.ac:9: error: Autoconf version 2.63 or higher is required
Fix: http://devops.webres.wang/2012/03/configure-ac9-error-autoconf-version-2-63-or-higher-is-required/
3. Error: configure.ac:63: require Automake 1.10.1, but have 1.9.6
Fix: http://devops.webres.wang/2012/03/configure-ac63-require-automake-1-10-1-but-have-1-9-6/
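
The last two errors are version mismatches, so it can help to compare versions before running autogen.sh. The sketch below is one way to compare dotted version numbers in plain shell and awk. `version_ge` is a hypothetical helper, and the version strings passed to it are simply the ones quoted in the error messages above.

```shell
#!/bin/sh
# Sketch: compare dotted version numbers so build prerequisites can be
# checked before running autogen.sh. version_ge is a hypothetical helper.
version_ge() {
    # Succeeds when $1 >= $2, comparing field by field numerically.
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            xi = x[i] + 0; yi = y[i] + 0   # missing fields count as 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

version_ge 1.9.6 1.10.1 || echo "automake 1.9.6 is too old (need 1.10.1)"
version_ge 2.63 2.63 && echo "autoconf 2.63 is new enough"
```

On a real build host the first argument could come from something like `autoconf --version | head -n 1 | awk '{print $NF}'`, assuming the usual GNU version banner. A plain string comparison would get 1.9.6 versus 1.10.1 wrong, which is why the fields are compared numerically.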