Starting etcd with docker-compose

Almost every guide online starts the etcd container with docker run, but in production a docker-compose file is much easier to maintain. After some trial and error, the following docker-compose recipe for etcd works.

1. Starting etcd with docker run

# set HostIP
export HostIP=192.168.1.102
# pull and start etcd
docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380:2380 -p 2379:2379 \
 --restart=always \
 --name etcd registry.cn-hangzhou.aliyuncs.com/coreos_etcd/etcd:v3 \
 /usr/local/bin/etcd \
 -name etcd0 \
 -advertise-client-urls http://${HostIP}:2379,http://${HostIP}:4001 \
 -listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
 -initial-advertise-peer-urls http://${HostIP}:2380 \
 -listen-peer-urls http://0.0.0.0:2380 \
 -initial-cluster-token etcd-cluster-1 \
 -initial-cluster etcd0=http://${HostIP}:2380 \
 -initial-cluster-state new
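
Once the container is up, a quick sanity check from the host (a sketch: the /version endpoint is standard etcd, and the member list assumes the image bundles etcdctl, as official etcd images do):

curl http://${HostIP}:2379/version
docker exec etcd etcdctl member list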

2. The equivalent docker-compose service definition

etcd:
    container_name: etcd0
    image: registry.cn-hangzhou.aliyuncs.com/coreos_etcd/etcd:v3
    ports:
      - "2379:2379"
      - "4001:4001"
      - "2380:2380"
    environment:
      - TZ=CST-8
      - LANG=zh_CN.UTF-8
    command: 
      /usr/local/bin/etcd
      -name etcd0 
      -data-dir /etcd-data
      -advertise-client-urls http://${host_ip}:2379,http://${host_ip}:4001
      -listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001
      -initial-advertise-peer-urls http://${host_ip}:2380
      -listen-peer-urls http://0.0.0.0:2380 
      -initial-cluster-token docker-etcd 
      -initial-cluster etcd0=http://${host_ip}:2380
      -initial-cluster-state new 
    volumes:
      - "/data/conf/etcd/data:/etcd-data"
      # - "/data/config/etcd/ca-certificates/:/etc/ssl/certs"
    labels:
      - project.source=
      - project.extra=public-image
      - project.depends=
      - project.owner=LHZ
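
Bring the service up and check it the same way; note that host_ip must be exported in the shell or defined in an .env file next to the compose file, otherwise compose substitutes an empty string:

docker-compose up -d etcd
curl http://${host_ip}:2379/version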

Handling etcd cluster failures

1. Installing etcd

rpm -ivh etcd-3.2.15-1.el7.x86_64.rpm
systemctl daemon-reload
systemctl enable etcd
systemctl start etcd
export ETCDCTL_API=3
systemctl status etcd

The /etc/hosts entries:

192.168.0.100 etcd01
192.168.0.101 etcd02
192.168.0.102 etcd03

2. etcd configuration

The etcd02 configuration is shown below; for details see the kubernetes 1.9 cluster configuration guide.

# egrep -v "^$|^#" /etc/etcd/etcd.conf 
ETCD_DATA_DIR="/var/lib/etcd/"
ETCD_LISTEN_PEER_URLS="https://192.168.0.101:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.0.101:2379,http://127.0.0.1:2379"
ETCD_NAME="etcd02"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.0.101:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.0.101:2379"
ETCD_INITIAL_CLUSTER="etcd01=https://192.168.0.100:2380,etcd02=https://192.168.0.101:2380,etcd03=https://192.168.0.102:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="existing"
ETCD_CERT_FILE="/etc/kubernetes/ssl/etcd.pem"
ETCD_KEY_FILE="/etc/kubernetes/ssl/etcd-key.pem"
ETCD_CLIENT_CERT_AUTH="true"
ETCD_TRUSTED_CA_FILE="/etc/kubernetes/ssl/ca.pem"
ETCD_AUTO_TLS="true"
ETCD_PEER_CERT_FILE="/etc/kubernetes/ssl/etcd.pem"
ETCD_PEER_KEY_FILE="/etc/kubernetes/ssl/etcd-key.pem"
ETCD_PEER_CLIENT_CERT_AUTH="true"
ETCD_PEER_TRUSTED_CA_FILE="/etc/kubernetes/ssl/ca.pem"
ETCD_PEER_AUTO_TLS="true"
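
With all three nodes configured and running, cluster health can be checked from any member (a sketch reusing the cert paths from the config above):

# etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/kubernetes/ssl/etcd.pem --key=/etc/kubernetes/ssl/etcd-key.pem --endpoints="https://192.168.0.100:2379,https://192.168.0.101:2379,https://192.168.0.102:2379" endpoint health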

3. The failure

With a 3-node cluster, after a hard power-off, etcd02 failed to start and reported:

etcd: advertise client URLs = https://192.168.0.101:2379
etcd: read wal error (wal: crc mismatch) and cannot be repaired
systemd: etcd.service: main process exited, code=exited, status=1/FAILURE

The WAL failed its CRC check. Googling turned up nothing useful, so the plan became: remove this member from the cluster, then bring it back.
On a healthy etcd node, remove the failed member:

# etcdctl member list
1ce6d6d01109192, started, etcd03, https://192.168.0.102:2380, https://192.168.0.102:2379
9b534175b46ea789, started, etcd01, https://192.168.0.100:2380, https://192.168.0.100:2379
ac2f188e97f50eb7, started, etcd02, https://192.168.0.101:2380, https://192.168.0.101:2379
# etcdctl member remove ac2f188e97f50eb7
Member ac2f188e97f50eb7 removed from cluster 194cd14a48430083

Then start the etcd service on the failed node:

# systemctl start etcd

It failed with:

etcd: error validating peerURLs {ClusterID:194cd14a48430083 Members:[&{ID:1ce6d6d01109192 RaftAttributes:{PeerURLs:[https://192.168.0.102:2380]} Attributes:{Name:etcd03 ClientURLs:[https://192.168.0.102:2379]}} &{ID:9b534175b46ea789 RaftAttributes:{PeerURLs:[https://192.168.0.100:2380]} Attributes:{Name:etcd01 ClientURLs:[https://192.168.0.100:2379]}}] RemovedMemberIDs:[]}: member count is unequal

followed by:

etcd: the member has been permanently removed from the cluster the data-dir used by this member must be removed

4. Recovering the etcd data

Try restoring the data on the etcd02 node:

# mv /var/lib/etcd/member /var/lib/member
# rm -rf /var/lib/etcd/*
# etcdctl snapshot restore /var/lib/member/snap/db --skip-hash-check=true
2018-06-22 11:28:35.622666 I | mvcc: restore compact to 10177401
2018-06-22 11:28:35.659626 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
# systemctl start etcd

The service did start, but it elected itself leader of a brand-new single-member cluster, and joining the original cluster still failed. Next, take a backup on a healthy node and restore from that:

# etcdctl snapshot save etcdback.db
# etcdctl member add etcd02 http://192.168.0.101:2380
Error: member name not provided.

Check the remaining two etcd members in the cluster:

# curl -k --key /etc/kubernetes/ssl/etcd-key.pem --cert /etc/kubernetes/ssl/etcd.pem https://192.168.0.100:2380/members
[{"id":130161754177048978,"peerURLs":["https://192.168.0.102:2380"],"name":"etcd03","clientURLs":["https://192.168.0.102:2379"]},{"id":11192361472739944329,"peerURLs":["https://192.168.0.100:2380"],"name":"etcd01","clientURLs":["https://192.168.0.100:2379"]}]

Per the reference documentation:

etcdctl member add etcd_name --peer-urls="https://peerURLs"

Add the member again:

# etcdctl member add etcd02 --peer-urls="https://192.168.0.101:2380"
Member 41c2a7b938a5e387 added to cluster 194cd14a48430083

ETCD_NAME="etcd02"
ETCD_INITIAL_CLUSTER="etcd03=https://192.168.0.102:2380,etcd02=https://192.168.0.101:2380,etcd01=https://192.168.0.100:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

Check the member status:

# etcdctl member list
1ce6d6d01109192, started, etcd03, https://192.168.0.102:2380, https://192.168.0.102:2379
9b534175b46ea789, started, etcd01, https://192.168.0.100:2380, https://192.168.0.100:2379
ad17c3da831c84c7, unstarted, , https://192.168.0.101:2380,

But etcd still failed with:

etcd: request cluster ID mismatch (got 194cd14a48430083 want cdf818194e3a8c32)

The steps were in the wrong order: the node must first be added to the etcd cluster, and only then should its etcd service be started. By starting the etcd service first, we had created a standalone single-node etcd, hence the cluster ID mismatch.

Rejoining the etcd node to the cluster

On the failed etcd host:

# systemctl stop etcd

On a healthy etcd host:

# etcdctl member remove ad17c3da831c84c7
# etcdctl member add etcd02 --peer-urls="https://192.168.0.101:2380"
Member 41c2a7b938a5e387 added to cluster 194cd14a48430083

ETCD_NAME="etcd02"
ETCD_INITIAL_CLUSTER="etcd03=https://192.168.0.102:2380,etcd02=https://192.168.0.101:2380,etcd01=https://192.168.0.100:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
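
Only after the member add should etcd come back up on the failed host. A consolidated sketch of the bring-up (the data-dir wipe is what the earlier "the data-dir used by this member must be removed" error demands):

# rm -rf /var/lib/etcd/*          # wipe the removed member's stale data
# vi /etc/etcd/etcd.conf          # set ETCD_INITIAL_CLUSTER_STATE="existing" as shown above
# systemctl start etcd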

Then, on the failed etcd host, check the etcd status:

# etcdctl endpoint health
127.0.0.1:2379 is healthy: successfully committed proposal: took = 24.52485ms
# etcdctl member list
1ce6d6d01109192, started, etcd03, https://192.168.0.102:2380, https://192.168.0.102:2379
41c2a7b938a5e387, started, etcd02, https://192.168.0.101:2380, https://192.168.0.101:2379
9b534175b46ea789, started, etcd01, https://192.168.0.100:2380, https://192.168.0.100:2379

At this point the etcd failure is fully repaired.

5. Common etcd commands

Check status:

# export ETCDCTL_API=3

# etcdctl endpoint status --write-out=table
+----------------+------------------+---------+---------+-----------+-----------+------------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------+------------------+---------+---------+-----------+-----------+------------+
| 127.0.0.1:2379 | 41c2a7b938a5e387 |  3.2.15 |   15 MB |      true |       317 |   11051403 |
+----------------+------------------+---------+---------+-----------+-----------+------------+

Backup and restore:

etcdctl snapshot save etcdback.db
etcdctl snapshot status etcdback.db --write-out=table
etcdctl snapshot restore etcdback.db --skip-hash-check=true
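
When rebuilding a member from a snapshot rather than rejoining it, restore accepts the member's cluster parameters up front (a sketch; these are standard etcdctl v3 restore flags, with values taken from the etcd02 config above):

etcdctl snapshot restore etcdback.db \
  --name etcd02 \
  --data-dir /var/lib/etcd \
  --initial-cluster etcd01=https://192.168.0.100:2380,etcd02=https://192.168.0.101:2380,etcd03=https://192.168.0.102:2380 \
  --initial-cluster-token etcd-cluster \
  --initial-advertise-peer-urls https://192.168.0.101:2380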

Monitoring etcd:

# curl -L http://localhost:2379/metrics
# HELP etcd_debugging_mvcc_keys_total Total number of keys.
# TYPE etcd_debugging_mvcc_keys_total gauge
etcd_debugging_mvcc_keys_total 776
# HELP etcd_debugging_mvcc_pending_events_total Total number of pending events to be sent.
# TYPE etcd_debugging_mvcc_pending_events_total gauge
etcd_debugging_mvcc_pending_events_total 0
# HELP etcd_debugging_mvcc_put_total Total number of puts seen by this member.
# TYPE etcd_debugging_mvcc_put_total counter
etcd_debugging_mvcc_put_total 9.548201e+06
# HELP etcd_debugging_mvcc_range_total Total number of ranges seen by this member.
# TYPE etcd_debugging_mvcc_range_total counter
etcd_debugging_mvcc_range_total 2.1052143e+07
# HELP etcd_debugging_mvcc_slow_watcher_total Total number of unsynced slow watchers.
# TYPE etcd_debugging_mvcc_slow_watcher_total gauge
etcd_debugging_mvcc_slow_watcher_total 0
# HELP etcd_debugging_mvcc_txn_total Total number of txns seen by this member.
# TYPE etcd_debugging_mvcc_txn_total counter
etcd_debugging_mvcc_txn_total 0
# HELP etcd_debugging_mvcc_watch_stream_total Total number of watch streams.
# TYPE etcd_debugging_mvcc_watch_stream_total gauge
etcd_debugging_mvcc_watch_stream_total 125
# HELP etcd_debugging_mvcc_watcher_total Total number of watchers.
# TYPE etcd_debugging_mvcc_watcher_total gauge
etcd_debugging_mvcc_watcher_total 125
# HELP etcd_debugging_server_lease_expired_total The total number of expired leases.
# TYPE etcd_debugging_server_lease_expired_total counter
etcd_debugging_server_lease_expired_total 3649

These metrics suit Prometheus well; a minimal scrape config:

global:
  scrape_interval: 10s
scrape_configs:
  - job_name: etcd
    static_configs:
    - targets: ['192.168.0.100:2379','192.168.0.101:2379','192.168.0.102:2379']

An illustrated walkthrough of the Raft algorithm: http://thesecretlivesofdata.com/raft/

Reading kubernetes data from etcd

# export ETCDCTL_API=3
# etcdctl get /registry/namespaces/default --prefix -w json|python -m json.tool
{
    "count": 1,
    "header": {
        "cluster_id": 1823062066148343939,
        "member_id": 11192361472739944329,
        "raft_term": 317,
        "revision": 10880816
    },
    "kvs": [
        {
            "create_revision": 6,
            "key": "L3JlZ2lzdHJ5L25hbWVzcGFjZXMvZGVmYXVsdA==",
            "mod_revision": 6,
            "value": "azhzAAoPCgJ2MRIJTmFtZXNwYWNlEl8KRQoHZGVmYXVsdBIAGgAiACokOTVlNzdjMWEtM2Q1Ny0xMWU4LTk5YzItMDA1MDU2YmU3NWEzMgA4AEIICK7qttYFEAB6ABIMCgprdWJlcm5ldGVzGggKBkFjdGl2ZRoAIgA=",
            "version": 1
        }
    ]
}
Decode the key:
# echo L3JlZ2lzdHJ5L25hbWVzcGFjZXMvZGVmYXVsdA== |base64 -d
/registry/namespaces/default
#!/bin/bash
# Get kubernetes keys from etcd
export ETCDCTL_API=3
keys=`etcdctl get /registry --prefix -w json|python -m json.tool|grep key|cut -d ":" -f2|tr -d '"'|tr -d ","`
for x in $keys;do
  echo $x|base64 -d|sort
done

The script above lists the keys of every kubernetes object stored in etcd.
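
The values are base64-encoded protobuf, so they can be inspected the same way; piping through strings strips the binary framing (illustrative; this assumes etcdctl's --print-value-only flag, present in recent v3 releases):

# etcdctl get /registry/namespaces/default --print-value-only | strings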

Fixing the etcd error: mvcc: database space exceeded

  • Problem: labeling a node in the k8s cluster failed:
[root@master1]# kubectl label node  30.4.228.20 env=prod
Error from server: etcdserver: mvcc: database space exceeded
  • Environment
etcd cluster: 30.4.228.19, 30.4.228.20, 30.4.228.22 (TLS-secured)

Root cause

  • The etcd service was started without an auto-compaction setting.
  • etcd does not compact automatically by default; compaction has to be enabled via a startup flag or run by hand (see the sketch below). On clusters with frequent writes it should be configured, otherwise space and memory are wasted and errors follow. etcd v3's default backend quota is 2GB; without compaction, once the boltdb file exceeds that limit, writes fail with "Error: etcdserver: mvcc: database space exceeded".
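
To have etcd compact on its own, it can be started with a retention window (a sketch; --auto-compaction-retention is available from etcd v3.1, and the value here is in hours):

etcd --auto-compaction-retention=1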

Remediation

1. Get the current revision:

[root@etcd1]# rev=$(/usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem \
  --cert=/etc/etcd/ssl/etcd.pem \
  --key=/etc/etcd/ssl/etcd-key.pem \
  --endpoints="https://127.0.0.1:2379" \
  endpoint status --write-out="json" \
  | egrep -o '"revision":[0-9]*' \
  | egrep -o '[0-9].*')

[root@etcd1]# echo $rev 

2. Compact the old revisions, then defragment and check alarms on each member:

[root@etcd1]# /usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --endpoints="https://30.4.228.20:2379" compact $rev

[root@etcd1]# /usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --endpoints="https://30.4.228.20:2379" defrag 

[root@etcd1]# /usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --endpoints="https://30.4.228.20:2379" alarm list 

[root@etcd1]# /usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --endpoints="https://30.4.228.19:2379" compact $rev

[root@etcd1]# /usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --endpoints="https://30.4.228.19:2379" defrag 

[root@etcd1]# /usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --endpoints="https://30.4.228.22:2379" compact $rev

[root@etcd1]# /usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --endpoints="https://30.4.228.22:2379" defrag 

[root@etcd1]# /usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --endpoints="https://30.4.228.22:2379" alarm list 

[root@etcd1]# /usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --endpoints="https://30.4.228.22:2379" alarm disarm

3. Check alarms:

[root@etcd1]# /usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --endpoints="https://30.4.228.22:2379" alarm list

[root@etcd1]#  /usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --endpoints="https://30.4.228.19:2379" alarm list 

[root@etcd1]# /usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --endpoints="https://30.4.228.20:2379" alarm list
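
Since compaction is a cluster-wide operation, a single compact call is enough; only defrag and the alarm checks need to run against each member. The per-endpoint repetition above folds into a loop (a sketch with the same cert paths):

for ep in https://30.4.228.19:2379 https://30.4.228.20:2379 https://30.4.228.22:2379; do
  /usr/local/bin/etcdctl --cacert=/etc/kubernetes/ssl/ca.pem \
    --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem \
    --endpoints="$ep" defrag
done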

Related etcd commands

1. Set the etcd quota:

# set a 16MB quota
etcd --quota-backend-bytes=$((16*1024*1024))

2. Exhaust the quota:

while [ 1 ]; do dd if=/dev/urandom bs=1024 count=1024 \
 | ETCDCTL_API=3 etcdctl put key || break; done
...
Error: rpc error: code = 8 desc = etcdserver: mvcc:database space exceeded

3. Confirm the database size exceeds the quota:

ETCDCTL_API=3 etcdctl --write-out=table endpoint status
+----------------+------------------+-----------+---------+-----------+-----------+------------+
|    ENDPOINT    |        ID        |  VERSION  | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------+------------------+-----------+---------+-----------+-----------+------------+
| 127.0.0.1:2379 | bf9071f4639c75cc | 2.3.0+git | 18 MB   | true      |         2 |       3332 |
+----------------+------------------+-----------+---------+-----------+-----------+------------+

4. Check alarms:

ETCDCTL_API=3 etcdctl alarm list

5. Compact and defragment, end to end:

1) Get the current revision of the etcd data

rev=$(ETCDCTL_API=3 etcdctl --endpoints=:2379 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]*')

2) Compact history up to that revision

ETCDCTL_API=3 etcdctl compact $rev

3) Defragment

ETCDCTL_API=3 etcdctl defrag

4) Disarm the alarm

ETCDCTL_API=3 etcdctl alarm disarm

5) Take a snapshot and inspect it

ETCDCTL_API=3 etcdctl snapshot save backup.db
ETCDCTL_API=3 etcdctl snapshot status backup.db

Adding a node to an etcd cluster

Two main steps:

  • Add the member
  • Start the new node

The original etcd cluster:

# etcdctl member list
b7124c8d88451: name=myetcd1 peerURLs=http://192.168.9.100:2380 clientURLs=http://192.168.9.100:2379 isLeader=true
235dcf74ed6248d5: name=myetcd3 peerURLs=http://192.168.9.100:2382 clientURLs=http://192.168.9.100:2399 isLeader=false
e35665335259ca10: name=myetcd2 peerURLs=http://192.168.9.100:2381 clientURLs=http://192.168.9.100:2389 isLeader=false

The new node: 192.168.9.101:2383

# etcdctl member add myetcd4  http://192.168.9.101:2383

Added member named myetcd4 with ID 2644935c1a10c721 to cluster

ETCD_NAME="myetcd4"
ETCD_INITIAL_CLUSTER="myetcd1=http://192.168.9.100:2380,myetcd3=http://192.168.9.100:2382,myetcd4=http://192.168.9.101:2383,myetcd2=http://192.168.9.100:2381"
ETCD_INITIAL_CLUSTER_STATE="existing"

Start the new node:

# etcd --name myetcd4  --listen-client-urls http://0.0.0.0:2409 --advertise-client-urls http://192.168.9.101:2409 --listen-peer-urls http://0.0.0.0:2383 --initial-advertise-peer-urls http://192.168.9.101:2383  --initial-cluster-token etcd-cluster-test --initial-cluster-state existing --initial-cluster myetcd1=http://192.168.9.100:2380,myetcd2=http://192.168.9.100:2381,myetcd3=http://192.168.9.100:2382,myetcd4=http://192.168.9.101:2383

Setting up an etcd cluster

Download and install

  • Download from https://github.com/coreos/etcd/releases/download/v3.3.2/etcd-v3.3.2-linux-amd64.tar.gz
  • tar xzvf etcd-v3.3.2-linux-amd64.tar.gz
  • cd etcd-v3.3.2-linux-amd64; cp etcd* /usr/local/bin/
  • The etcd commands are now on the PATH:
    etcd --version

Running and building a cluster

Common command demos

etcd --version
etcdctl --version

For the v3 API, prefix the environment variable:
ETCDCTL_API=3 etcdctl version

Start a server: etcd
Write and then read a key (key: key1, value: helloworld):
ETCDCTL_API=3 etcdctl --endpoints=localhost:2379 put key1 helloworld
ETCDCTL_API=3 etcdctl --endpoints=localhost:2379 get key1

Single-node startup:

etcd --name myetcd1  --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --listen-peer-urls http://0.0.0.0:2380 --initial-advertise-peer-urls http://0.0.0.0:2380  --initial-cluster my-etcd-1=http://0.0.0.0:2380

Cluster startup:

etcd --name myetcd1  --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://192.168.9.100:2379 --listen-peer-urls http://0.0.0.0:2380 --initial-advertise-peer-urls http://192.168.9.100:2380  --initial-cluster-token etcd-cluster-test --initial-cluster-state new --initial-cluster myetcd1=http://192.168.9.100:2380,myetcd2=http://192.168.9.100:2381,myetcd3=http://192.168.9.100:2382

etcd --name myetcd2  --listen-client-urls http://0.0.0.0:2389 --advertise-client-urls http://192.168.9.100:2389 --listen-peer-urls http://0.0.0.0:2381 --initial-advertise-peer-urls http://192.168.9.100:2381  --initial-cluster-token etcd-cluster-test --initial-cluster-state new --initial-cluster myetcd1=http://192.168.9.100:2380,myetcd2=http://192.168.9.100:2381,myetcd3=http://192.168.9.100:2382

etcd --name myetcd3  --listen-client-urls http://0.0.0.0:2399 --advertise-client-urls http://192.168.9.100:2399 --listen-peer-urls http://0.0.0.0:2382 --initial-advertise-peer-urls http://192.168.9.100:2382  --initial-cluster-token etcd-cluster-test --initial-cluster-state new --initial-cluster myetcd1=http://192.168.9.100:2380,myetcd2=http://192.168.9.100:2381,myetcd3=http://192.168.9.100:2382

Three peer ports (2380, 2381, 2382) simulate the cluster here; these carry member-to-member traffic, while 2379, 2389 and 2399 serve clients. The server IP is 192.168.9.100; to simulate the cluster entirely on one machine, 192.168.9.100 can be replaced with 0.0.0.0.

Flags containing advertise are announcement flags. For example, of --listen-client-urls and --advertise-client-urls, the former is the URL etcd listens on for clients, the latter is the URL clients are told to request. The ports are the same; the advertised one is usually a public IP, exposed for external use.

List the members:

etcdctl member list 

When using the cluster, specify the endpoints (default is local port 2379); data syncs across members almost instantly:

ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2389 put key1 xx
ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 get key1

Flag reference

--name etcd0: the name of this member.

--initial-advertise-peer-urls http://192.168.9.100:2380: used by the other members; they contact this member at this address, so it must be reachable from them. With static configuration, this value must also appear in --initial-cluster. The member ID is derived from --initial-cluster-token and --initial-advertise-peer-urls.

--listen-peer-urls http://0.0.0.0:2380: used on this member's side to listen for traffic from other members. An all-zeros IP listens on every interface of this member.

--listen-client-urls http://0.0.0.0:2379: used on this member's side to listen for etcd client traffic. An all-zeros IP listens on every interface of this member.

--advertise-client-urls http://192.168.9.100:2379: used by etcd clients; they contact this member at this address, so it must be reachable from the client side.

--initial-cluster-token etcd-cluster-2: distinguishes clusters from one another. If several clusters run on the same hosts, each must use a different token.

--initial-cluster myetcd0=http://192.168.9.100:2380,myetcd1=http://192.168.9.100:2381,myetcd2=http://192.168.9.100:2382: used on this member's side; it describes every node in the cluster, and this member uses it to contact the others. Member IDs are derived from --initial-cluster-token and --initial-advertise-peer-urls.

--initial-cluster-state new: whether this is a brand-new cluster, with values new and existing. With existing, the member tries to contact the other members at startup. For initial bootstrap use new (in testing, the last node also came up fine with existing, but the other nodes must not use existing). When a member recovers from a failure in a running cluster, use existing (in testing, new also happened to work).

--data-dir: the node's data directory, holding the node ID, cluster ID, initial cluster configuration and snapshot files, plus the WAL files unless -wal-dir is set; defaults to a standard directory if unspecified.

--discovery http://192.168.9.100:20003/v2/keys/discovery/78b12ad7-2c1d-40db-9416-3727baf686cb: for discovery mode; points at a key on a third-party etcd, where every member of the cluster being built registers its own address.

Detailed usage

etcd has two API versions, 2 and 3. The default is 2; we mainly use 3:

API3:

[root@guods-1 centos]# ETCDCTL_API=3 etcdctl --endpoints=localhost:2379 put key1 helloworld
OK
[root@guods-1 centos]# ETCDCTL_API=3 etcdctl --endpoints=localhost:2379 get key1
key1
helloworld

API2:

[root@guods-1 centos]# etcdctl set key2 helloworld
helloworld
[root@guods-1 centos]# etcdctl get key2
helloworld

Full command reference:

[root@guods-1 centos]# ETCDCTL_API=2 etcdctl
NAME:
   etcdctl - A simple command line client for etcd.

USAGE:
   etcdctl [global options] command [command options] [arguments...]

VERSION:
   3.3.2

COMMANDS:
     backup          backup an etcd directory
     cluster-health  check the health of the etcd cluster
     mk              make a new key with a given value
     mkdir           make a new directory
     rm              remove a key or a directory
     rmdir           removes the key if it is an empty directory or a key-value pair
     get             retrieve the value of a key
     ls              retrieve a directory
     set             set the value of a key
     setdir          create a new directory or update an existing directory TTL
     update          update an existing key with a given value
     updatedir       update an existing directory
     watch           watch a key for changes
     exec-watch      watch a key for changes and exec an executable
     member          member add, remove and list subcommands
     user            user add, grant and revoke subcommands
     role            role add, grant and revoke subcommands
     auth            overall auth controls
     help, h         Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --debug                          output cURL commands which can be used to reproduce the request
   --no-sync                        don't synchronize cluster information before sending request
   --output simple, -o simple       output response in the given format (simple, `extended` or `json`) (default: "simple")
   --discovery-srv value, -D value  domain name to query for SRV records describing cluster endpoints
   --insecure-discovery             accept insecure SRV records describing cluster endpoints
   --peers value, -C value          DEPRECATED - "--endpoints" should be used instead
   --endpoint value                 DEPRECATED - "--endpoints" should be used instead
   --endpoints value                a comma-delimited list of machine addresses in the cluster (default: "http://127.0.0.1:2379,http://127.0.0.1:4001")
   --cert-file value                identify HTTPS client using this SSL certificate file
   --key-file value                 identify HTTPS client using this SSL key file
   --ca-file value                  verify certificates of HTTPS-enabled servers using this CA bundle
   --username value, -u value       provide username[:password] and prompt if password is not supplied.
   --timeout value                  connection timeout per request (default: 2s)
   --total-timeout value            timeout for the command execution (except watch) (default: 5s)
   --help, -h                       show help
   --version, -v                    print the version
[root@guods-1 centos]# ETCDCTL_API=3 etcdctl
NAME:
    etcdctl - A simple command line client for etcd3.

USAGE:
    etcdctl

VERSION:
    3.3.2

API VERSION:
    3.3


COMMANDS:
    get         Gets the key or a range of keys
    put         Puts the given key into the store
    del         Removes the specified key or range of keys [key, range_end)
    txn         Txn processes all the requests in one transaction
    compaction      Compacts the event history in etcd
    alarm disarm        Disarms all alarms
    alarm list      Lists all alarms
    defrag          Defragments the storage of the etcd members with given endpoints
    endpoint health     Checks the healthiness of endpoints specified in `--endpoints` flag
    endpoint status     Prints out the status of endpoints specified in `--endpoints` flag
    endpoint hashkv     Prints the KV history hash for each endpoint in --endpoints
    move-leader     Transfers leadership to another etcd cluster member.
    watch           Watches events stream on keys or prefixes
    version         Prints the version of etcdctl
    lease grant     Creates leases
    lease revoke        Revokes leases
    lease timetolive    Get lease information
    lease list      List all active leases
    lease keep-alive    Keeps leases alive (renew)
    member add      Adds a member into the cluster
    member remove       Removes a member from the cluster
    member update       Updates a member in the cluster
    member list     Lists all members in the cluster
    snapshot save       Stores an etcd node backend snapshot to a given file
    snapshot restore    Restores an etcd member snapshot to an etcd directory
    snapshot status     Gets backend snapshot status of a given file
    make-mirror     Makes a mirror at the destination etcd cluster
    migrate         Migrates keys in a v2 store to a mvcc store
    lock            Acquires a named lock
    elect           Observes and participates in leader election
    auth enable     Enables authentication
    auth disable        Disables authentication
    user add        Adds a new user
    user delete     Deletes a user
    user get        Gets detailed information of a user
    user list       Lists all users
    user passwd     Changes password of user
    user grant-role     Grants a role to a user
    user revoke-role    Revokes a role from a user
    role add        Adds a new role
    role delete     Deletes a role
    role get        Gets detailed information of a role
    role list       Lists all roles
    role grant-permission   Grants a key to a role
    role revoke-permission  Revokes a key from a role
    check perf      Check the performance of the etcd cluster
    help            Help about any command

OPTIONS:
      --cacert=""               verify certificates of TLS-enabled secure servers using this CA bundle
      --cert=""                 identify secure client using this TLS certificate file
      --command-timeout=5s          timeout for short running command (excluding dial timeout)
      --debug[=false]               enable client-side debug logging
      --dial-timeout=2s             dial timeout for client connections
  -d, --discovery-srv=""            domain name to query for SRV records describing cluster endpoints
      --endpoints=[127.0.0.1:2379]      gRPC endpoints
  -h, --help[=false]                help for etcdctl
      --hex[=false]             print byte strings as hex encoded strings
      --insecure-discovery[=true]       accept insecure SRV records describing cluster endpoints
      --insecure-skip-tls-verify[=false]    skip server certificate verification
      --insecure-transport[=true]       disable transport security for client connections
      --keepalive-time=2s           keepalive time for client connections
      --keepalive-timeout=6s            keepalive timeout for client connections
      --key=""                  identify secure client using this TLS key file
      --user=""                 username[:password] for authentication (prompt if password is not supplied)
  -w, --write-out="simple"          set the output format (fields, json, protobuf, simple, table)

State storage in Raft, as implemented in etcd

Paxos may be the most famous distributed consensus algorithm, but Raft is probably the most popular one. With my limited experience, reading the paper alone did not take my understanding much further, and I had heard that impressive projects like Kubernetes, Docker Swarm and CockroachDB all use Raft. A technology proven in large-scale production is well worth studying, and etcd's Raft implementation is open source; as the saying goes, "in front of source code, there are no secrets".

[figure: a replicated state machine (RSM)]

Whether Paxos or Raft, both are devoted to maintaining an RSM (Replicated State Machine), as shown in the figure above. For an RSM, state storage is critical. In this post I analyze Raft's state storage based on etcd's implementation. Raft state persistence rests on two mechanisms: snapshots and the WAL (write-ahead log).

  • Like many databases, etcd uses a WAL to keep data safe across crashes and power loss. Every transactional operation in etcd (i.e. every write) is first written to the transaction file; that file is the WAL.

  • Beyond that, as a highly available KV store, etcd cannot rely on log replay alone for recovery, so it also provides snapshots. A snapshot periodically saves the whole database into a single file; this both shortens log replay time and caps WAL growth, since older WAL segments can then be deleted.

etcd uses protobuf to define its serialization formats, snapshots and log entries included. Part of the raft/raft.proto file:

enum EntryType {
    EntryNormal     = 0;
    EntryConfChange = 1;
}

message Entry {
    optional uint64     Term  = 2 [(gogoproto.nullable) = false]; // must be 64-bit aligned for atomic operations
    optional uint64     Index = 3 [(gogoproto.nullable) = false]; // must be 64-bit aligned for atomic operations
    optional EntryType  Type  = 1 [(gogoproto.nullable) = false];
    optional bytes      Data  = 4;
}

message SnapshotMetadata {
    optional ConfState conf_state = 1 [(gogoproto.nullable) = false];
    optional uint64    index      = 2 [(gogoproto.nullable) = false];
    optional uint64    term       = 3 [(gogoproto.nullable) = false];
}

message Snapshot {
    optional bytes            data     = 1;
    optional SnapshotMetadata metadata = 2 [(gogoproto.nullable) = false];
}

Here, Entry is a log entry, i.e. one log record.

1. The interface the Raft library expects

etcd's Raft library is not usable out of the box: the application must supply storage I/O and network transport. Storage I/O is defined as a Storage interface, which the Raft library uses to read log entries, snapshots and so on. The library does ship a MemoryStorage implementation, but it is purely in-memory and cannot be relied on for persisted data by itself.

The Storage interface is defined as follows:

type Storage interface {
    // InitialState returns the saved HardState and ConfState information.
    InitialState() (pb.HardState, pb.ConfState, error)
    // Entries returns a slice of log entries in the range [lo,hi).
    // MaxSize limits the total size of the log entries returned, but
    // Entries returns at least one entry if any.
    Entries(lo, hi, maxSize uint64) ([]pb.Entry, error)
    // Term returns the term of entry i, which must be in the range
    // [FirstIndex()-1, LastIndex()]. The term of the entry before
    // FirstIndex is retained for matching purposes even though the
    // rest of that entry may not be available.
    Term(i uint64) (uint64, error)
    // LastIndex returns the index of the last entry in the log.
    LastIndex() (uint64, error)
    // FirstIndex returns the index of the first log entry that is
    // possibly available via Entries (older entries have been incorporated
    // into the latest Snapshot; if storage only contains the dummy entry the
    // first log entry is not available).
    FirstIndex() (uint64, error)
    // Snapshot returns the most recent snapshot.
    // If snapshot is temporarily unavailable, it should return ErrSnapshotTemporarilyUnavailable,
    // so raft state machine could know that Storage needs some time to prepare
    // snapshot and call Snapshot later.
    Snapshot() (pb.Snapshot, error)
}

Since MemoryStorage alone is not enough, let's see how etcd itself uses the Raft library. etcd's Storage does reuse MemoryStorage, but only as an in-memory cache layer. On every transactional operation, etcd first flushes the content to persistent storage, and only then writes it into MemoryStorage. As noted above, Storage exists only to report content back to the Raft library; it merely has to stay consistent with the persisted copy, which is easy to guarantee on a single machine. Also, the Raft library drives Storage through raftlog; see etcd/raft/raft.go for details.

2. etcd's concrete implementation

The etcd server persists state through the WAL and snapshots, wrapped in a struct called storage. To avoid confusion, here is the code (etcd/etcdserver/storage.go):

type Storage interface {
    // Save function saves ents and state to the underlying stable storage.
    // Save MUST block until st and ents are on stable storage.
    Save(st raftpb.HardState, ents []raftpb.Entry) error
    // SaveSnap function saves snapshot to the underlying stable storage.
    SaveSnap(snap raftpb.Snapshot) error
    // Close closes the Storage and performs finalization.
    Close() error
}

type storage struct {
    *wal.WAL
    *snap.Snapshotter
}

func NewStorage(w *wal.WAL, s *snap.Snapshotter) Storage {
    return &storage{w, s}
}

Note that this Storage has nothing to do with the earlier one; do not mix them up.

Because of Go's embedding semantics, the storage struct can call the methods of WAL and Snapshotter directly, since the embedded fields have no names. So how does etcd tie the Raft library's MemoryStorage to this storage struct? The answer is in etcd/etcdserver/raft.go: etcd wraps the Raft library further in a raftNode, which contains an anonymous raftNodeConfig member, defined as follows:

type raftNodeConfig struct {
    // to check if msg receiver is removed from cluster
    isIDRemoved func(id uint64) bool
    raft.Node
    raftStorage *raft.MemoryStorage
    storage     Storage
    heartbeat   time.Duration // for logging
    // transport specifies the transport to send and receive msgs to members.
    // Sending messages MUST NOT block. It is okay to drop messages, since
    // clients should timeout and reissue their messages.
    // If transport is nil, server will panic.
    transport rafthttp.Transporter
}

The source makes it plain: raftStorage is what gets handed to the Raft library, while storage is etcd's own persistent store. At runtime, etcd keeps the two consistent by invoking them back to back. Taking an etcd server restart as an example of how they are synchronized, look at restartNode():

func restartNode(cfg ServerConfig, snapshot *raftpb.Snapshot) (types.ID, *membership.RaftCluster, raft.Node, *raft.MemoryStorage, *wal.WAL) {
    var walsnap walpb.Snapshot
    if snapshot != nil {
        walsnap.Index, walsnap.Term = snapshot.Metadata.Index, snapshot.Metadata.Term
    }
    w, id, cid, st, ents := readWAL(cfg.WALDir(), walsnap)

    plog.Infof("restarting member %s in cluster %s at commit index %d", id, cid, st.Commit)
    cl := membership.NewCluster("")
    cl.SetID(cid)
    s := raft.NewMemoryStorage()
    if snapshot != nil {
        s.ApplySnapshot(*snapshot)
    }
    s.SetHardState(st)
    s.Append(ents)
    c := &raft.Config{
        ID:              uint64(id),
        ElectionTick:    cfg.ElectionTicks,
        HeartbeatTick:   1,
        Storage:         s,
        MaxSizePerMsg:   maxSizePerMsg,
        MaxInflightMsgs: maxInflightMsgs,
        CheckQuorum:     true,
    }

    n := raft.RestartNode(c)
    raftStatusMu.Lock()
    raftStatus = n.Status
    raftStatusMu.Unlock()
    advanceTicksForElection(n, c.ElectionTick)
    return id, cl, n, s, w
}

The main logic of this function is to read the snapshot and WAL, then restore MemoryStorage's state via s.SetHardState() and s.Append(). etcd's normal operation follows the same pattern; see the start() method in raft.go:

    if err := r.storage.Save(rd.HardState, rd.Entries); err != nil {
        plog.Fatalf("raft save state and entries error: %v", err)
    }
    if !raft.IsEmptyHardState(rd.HardState) {
        proposalsCommitted.Set(float64(rd.HardState.Commit))
    }
    // gofail: var raftAfterSave struct{}
    r.raftStorage.Append(rd.Entries)

Part of the code is elided, which makes the overall logic clearer: the back-to-back calls r.storage.Save() and r.raftStorage.Append() are what keep storage and raftStorage consistent.

That's it for state storage, but this is only the groundwork of Raft. Next up: Raft's log replication, leader election and commit handling, and of course the RPC layer.

Building an etcd cluster with Docker

etcd is an open-source project started by the CoreOS team (written in Go, like so many projects in this space, which says something about the language). It implements distributed key-value storage and service discovery, much like ZooKeeper and Consul, exposing similar functionality plus a REST API, with these traits:

  • Simple: easy to install and use, operated through a REST API

  • Secure: supports HTTPS/SSL certificates

  • Fast: sustains around 10k concurrent reads/writes per second

  • Reliable: uses the Raft algorithm for availability and consistency of distributed data

etcd can run as a single instance or as a cluster. Many projects use etcd for service discovery, CoreOS and Kubernetes among them, so below we build a simple etcd cluster with Docker.


1. Host installation

Without Docker, installing etcd directly on a host is also very simple.

Linux install commands:

$ curl -L  https://github.com/coreos/etcd/releases/download/v3.3.0-rc.0/etcd-v3.3.0-rc.0-linux-amd64.tar.gz -o etcd-v3.3.0-rc.0-linux-amd64.tar.gz && \
sudo tar xzvf etcd-v3.3.0-rc.0-linux-amd64.tar.gz && \
cd etcd-v3.3.0-rc.0-linux-amd64 && \
sudo cp etcd* /usr/local/bin/

This just copies the prebuilt binaries into /usr/local/bin/; binaries for every release can be found at https://github.com/coreos/etcd/releases/.

macOS install commands:

$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" < /dev/null 2> /dev/null
$ brew install etcd

Run the following to check that etcd installed correctly:

$ etcd --version
etcd Version: 3.2.12
Git SHA: GitNotFound
Go Version: go1.9.2
Go OS/Arch: darwin/amd64

2. Cluster setup

To build the etcd cluster, first create three Docker hosts with Docker Machine:

$ docker-machine create -d virtualbox manager1 && \
docker-machine create -d virtualbox worker1 && \
docker-machine create -d virtualbox worker2

$ docker-machine ls
NAME       ACTIVE   DRIVER       STATE     URL                         SWARM   DOCKER        ERRORS
manager1   -        virtualbox   Running   tcp://192.168.99.100:2376           v17.11.0-ce   
worker1    -        virtualbox   Running   tcp://192.168.99.101:2376           v17.11.0-ce   
worker2    -        virtualbox   Running   tcp://192.168.99.102:2376           v17.11.0-ce   

To avoid slow pulls of the official image from inside the Docker hosts, tag the etcd image and push it to a private registry:

$ docker tag quay.io/coreos/etcd 192.168.99.1:5000/quay.io/coreos/etcd:latest && \
docker push 192.168.99.1:5000/quay.io/coreos/etcd:latest && \
docker pull 192.168.99.1:5000/quay.io/coreos/etcd:latest

The private registry address also has to be configured on the Docker hosts, followed by a restart of all three; for the specifics, refer to the earlier Docker Swarm write-up (Docker 三剑客之 Docker Swarm).

Once the Docker hosts are configured, use docker-machine ssh to enter each of the three hosts and run the Docker etcd configuration commands.

manager1 host (node1, 192.168.99.100):

$ docker run -d --name etcd \
    -p 2379:2379 \
    -p 2380:2380 \
    --volume=etcd-data:/etcd-data \
    192.168.99.1:5000/quay.io/coreos/etcd \
    /usr/local/bin/etcd \
    --data-dir=/etcd-data --name node1 \
    --initial-advertise-peer-urls http://192.168.99.100:2380 --listen-peer-urls http://0.0.0.0:2380 \
    --advertise-client-urls http://192.168.99.100:2379 --listen-client-urls http://0.0.0.0:2379 \
    --initial-cluster-state new \
    --initial-cluster-token docker-etcd \
    --initial-cluster node1=http://192.168.99.100:2380,node2=http://192.168.99.101:2380,node3=http://192.168.99.102:2380

worker1 host (node2, 192.168.99.101):

$ docker run -d --name etcd \
    -p 2379:2379 \
    -p 2380:2380 \
    --volume=etcd-data:/etcd-data \
    192.168.99.1:5000/quay.io/coreos/etcd \
    /usr/local/bin/etcd \
    --data-dir=/etcd-data --name node2 \
    --initial-advertise-peer-urls http://192.168.99.101:2380 --listen-peer-urls http://0.0.0.0:2380 \
    --advertise-client-urls http://192.168.99.101:2379 --listen-client-urls http://0.0.0.0:2379 \
    --initial-cluster-state new \
    --initial-cluster-token docker-etcd \
    --initial-cluster node1=http://192.168.99.100:2380,node2=http://192.168.99.101:2380,node3=http://192.168.99.102:2380

worker2 host (node3, 192.168.99.102):

$ docker run -d --name etcd \
    -p 2379:2379 \
    -p 2380:2380 \
    --volume=etcd-data:/etcd-data \
    192.168.99.1:5000/quay.io/coreos/etcd \
    /usr/local/bin/etcd \
    --data-dir=/etcd-data --name node3 \
    --initial-advertise-peer-urls http://192.168.99.102:2380 --listen-peer-urls http://0.0.0.0:2380 \
    --advertise-client-urls http://192.168.99.102:2379 --listen-client-urls http://0.0.0.0:2379 \
    --initial-cluster-state existing \
    --initial-cluster-token docker-etcd \
    --initial-cluster node1=http://192.168.99.100:2380,node2=http://192.168.99.101:2380,node3=http://192.168.99.102:2380

First, what each etcd flag means (adapted from an etcd getting-started guide):

  • --name: the node name, default default.
  • --data-dir: path where runtime data is kept, default ${name}.etcd.
  • --snapshot-count: how many committed transactions trigger a snapshot to disk.
  • --heartbeat-interval: how often the leader sends heartbeats to the followers, default 100ms.
  • --election-timeout: the re-election timeout; if a follower receives no heartbeat within this interval it triggers a new election, default 1000ms.
  • --listen-peer-urls: the address for peer traffic, e.g. http://ip:2380; comma-separate multiple values. All nodes must be able to reach it, so do not use localhost!
  • --listen-client-urls: the address(es) serving clients, e.g. http://ip:2379,http://127.0.0.1:2379; clients connect here to talk to etcd.
  • --advertise-client-urls: this node's client listen address as announced to the rest of the cluster.
  • --initial-advertise-peer-urls: this node's peer listen address as announced to the rest of the cluster.
  • --initial-cluster: information on every node in the cluster, formatted node1=http://ip1:2380,node2=http://ip2:2380,…; note that node1 here is the node's --name, and ip1:2380 is its --initial-advertise-peer-urls value.
  • --initial-cluster-state: new when creating a cluster; existing when the cluster already exists.
  • --initial-cluster-token: the cluster-creation token, unique per cluster. That way, recreating a cluster with an identical configuration still yields fresh cluster and node uuids; otherwise multiple clusters would conflict, causing unpredictable errors.

All of these can also be set in a config file, default /etc/etcd/etcd.conf.
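
The flags map one-to-one onto ETCD_* environment variables in that file; for node1 above, a sketch:

ETCD_NAME="node1"
ETCD_DATA_DIR="/etcd-data"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.99.100:2380"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.99.100:2379"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_INITIAL_CLUSTER="node1=http://192.168.99.100:2380,node2=http://192.168.99.101:2380,node3=http://192.168.99.102:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="docker-etcd"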

Use docker ps to check that the etcd containers came up:

$ docker ps
CONTAINER ID        IMAGE                                   COMMAND                  CREATED             STATUS              PORTS                              NAMES
463380d23dfe        192.168.99.1:5000/quay.io/coreos/etcd   "/usr/local/bin/et..."   2 hours ago         Up 2 hours          0.0.0.0:2379-2380->2379-2380/tcp   etcd

Then enter the etcd container on one of the Docker hosts:

$ docker exec -it etcd bin/sh

and list the cluster members:

$ etcdctl member list
773d30c9fc6640b4: name=node2 peerURLs=http://192.168.99.101:2380 clientURLs=http://192.168.99.101:2379 isLeader=true
b2b0bca2e0cfcc19: name=node3 peerURLs=http://192.168.99.102:2380 clientURLs=http://192.168.99.102:2379 isLeader=false
c88e2cccbb287a01: name=node1 peerURLs=http://192.168.99.100:2380 clientURLs=http://192.168.99.100:2379 isLeader=false

The cluster has three members; node2 is the leader (isLeader=true), while node1 and node3 are followers.

etcdctl is etcd's command-line client (also written in Go); it wraps etcd's REST API so we can operate etcd conveniently. A detailed etcdctl command reference appears later.

The commands above use etcd API version 2; we can switch to version 3 by hand:

$ export ETCDCTL_API=3 && /usr/local/bin/etcdctl put foo bar
OK

Some commands and their output differ quite a bit from the 2.0 version. Listing cluster members under 3.0, for instance:

$ etcdctl member list
773d30c9fc6640b4, started, node2, http://192.168.99.101:2380, http://192.168.99.101:2379
b2b0bca2e0cfcc19, started, node3, http://192.168.99.102:2380, http://192.168.99.102:2379
c88e2cccbb287a01, started, node1, http://192.168.99.100:2380, http://192.168.99.100:2379

Now let's walk through one more scenario: removing a node from the cluster and then adding it back. To demonstrate etcd's use of the Raft algorithm, we will operate on the leader, node2.

In any host's etcd container (except node2's), remove the member (the ID is required; using the name errors out):

$ etcdctl member remove 773d30c9fc6640b4
Member 773d30c9fc6640b4 removed from cluster f84185fa5f91bdf6

List the members again (v2):

$ etcdctl member list
b2b0bca2e0cfcc19: name=node3 peerURLs=http://192.168.99.102:2380 clientURLs=http://192.168.99.102:2379 isLeader=true
c88e2cccbb287a01: name=node1 peerURLs=http://192.168.99.100:2380 clientURLs=http://192.168.99.100:2379 isLeader=false

node2 has been removed from the cluster, and through the Raft algorithm, node3 has been elected the new leader.

Before node2 can rejoin the cluster, we have to run:

$ etcdctl member add node2 --peer-urls="http://192.168.99.101:2380"
Member 22b0de6ffcd98f00 added to cluster f84185fa5f91bdf6

ETCD_NAME="node2"
ETCD_INITIAL_CLUSTER="node2=http://192.168.99.101:2380,node3=http://192.168.99.102:2380,node1=http://192.168.99.100:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

Note that ETCD_INITIAL_CLUSTER_STATE is existing, i.e. the --initial-cluster-state flag we need to pass.

List the members again (v2):

$ etcdctl member list
22b0de6ffcd98f00[unstarted]: peerURLs=http://192.168.99.101:2380
b2b0bca2e0cfcc19: name=node3 peerURLs=http://192.168.99.102:2380 clientURLs=http://192.168.99.102:2379 isLeader=true
c88e2cccbb287a01: name=node1 peerURLs=http://192.168.99.100:2380 clientURLs=http://192.168.99.100:2379 isLeader=false

Member 22b0de6ffcd98f00 is now in the unstarted state.

On the node2 host, run the Docker etcd cluster command:

$ docker run -d --name etcd \
    -p 2379:2379 \
    -p 2380:2380 \
    --volume=etcd-data:/etcd-data \
    192.168.99.1:5000/quay.io/coreos/etcd \
    /usr/local/bin/etcd \
    --data-dir=/etcd-data --name node2 \
    --initial-advertise-peer-urls http://192.168.99.101:2380 --listen-peer-urls http://0.0.0.0:2380 \
    --advertise-client-urls http://192.168.99.101:2379 --listen-client-urls http://0.0.0.0:2379 \
    --initial-cluster-state existing \
    --initial-cluster-token docker-etcd \
    --initial-cluster node1=http://192.168.99.100:2380,node2=http://192.168.99.101:2380,node3=http://192.168.99.102:2380

It does not come up the way we hoped; check the logs:

$ docker logs etcd
2017-12-25 08:19:30.160967 I | etcdmain: etcd Version: 3.2.12
2017-12-25 08:19:30.161062 I | etcdmain: Git SHA: b19dae0
2017-12-25 08:19:30.161082 I | etcdmain: Go Version: go1.8.5
2017-12-25 08:19:30.161092 I | etcdmain: Go OS/Arch: linux/amd64
2017-12-25 08:19:30.161105 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2017-12-25 08:19:30.161144 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2017-12-25 08:19:30.161195 I | embed: listening for peers on http://0.0.0.0:2380
2017-12-25 08:19:30.161232 I | embed: listening for client requests on 0.0.0.0:2379
2017-12-25 08:19:30.165269 I | etcdserver: name = node2
2017-12-25 08:19:30.165317 I | etcdserver: data dir = /etcd-data
2017-12-25 08:19:30.165335 I | etcdserver: member dir = /etcd-data/member
2017-12-25 08:19:30.165347 I | etcdserver: heartbeat = 100ms
2017-12-25 08:19:30.165358 I | etcdserver: election = 1000ms
2017-12-25 08:19:30.165369 I | etcdserver: snapshot count = 100000
2017-12-25 08:19:30.165385 I | etcdserver: advertise client URLs = http://192.168.99.101:2379
2017-12-25 08:19:30.165593 I | etcdserver: restarting member 773d30c9fc6640b4 in cluster f84185fa5f91bdf6 at commit index 14
2017-12-25 08:19:30.165627 I | raft: 773d30c9fc6640b4 became follower at term 11
2017-12-25 08:19:30.165647 I | raft: newRaft 773d30c9fc6640b4 [peers: [], term: 11, commit: 14, applied: 0, lastindex: 14, lastterm: 11]
2017-12-25 08:19:30.169277 W | auth: simple token is not cryptographically signed
2017-12-25 08:19:30.170424 I | etcdserver: starting server... [version: 3.2.12, cluster version: to_be_decided]
2017-12-25 08:19:30.171732 I | etcdserver/membership: added member 773d30c9fc6640b4 [http://192.168.99.101:2380] to cluster f84185fa5f91bdf6
2017-12-25 08:19:30.171845 I | etcdserver/membership: added member c88e2cccbb287a01 [http://192.168.99.100:2380] to cluster f84185fa5f91bdf6
2017-12-25 08:19:30.171877 I | rafthttp: starting peer c88e2cccbb287a01...
2017-12-25 08:19:30.171902 I | rafthttp: started HTTP pipelining with peer c88e2cccbb287a01
2017-12-25 08:19:30.175264 I | rafthttp: started peer c88e2cccbb287a01
2017-12-25 08:19:30.175339 I | rafthttp: added peer c88e2cccbb287a01
2017-12-25 08:19:30.178326 I | etcdserver/membership: added member cbd7fa8d01297113 [http://192.168.99.102:2380] to cluster f84185fa5f91bdf6
2017-12-25 08:19:30.178383 I | rafthttp: starting peer cbd7fa8d01297113...
2017-12-25 08:19:30.178410 I | rafthttp: started HTTP pipelining with peer cbd7fa8d01297113
2017-12-25 08:19:30.179794 I | rafthttp: started peer cbd7fa8d01297113
2017-12-25 08:19:30.179835 I | rafthttp: added peer cbd7fa8d01297113
2017-12-25 08:19:30.180062 N | etcdserver/membership: set the initial cluster version to 3.0
2017-12-25 08:19:30.180132 I | etcdserver/api: enabled capabilities for version 3.0
2017-12-25 08:19:30.180255 N | etcdserver/membership: updated the cluster version from 3.0 to 3.2
2017-12-25 08:19:30.180430 I | etcdserver/api: enabled capabilities for version 3.2
2017-12-25 08:19:30.183979 I | rafthttp: started streaming with peer c88e2cccbb287a01 (writer)
2017-12-25 08:19:30.184139 I | rafthttp: started streaming with peer c88e2cccbb287a01 (writer)
2017-12-25 08:19:30.184232 I | rafthttp: started streaming with peer c88e2cccbb287a01 (stream MsgApp v2 reader)
2017-12-25 08:19:30.185142 I | rafthttp: started streaming with peer c88e2cccbb287a01 (stream Message reader)
2017-12-25 08:19:30.186518 I | etcdserver/membership: removed member cbd7fa8d01297113 from cluster f84185fa5f91bdf6
2017-12-25 08:19:30.186573 I | rafthttp: stopping peer cbd7fa8d01297113...
2017-12-25 08:19:30.186614 I | rafthttp: started streaming with peer cbd7fa8d01297113 (writer)
2017-12-25 08:19:30.186786 I | rafthttp: stopped streaming with peer cbd7fa8d01297113 (writer)
2017-12-25 08:19:30.186815 I | rafthttp: started streaming with peer cbd7fa8d01297113 (writer)
2017-12-25 08:19:30.186831 I | rafthttp: stopped streaming with peer cbd7fa8d01297113 (writer)
2017-12-25 08:19:30.186876 I | rafthttp: started streaming with peer cbd7fa8d01297113 (stream MsgApp v2 reader)
2017-12-25 08:19:30.187224 I | rafthttp: started streaming with peer cbd7fa8d01297113 (stream Message reader)
2017-12-25 08:19:30.187647 I | rafthttp: stopped HTTP pipelining with peer cbd7fa8d01297113
2017-12-25 08:19:30.187682 I | rafthttp: stopped streaming with peer cbd7fa8d01297113 (stream MsgApp v2 reader)
2017-12-25 08:19:30.187873 I | rafthttp: stopped streaming with peer cbd7fa8d01297113 (stream Message reader)
2017-12-25 08:19:30.187895 I | rafthttp: stopped peer cbd7fa8d01297113
2017-12-25 08:19:30.187911 I | rafthttp: removed peer cbd7fa8d01297113
2017-12-25 08:19:30.188034 I | etcdserver/membership: added member b2b0bca2e0cfcc19 [http://192.168.99.102:2380] to cluster f84185fa5f91bdf6
2017-12-25 08:19:30.188059 I | rafthttp: starting peer b2b0bca2e0cfcc19...
2017-12-25 08:19:30.188075 I | rafthttp: started HTTP pipelining with peer b2b0bca2e0cfcc19
2017-12-25 08:19:30.188510 I | rafthttp: started peer b2b0bca2e0cfcc19
2017-12-25 08:19:30.188533 I | rafthttp: added peer b2b0bca2e0cfcc19
2017-12-25 08:19:30.188795 I | etcdserver/membership: removed member 773d30c9fc6640b4 from cluster f84185fa5f91bdf6
2017-12-25 08:19:30.193643 I | rafthttp: started streaming with peer b2b0bca2e0cfcc19 (writer)
2017-12-25 08:19:30.193730 I | rafthttp: started streaming with peer b2b0bca2e0cfcc19 (writer)
2017-12-25 08:19:30.193797 I | rafthttp: started streaming with peer b2b0bca2e0cfcc19 (stream MsgApp v2 reader)
2017-12-25 08:19:30.194782 I | rafthttp: started streaming with peer b2b0bca2e0cfcc19 (stream Message reader)
2017-12-25 08:19:30.195663 I | raft: 773d30c9fc6640b4 [term: 11] received a MsgHeartbeat message with higher term from b2b0bca2e0cfcc19 [term: 12]
2017-12-25 08:19:30.195716 I | raft: 773d30c9fc6640b4 became follower at term 12
2017-12-25 08:19:30.195736 I | raft: raft.node: 773d30c9fc6640b4 elected leader b2b0bca2e0cfcc19 at term 12
2017-12-25 08:19:30.196617 E | rafthttp: streaming request ignored (ID mismatch got 22b0de6ffcd98f00 want 773d30c9fc6640b4)
2017-12-25 08:19:30.197064 E | rafthttp: streaming request ignored (ID mismatch got 22b0de6ffcd98f00 want 773d30c9fc6640b4)
2017-12-25 08:19:30.197846 E | rafthttp: streaming request ignored (ID mismatch got 22b0de6ffcd98f00 want 773d30c9fc6640b4)
2017-12-25 08:19:30.198242 E | rafthttp: streaming request ignored (ID mismatch got 22b0de6ffcd98f00 want 773d30c9fc6640b4)
2017-12-25 08:19:30.201771 E | etcdserver: the member has been permanently removed from the cluster
2017-12-25 08:19:30.202060 I | etcdserver: the data-dir used by this member must be removed.
2017-12-25 08:19:30.202307 E | etcdserver: publish error: etcdserver: request cancelled
2017-12-25 08:19:30.202338 I | etcdserver: aborting publish because server is stopped
2017-12-25 08:19:30.202364 I | rafthttp: stopping peer b2b0bca2e0cfcc19...
2017-12-25 08:19:30.202482 I | rafthttp: stopped streaming with peer b2b0bca2e0cfcc19 (writer)
2017-12-25 08:19:30.202504 I | rafthttp: stopped streaming with peer b2b0bca2e0cfcc19 (writer)
2017-12-25 08:19:30.204143 I | rafthttp: stopped HTTP pipelining with peer b2b0bca2e0cfcc19
2017-12-25 08:19:30.204186 I | rafthttp: stopped streaming with peer b2b0bca2e0cfcc19 (stream MsgApp v2 reader)
2017-12-25 08:19:30.204205 I | rafthttp: stopped streaming with peer b2b0bca2e0cfcc19 (stream Message reader)
2017-12-25 08:19:30.204217 I | rafthttp: stopped peer b2b0bca2e0cfcc19
2017-12-25 08:19:30.204228 I | rafthttp: stopping peer c88e2cccbb287a01...
2017-12-25 08:19:30.204241 I | rafthttp: stopped streaming with peer c88e2cccbb287a01 (writer)
2017-12-25 08:19:30.204255 I | rafthttp: stopped streaming with peer c88e2cccbb287a01 (writer)
2017-12-25 08:19:30.204824 I | rafthttp: stopped HTTP pipelining with peer c88e2cccbb287a01
2017-12-25 08:19:30.204860 I | rafthttp: stopped streaming with peer c88e2cccbb287a01 (stream MsgApp v2 reader)
2017-12-25 08:19:30.204878 I | rafthttp: stopped streaming with peer c88e2cccbb287a01 (stream Message reader)
2017-12-25 08:19:30.204891 I | rafthttp: stopped peer c88e2cccbb287a01

What all this log output says: although we re-ran the etcd creation command, etcd read its previous data directory and recovered its old member identity, but that member had already been removed from the cluster, so the node just stays stopped.

The fix is simple: delete the etcd-data volume created earlier:

$ docker volume ls
DRIVER              VOLUME NAME
local               etcd-data

$ docker volume rm etcd-data
etcd-data

Then re-run the Docker etcd cluster command (above) on the node2 host, and this time it succeeds.

List the members once more (v2):

$ etcdctl member list
22b0de6ffcd98f00: name=node2 peerURLs=http://192.168.99.101:2380 clientURLs=http://192.168.99.101:2379 isLeader=false
b2b0bca2e0cfcc19: name=node3 peerURLs=http://192.168.99.102:2380 clientURLs=http://192.168.99.102:2379 isLeader=true
c88e2cccbb287a01: name=node1 peerURLs=http://192.168.99.100:2380 clientURLs=http://192.168.99.100:2379 isLeader=false

3. API operations

The etcd REST API covers key-value operations and cluster membership operations. A few basics follow; see the appendix notes for the full API.

3.1 Key-value management

Set a key:

$ curl http://127.0.0.1:2379/v2/keys/hello -XPUT -d value="hello world"
{"action":"set","node":{"key":"/hello","value":"hello world","modifiedIndex":17,"createdIndex":17}}

Get a key:

$ curl http://127.0.0.1:2379/v2/keys/hello
{"action":"get","node":{"key":"/hello","value":"hello world","modifiedIndex":17,"createdIndex":17}}

Delete a key:

$ curl http://127.0.0.1:2379/v2/keys/hello -XDELETE
{"action":"delete","node":{"key":"/hello","modifiedIndex":19,"createdIndex":17},"prevNode":{"key":"/hello","value":"hello world","modifiedIndex":17,"createdIndex":17}}

3.2 Member management

List all cluster members:

$ curl http://127.0.0.1:2379/v2/members
{"members":[{"id":"22b0de6ffcd98f00","name":"node2","peerURLs":["http://192.168.99.101:2380"],"clientURLs":["http://192.168.99.101:2379"]},{"id":"b2b0bca2e0cfcc19","name":"node3","peerURLs":["http://192.168.99.102:2380"],"clientURLs":["http://192.168.99.102:2379"]},{"id":"c88e2cccbb287a01","name":"node1","peerURLs":["http://192.168.99.100:2380"],"clientURLs":["http://192.168.99.100:2379"]}]}

Check whether the current node is the leader:

$ curl http://127.0.0.1:2379/v2/stats/leader
{"leader":"b2b0bca2e0cfcc19","followers":{"22b0de6ffcd98f00":{"latency":{"current":0.001051,"average":0.0029195000000000002,"standardDeviation":0.001646769458667484,"minimum":0.001051,"maximum":0.006367},"counts":{"fail":0,"success":10}},"c88e2cccbb287a01":{"latency":{"current":0.000868,"average":0.0022389999999999997,"standardDeviation":0.0011402923601720172,"minimum":0.000868,"maximum":0.004725},"counts":{"fail":0,"success":12}}}}

View the current node's info:

$ curl http://127.0.0.1:2379/v2/stats/self
{"name":"node3","id":"b2b0bca2e0cfcc19","state":"StateLeader","startTime":"2017-12-25T06:00:28.803429523Z","leaderInfo":{"leader":"b2b0bca2e0cfcc19","uptime":"36m45.45263851s","startTime":"2017-12-25T08:13:02.103896843Z"},"recvAppendRequestCnt":6,"sendAppendRequestCnt":22}

View store statistics:

$ curl http://127.0.0.1:2379/v2/stats/store
{"getsSuccess":9,"getsFail":4,"setsSuccess":9,"setsFail":0,"deleteSuccess":3,"deleteFail":0,"updateSuccess":0,"updateFail":0,"createSuccess":7,"createFail":0,"compareAndSwapSuccess":0,"compareAndSwapFail":0,"compareAndDeleteSuccess":0,"compareAndDeleteFail":0,"expireCount":0,"watchers":0}

Members can of course also be added and removed through the API.

4. API and etcdctl reference

etcd REST API notes (v2):

[reference table omitted]

For more on the API see: https://coreos.com/etcd/docs/latest/v2/api.html and https://coreos.com/etcd/docs/latest/v2/members_api.html

etcdctl command notes:

[reference table omitted]

Deploying zookeeper and etcd as stateful services

I. Overview

kubernetes fully supports stateful applications like zookeeper and etcd through StatefulSets, which provide:

  • a stable, unique network identity per pod
  • stable persistent storage, implemented via pv/pvc
  • ordered pod startup and shutdown: graceful deployment and scaling

This article walks through deploying zookeeper and etcd as stateful services on a k8s cluster, with ceph providing data persistence.

II. Summary

  • k8s StatefulSets, StorageClasses, pv/pvc and ceph rbd together support running stateful services like zookeeper and etcd on a kubernetes cluster quite well.
  • k8s never deletes created pv/pvc objects on its own, which guards against accidental deletion.

If you do decide to delete the pv/pvc objects, you must also delete the corresponding rbd images on the ceph side by hand.

  • A pitfall

The ceph client user referenced in the StorageClass must have mon rw and rbd rwx permissions. Without mon write permission, releasing the rbd lock fails and the rbd image cannot be mounted on another k8s worker node.

  • zookeeper uses probes to check node health; if a node is unhealthy, k8s deletes the pod and automatically recreates it, effectively auto-restarting the zookeeper node.

Because zookeeper 3.4 configures cluster membership through the statically loaded zoo.cfg file, all nodes in the zookeeper cluster must be restarted whenever a zookeeper pod's IP changes.

  • The etcd deployment approach needs work

This experiment deploys the etcd cluster statically, so whenever etcd nodes change, commands like etcdctl member remove/add must be run by hand, which severely limits automatic failure recovery and scaling. The deployment should be reworked to bootstrap etcd dynamically via DNS or etcd discovery (see the sketch below) so that etcd runs better on k8s.
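
For reference, discovery-based bootstrap replaces --initial-cluster entirely (a sketch using etcd's public discovery service; the token URL is whatever the first command returns, and the listen/advertise values reuse the earlier cluster example):

# request a discovery URL sized for 3 members
curl -w "\n" 'https://discovery.etcd.io/new?size=3'
# start each member with the returned URL instead of --initial-cluster
etcd --name myetcd1 \
  --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://192.168.9.100:2379 \
  --listen-peer-urls http://0.0.0.0:2380 --initial-advertise-peer-urls http://192.168.9.100:2380 \
  --discovery https://discovery.etcd.io/REPLACE_WITH_RETURNED_TOKEN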

III. Zookeeper cluster deployment

1. Pull the image

docker pull gcr.mirrors.ustc.edu.cn/google_containers/kubernetes-zookeeper:1.0-3.4.10
docker tag gcr.mirrors.ustc.edu.cn/google_containers/kubernetes-zookeeper:1.0-3.4.10 172.16.18.100:5000/gcr.io/google_containers/kubernetes-zookeeper:1.0-3.4.10
docker push  172.16.18.100:5000/gcr.io/google_containers/kubernetes-zookeeper:1.0-3.4.10

2. Define the ceph secret

cat << EOF | kubectl create -f -
apiVersion: v1
data:
  key: QVFBYy9ndGFRUno4QlJBQXMxTjR3WnlqN29PK3VrMzI1a05aZ3c9PQo=
kind: Secret
metadata:
  creationTimestamp: 2017-11-20T10:29:05Z
  name: ceph-secret
  namespace: default
  resourceVersion: "2954730"
  selfLink: /api/v1/namespaces/default/secrets/ceph-secret
  uid: a288ff74-cddd-11e7-81cc-000c29f99475
type: kubernetes.io/rbd
EOF

3. Define the rbd StorageClass

cat << EOF | kubectl create -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph
parameters:
  adminId: admin
  adminSecretName: ceph-secret
  adminSecretNamespace: default
  fsType: ext4
  imageFormat: "2"
  imagefeatures: layering
  monitors: 172.16.13.223
  pool: k8s
  userId: admin
  userSecretName: ceph-secret
provisioner: kubernetes.io/rbd
reclaimPolicy: Delete
EOF

4. Create the zookeeper cluster

zookeeper node data lives on rbd:

cat << EOF | kubectl create -f -
---
apiVersion: v1
kind: Service
metadata:
  name: zk-hs
  labels:
    app: zk
spec:
  ports:
  - port: 2888
    name: server
  - port: 3888
    name: leader-election
  clusterIP: None
  selector:
    app: zk
---
apiVersion: v1
kind: Service
metadata:
  name: zk-cs
  labels:
    app: zk
spec:
  ports:
  - port: 2181
    name: client
  selector:
    app: zk
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  selector:
    matchLabels:
      app: zk
  maxUnavailable: 1
---
apiVersion: apps/v1beta2 # for versions before 1.8.0 use apps/v1beta1
kind: StatefulSet
metadata:
  name: zk
spec:
  selector:
    matchLabels:
      app: zk
  serviceName: zk-hs
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: Parallel
  template:
    metadata:
      labels:
        app: zk
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                    - zk
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: kubernetes-zookeeper
        imagePullPolicy: Always
        image: "172.16.18.100:5000/gcr.io/google_containers/kubernetes-zookeeper:1.0-3.4.10"
        ports:
        - containerPort: 2181
          name: client
        - containerPort: 2888
          name: server
        - containerPort: 3888
          name: leader-election
        command:
        - sh
        - -c
        - "start-zookeeper 
          --servers=3 
          --data_dir=/var/lib/zookeeper/data 
          --data_log_dir=/var/lib/zookeeper/data/log 
          --conf_dir=/opt/zookeeper/conf 
          --client_port=2181 
          --election_port=3888 
          --server_port=2888 
          --tick_time=2000 
          --init_limit=10 
          --sync_limit=5 
          --heap=512M 
          --max_client_cnxns=60 
          --snap_retain_count=3 
          --purge_interval=12 
          --max_session_timeout=40000 
          --min_session_timeout=4000 
          --log_level=INFO"
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        livenessProbe:
          exec:
            command:
            - sh
            - -c
            - "zookeeper-ready 2181"
          initialDelaySeconds: 10
          timeoutSeconds: 5
        volumeMounts:
        - name: datadir
          mountPath: /var/lib/zookeeper
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
  volumeClaimTemplates:
  - metadata:
      name: datadir
      annotations:
        volume.beta.kubernetes.io/storage-class: ceph
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
EOF

Check the result:

[root@172 zookeeper]# kubectl get no
NAME           STATUS    ROLES     AGE       VERSION
172.16.20.10   Ready     <none>    50m       v1.8.2
172.16.20.11   Ready     <none>    2h        v1.8.2
172.16.20.12   Ready     <none>    1h        v1.8.2

[root@172 zookeeper]# kubectl get po -owide 
NAME      READY     STATUS    RESTARTS   AGE       IP              NODE
zk-0      1/1       Running   0          8m        192.168.5.162   172.16.20.10
zk-1      1/1       Running   0          1h        192.168.2.146   172.16.20.11

[root@172 zookeeper]# kubectl get pv,pvc
NAME                                          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM                  STORAGECLASS   REASON    AGE
pv/pvc-226cb8f0-d322-11e7-9581-000c29f99475   1Gi        RWO            Delete           Bound     default/datadir-zk-0   ceph                     1h
pv/pvc-22703ece-d322-11e7-9581-000c29f99475   1Gi        RWO            Delete           Bound     default/datadir-zk-1   ceph                     1h

NAME               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc/datadir-zk-0   Bound     pvc-226cb8f0-d322-11e7-9581-000c29f99475   1Gi        RWO            ceph           1h
pvc/datadir-zk-1   Bound     pvc-22703ece-d322-11e7-9581-000c29f99475   1Gi        RWO            ceph           1h

The rbd lock info for the zk-0 pod:

[root@ceph1 ceph]# rbd lock list kubernetes-dynamic-pvc-227b45e5-d322-11e7-90ab-000c29f99475 -p k8s --user admin
There is 1 exclusive lock on this image.
Locker       ID                              Address                   
client.24146 kubelet_lock_magic_172.16.20.10 172.16.20.10:0/1606152350 

5. Test pod migration

Try marking node 172.16.20.10 unschedulable so that the zk-0 pod gets rescheduled onto 172.16.20.12:

kubectl cordon 172.16.20.10

[root@172 zookeeper]# kubectl get no
NAME           STATUS                     ROLES     AGE       VERSION
172.16.20.10   Ready,SchedulingDisabled   <none>    58m       v1.8.2
172.16.20.11   Ready                      <none>    2h        v1.8.2
172.16.20.12   Ready                      <none>    1h        v1.8.2

kubectl delete po zk-0

Watch zk-0 being rescheduled:

[root@172 zookeeper]# kubectl get po -owide -w
NAME      READY     STATUS    RESTARTS   AGE       IP              NODE
zk-0      1/1       Running   0          14m       192.168.5.162   172.16.20.10
zk-1      1/1       Running   0          1h        192.168.2.146   172.16.20.11
zk-0      1/1       Terminating   0         16m       192.168.5.162   172.16.20.10
zk-0      0/1       Terminating   0         16m       <none>    172.16.20.10
zk-0      0/1       Terminating   0         16m       <none>    172.16.20.10
zk-0      0/1       Terminating   0         16m       <none>    172.16.20.10
zk-0      0/1       Terminating   0         16m       <none>    172.16.20.10
zk-0      0/1       Terminating   0         16m       <none>    172.16.20.10
zk-0      0/1       Pending   0         0s        <none>    <none>
zk-0      0/1       Pending   0         0s        <none>    172.16.20.12
zk-0      0/1       ContainerCreating   0         0s        <none>    172.16.20.12
zk-0      0/1       Running   0         3s        192.168.3.4   172.16.20.12

zk-0 has now migrated to 172.16.20.12 as expected.
Check the rbd lock info again:

[root@ceph1 ceph]# rbd lock list kubernetes-dynamic-pvc-227b45e5-d322-11e7-90ab-000c29f99475 -p k8s --user admin
There is 1 exclusive lock on this image.
Locker       ID                              Address                   
client.24146 kubelet_lock_magic_172.16.20.10 172.16.20.10:0/1606152350 
[root@ceph1 ceph]# rbd lock list kubernetes-dynamic-pvc-227b45e5-d322-11e7-90ab-000c29f99475 -p k8s --user admin
There is 1 exclusive lock on this image.
Locker       ID                              Address                   
client.24154 kubelet_lock_magic_172.16.20.12 172.16.20.12:0/3715989358 

When this zk pod migration was tested earlier on another ceph cluster, releasing the lock always failed. Analysis suggested the ceph account in use lacked the required permissions, which made the lock release fail. The recorded errors:

Nov 27 10:45:55 172 kubelet: W1127 10:45:55.551768   11556 rbd_util.go:471] rbd: no watchers on kubernetes-dynamic-pvc-f35a411e-d317-11e7-90ab-000c29f99475
Nov 27 10:45:55 172 kubelet: I1127 10:45:55.694126   11556 rbd_util.go:181] remove orphaned locker kubelet_lock_magic_172.16.20.12 from client client.171490: err exit status 13, output: 2017-11-27 10:45:55.570483 7fbdbe922d40 -1 did not load config file, using default settings.
Nov 27 10:45:55 172 kubelet: 2017-11-27 10:45:55.600816 7fbdbe922d40 -1 Errors while parsing config file!
Nov 27 10:45:55 172 kubelet: 2017-11-27 10:45:55.600824 7fbdbe922d40 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
Nov 27 10:45:55 172 kubelet: 2017-11-27 10:45:55.600825 7fbdbe922d40 -1 parse_file: cannot open ~/.ceph/ceph.conf: (2) No such file or directory
Nov 27 10:45:55 172 kubelet: 2017-11-27 10:45:55.600825 7fbdbe922d40 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
Nov 27 10:45:55 172 kubelet: 2017-11-27 10:45:55.602492 7fbdbe922d40 -1 Errors while parsing config file!
Nov 27 10:45:55 172 kubelet: 2017-11-27 10:45:55.602494 7fbdbe922d40 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
Nov 27 10:45:55 172 kubelet: 2017-11-27 10:45:55.602495 7fbdbe922d40 -1 parse_file: cannot open ~/.ceph/ceph.conf: (2) No such file or directory
Nov 27 10:45:55 172 kubelet: 2017-11-27 10:45:55.602496 7fbdbe922d40 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
Nov 27 10:45:55 172 kubelet: 2017-11-27 10:45:55.651594 7fbdbe922d40 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.k8s.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
Nov 27 10:45:55 172 kubelet: rbd: releasing lock failed: (13) Permission denied
Nov 27 10:45:55 172 kubelet: 2017-11-27 10:45:55.682470 7fbdbe922d40 -1 librbd: unable to blacklist client: (13) Permission denied

The k8s rbd volume implementation code:

if lock {
            // check if lock is already held for this host by matching lock_id and rbd lock id
            if strings.Contains(output, lock_id) {
                // this host already holds the lock, exit
                glog.V(1).Infof("rbd: lock already held for %s", lock_id)
                return nil
            }
            // clean up orphaned lock if no watcher on the image
            used, statusErr := util.rbdStatus(&b)
            if statusErr == nil && !used {
                re := regexp.MustCompile("client.* " + kubeLockMagic + ".*")
                locks := re.FindAllStringSubmatch(output, -1)
                for _, v := range locks {
                    if len(v) > 0 {
                        lockInfo := strings.Split(v[0], " ")
                        if len(lockInfo) > 2 {
                            args := []string{"lock", "remove", b.Image, lockInfo[1], lockInfo[0], "--pool", b.Pool, "--id", b.Id, "-m", mon}
                            args = append(args, secret_opt...)
                            cmd, err = b.exec.Run("rbd", args...)
                            // the rbd lock remove command returned an error here
                            glog.Infof("remove orphaned locker %s from client %s: err %v, output: %s", lockInfo[1], lockInfo[0], err, string(cmd))
                        }
                    }
                }
            }

            // hold a lock: rbd lock add
            args := []string{"lock", "add", b.Image, lock_id, "--pool", b.Pool, "--id", b.Id, "-m", mon}
            args = append(args, secret_opt...)
            cmd, err = b.exec.Run("rbd", args...)
        } 

As shown above, the rbd lock remove operation was rejected for lack of permission: rbd: releasing lock failed: (13) Permission denied.
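A plausible fix is to widen the caps of the ceph user that the storageclass references; a hedged example for a user named client.k8s and pool k8s (adjust both to your environment):

# mon rw is needed so the client may blacklist peers and release locks;
# rwx on the pool covers the rbd image operations
ceph auth caps client.k8s mon 'allow rw' osd 'allow rwx pool=k8s'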

6. Test scale-out

Scale the zookeeper cluster from 2 nodes to 3.
With 2 nodes, zoo.cfg defines two instances:

zookeeper@zk-0:/opt/zookeeper/conf$ cat zoo.cfg 
#This file was autogenerated DO NOT EDIT
clientPort=2181
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/data/log
tickTime=2000
initLimit=10
syncLimit=5
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
autopurge.snapRetainCount=3
autopurge.purgeInteval=12
server.1=zk-0.zk-hs.default.svc.cluster.local:2888:3888
server.2=zk-1.zk-hs.default.svc.cluster.local:2888:3888

Use the kubectl edit statefulset zk command to change replicas=3 and start-zookeeper --servers=3.
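The replica count alone can also be changed non-interactively (the --servers argument in the pod command still has to be edited):

kubectl scale statefulset zk --replicas=3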
Then watch the pods change:

[root@172 zookeeper]# kubectl get po -owide -w
NAME      READY     STATUS    RESTARTS   AGE       IP              NODE
zk-0      1/1       Running   0          1h        192.168.5.170   172.16.20.10
zk-1      1/1       Running   0          1h        192.168.3.12    172.16.20.12
zk-2      0/1       Pending   0         0s        <none>    <none>
zk-2      0/1       Pending   0         0s        <none>    172.16.20.11
zk-2      0/1       ContainerCreating   0         0s        <none>    172.16.20.11
zk-2      0/1       Running   0         1s        192.168.2.154   172.16.20.11
zk-2      1/1       Running   0         11s       192.168.2.154   172.16.20.11
zk-1      1/1       Terminating   0         1h        192.168.3.12   172.16.20.12
zk-1      0/1       Terminating   0         1h        <none>    172.16.20.12
zk-1      0/1       Terminating   0         1h        <none>    172.16.20.12
zk-1      0/1       Terminating   0         1h        <none>    172.16.20.12
zk-1      0/1       Terminating   0         1h        <none>    172.16.20.12
zk-1      0/1       Pending   0         0s        <none>    <none>
zk-1      0/1       Pending   0         0s        <none>    172.16.20.12
zk-1      0/1       ContainerCreating   0         0s        <none>    172.16.20.12
zk-1      0/1       Running   0         2s        192.168.3.13   172.16.20.12
zk-1      1/1       Running   0         20s       192.168.3.13   172.16.20.12
zk-0      1/1       Terminating   0         1h        192.168.5.170   172.16.20.10
zk-0      0/1       Terminating   0         1h        <none>    172.16.20.10
zk-0      0/1       Terminating   0         1h        <none>    172.16.20.10
zk-0      0/1       Terminating   0         1h        <none>    172.16.20.10
zk-0      0/1       Terminating   0         1h        <none>    172.16.20.10
zk-0      0/1       Pending   0         0s        <none>    <none>
zk-0      0/1       Pending   0         0s        <none>    172.16.20.10
zk-0      0/1       ContainerCreating   0         0s        <none>    172.16.20.10
zk-0      0/1       Running   0         2s        192.168.5.171   172.16.20.10
zk-0      1/1       Running   0         12s       192.168.5.171   172.16.20.10

You can see that zk-0 and zk-1 were both restarted, so the new zoo.cfg is loaded and the cluster stays correctly configured.
The new zoo.cfg lists three instances:

[root@172 ~]# kubectl exec zk-0 -- cat /opt/zookeeper/conf/zoo.cfg
#This file was autogenerated DO NOT EDIT
clientPort=2181
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/data/log
tickTime=2000
initLimit=10
syncLimit=5
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
autopurge.snapRetainCount=3
autopurge.purgeInteval=12
server.1=zk-0.zk-hs.default.svc.cluster.local:2888:3888
server.2=zk-1.zk-hs.default.svc.cluster.local:2888:3888
server.3=zk-2.zk-hs.default.svc.cluster.local:2888:3888

7. Test scale-in

During scale-in, the zk cluster again restarted all zk nodes automatically; the process:

[root@172 ~]# kubectl get po -owide -w
NAME      READY     STATUS    RESTARTS   AGE       IP              NODE
zk-0      1/1       Running   0          5m        192.168.5.171   172.16.20.10
zk-1      1/1       Running   0          6m        192.168.3.13    172.16.20.12
zk-2      1/1       Running   0          7m        192.168.2.154   172.16.20.11
zk-2      1/1       Terminating   0         7m        192.168.2.154   172.16.20.11
zk-1      1/1       Terminating   0         7m        192.168.3.13   172.16.20.12
zk-2      0/1       Terminating   0         8m        <none>    172.16.20.11
zk-1      0/1       Terminating   0         7m        <none>    172.16.20.12
zk-2      0/1       Terminating   0         8m        <none>    172.16.20.11
zk-1      0/1       Terminating   0         7m        <none>    172.16.20.12
zk-1      0/1       Terminating   0         7m        <none>    172.16.20.12
zk-1      0/1       Terminating   0         7m        <none>    172.16.20.12
zk-1      0/1       Pending   0         0s        <none>    <none>
zk-1      0/1       Pending   0         0s        <none>    172.16.20.12
zk-1      0/1       ContainerCreating   0         0s        <none>    172.16.20.12
zk-1      0/1       Running   0         2s        192.168.3.14   172.16.20.12
zk-2      0/1       Terminating   0         8m        <none>    172.16.20.11
zk-2      0/1       Terminating   0         8m        <none>    172.16.20.11
zk-1      1/1       Running   0         19s       192.168.3.14   172.16.20.12
zk-0      1/1       Terminating   0         7m        192.168.5.171   172.16.20.10
zk-0      0/1       Terminating   0         7m        <none>    172.16.20.10
zk-0      0/1       Terminating   0         7m        <none>    172.16.20.10
zk-0      0/1       Terminating   0         7m        <none>    172.16.20.10
zk-0      0/1       Pending   0         0s        <none>    <none>
zk-0      0/1       Pending   0         0s        <none>    172.16.20.10
zk-0      0/1       ContainerCreating   0         0s        <none>    172.16.20.10
zk-0      0/1       Running   0         3s        192.168.5.172   172.16.20.10
zk-0      1/1       Running   0         13s       192.168.5.172   172.16.20.10

IV. etcd cluster deployment

1. Create the etcd cluster

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  name: "etcd"
  annotations:
    # Create endpoints also if the related pod isn't ready
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
  ports:
  - port: 2379
    name: client
  - port: 2380
    name: peer
  clusterIP: None
  selector:
    component: "etcd"
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: "etcd"
  labels:
    component: "etcd"
spec:
  serviceName: "etcd"
  # changing replicas value will require a manual etcdctl member remove/add
  # command (remove before decreasing and add after increasing)
  replicas: 3
  template:
    metadata:
      name: "etcd"
      labels:
        component: "etcd"
    spec:
      containers:
      - name: "etcd"
        image: "172.16.18.100:5000/quay.io/coreos/etcd:v3.2.3"
        ports:
        - containerPort: 2379
          name: client
        - containerPort: 2380
          name: peer
        env:
        - name: CLUSTER_SIZE
          value: "3"
        - name: SET_NAME
          value: "etcd"
        volumeMounts:
        - name: data
          mountPath: /var/run/etcd
        command:
          - "/bin/sh"
          - "-ecx"
          - |
            IP=$(hostname -i)
            for i in $(seq 0 $((${CLUSTER_SIZE} - 1))); do
              while true; do
                echo "Waiting for ${SET_NAME}-${i}.${SET_NAME} to come up"
                ping -W 1 -c 1 ${SET_NAME}-${i}.${SET_NAME}.default.svc.cluster.local > /dev/null && break
                sleep 1s
              done
            done
            PEERS=""
            for i in $(seq 0 $((${CLUSTER_SIZE} - 1))); do
                PEERS="${PEERS}${PEERS:+,}${SET_NAME}-${i}=http://${SET_NAME}-${i}.${SET_NAME}.default.svc.cluster.local:2380"
            done
            # start etcd. If cluster is already initialized the `--initial-*` options will be ignored.
            exec etcd --name ${HOSTNAME} \
              --listen-peer-urls http://${IP}:2380 \
              --listen-client-urls http://${IP}:2379,http://127.0.0.1:2379 \
              --advertise-client-urls http://${HOSTNAME}.${SET_NAME}:2379 \
              --initial-advertise-peer-urls http://${HOSTNAME}.${SET_NAME}:2380 \
              --initial-cluster-token etcd-cluster-1 \
              --initial-cluster ${PEERS} \
              --initial-cluster-state new \
              --data-dir /var/run/etcd/default.etcd
## Dynamic pv provisioning here uses the "ceph" storage class defined
## above, so this resource can be deployed as-is. In production, define
## your own way to use pv claims.
  volumeClaimTemplates:
  - metadata:
      name: data
      annotations:
        volume.beta.kubernetes.io/storage-class: ceph
    spec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 1Gi
EOF

After creation, the pods look like this:

[root@172 etcd]# kubectl get po -owide 
NAME      READY     STATUS    RESTARTS   AGE       IP              NODE
etcd-0    1/1       Running   0          15m       192.168.5.174   172.16.20.10
etcd-1    1/1       Running   0          15m       192.168.3.16    172.16.20.12
etcd-2    1/1       Running   0          5s        192.168.5.176   172.16.20.10

2. Test scale-in

kubectl scale statefulset etcd --replicas=2

[root@172 ~]# kubectl get po -owide -w
NAME      READY     STATUS    RESTARTS   AGE       IP              NODE
etcd-0    1/1       Running   0          17m       192.168.5.174   172.16.20.10
etcd-1    1/1       Running   0          17m       192.168.3.16    172.16.20.12
etcd-2    1/1       Running   0          1m        192.168.5.176   172.16.20.10
etcd-2    1/1       Terminating   0         1m        192.168.5.176   172.16.20.10
etcd-2    0/1       Terminating   0         1m        <none>    172.16.20.10

Check cluster health:

kubectl exec etcd-0 -- etcdctl cluster-health

failed to check the health of member 42c8b94265b9b79a on http://etcd-2.etcd:2379: Get http://etcd-2.etcd:2379/health: dial tcp: lookup etcd-2.etcd on 10.96.0.10:53: no such host
member 42c8b94265b9b79a is unreachable: [http://etcd-2.etcd:2379] are all unreachable
member 9869f0647883a00d is healthy: got healthy result from http://etcd-1.etcd:2379
member c799a6ef06bc8c14 is healthy: got healthy result from http://etcd-0.etcd:2379
cluster is healthy

After scaling in, etcd-2 was not automatically removed from the etcd cluster, so this etcd image's support for automatic scaling is clearly limited.
Remove etcd-2 by hand:

[root@172 etcd]# kubectl exec etcd-0 -- etcdctl member remove 42c8b94265b9b79a
Removed member 42c8b94265b9b79a from cluster
[root@172 etcd]# kubectl exec etcd-0 -- etcdctl cluster-health                
member 9869f0647883a00d is healthy: got healthy result from http://etcd-1.etcd:2379
member c799a6ef06bc8c14 is healthy: got healthy result from http://etcd-0.etcd:2379
cluster is healthy

3. Test scale-out

As the startup script in etcd.yaml shows, a newly started etcd pod is launched with --initial-cluster-state new, so this image does not support dynamic scale-out. Consider switching the startup script to a DNS-based dynamic etcd deployment so that the etcd cluster can scale out dynamically.
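Until then, manual scale-out means registering the member before starting it, and starting the new node with the cluster state set to existing; a rough sketch (the member name and URL are placeholders):

# 1. register the new member with the running cluster
kubectl exec etcd-0 -- etcdctl member add etcd-3 http://etcd-3.etcd:2380

# 2. start the new pod with --initial-cluster including etcd-3
#    and --initial-cluster-state existing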

etcd: inaccurate TTL expiry

Symptoms

An etcd cluster is deployed across 10.8.65.106, 10.8.65.107 and 10.8.65.108.

Setting a TTL on a key with etcdctl and then watching it, I found the expiry time was inaccurate, and the error varied from run to run.

For example, set a 10-second TTL on /mytest/test:

[root@node-106 ~]# date &&etcdctl set --ttl 10 /mytest/test hello &&date
Fri Sep  2 05:31:10 EDT 2016
hello
Fri Sep  2 05:31:10 EDT 2016

The machine clock is on EDT (UTC-4), so the expected expiration in UTC is 2016-09-02T09:31:20.

The watch, however, shows that etcd set the expiration to 2016-09-02T09:31:18 rather than 2016-09-02T09:31:20:

[root@node-106 ~]# curl -X GET "http://10.8.65.108:2379/v2/keys/mytest/test1?recursive=false&wait=true&stream=true"
{"action":"set","node":{"key":"/mytest/test","value":"hello","expiration":"2016-09-02T09:31:18.221701998Z","ttl":17,"modifiedIndex":306840,"createdIndex":306840}}
{"action":"expire","node":{"key":"/mytest/test","modifiedIndex":306844,"createdIndex":306840},"prevNode":{"key":"/mytest/test","value":"hello","expiration":"2016-09-02T09:31:18.221701998Z","ttl":9,"modifiedIndex":306840,"createdIndex":306840}}
{"action":"expire","node":{"key":"/mytest/test","modifiedIndex":306844,"createdIndex":306840},"prevNode":{"key":"/mytest/test","value":"hello","expiration":"2016-09-02T09:31:18.221701998Z","ttl":9,"modifiedIndex":306840,"createdIndex":306840}}

Repeating the experiment many times, the gap between the theoretical 10-second TTL and the actual expiry ranged from 0 up to as much as 9 seconds, seemingly at random.

Analysis

Turn on debug mode for a closer look:

[root@node-106 ~]# date && etcdctl --debug set --ttl 10 /mytest/test1 hello && date
Fri Sep  2 05:57:20 EDT 2016
start to sync cluster using endpoints(http://127.0.0.1:4001,http://127.0.0.1:2379)
cURL Command: curl -X GET http://127.0.0.1:4001/v2/members
cURL Command: curl -X GET http://127.0.0.1:2379/v2/members
got endpoints(http://10.8.65.107:2379,http://10.8.65.106:2379,http://10.8.65.108:2379) after sync
Cluster-Endpoints: http://10.8.65.107:2379, http://10.8.65.106:2379, http://10.8.65.108:2379
cURL Command: curl -X PUT http://10.8.65.107:2379/v2/keys/mytest/test1 -d "ttl=10&value=hello"
hello
Fri Sep  2 05:57:20 EDT 2016

As the debug output shows, etcdctl first fetches the cluster members and then sends the set mytest/test1 request to one of them; the member that receives it is effectively random. Here the request landed on 10.8.65.107.
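To take the endpoint choice out of the picture while debugging, etcdctl v2 can be pointed at a single member and told not to sync the member list first (flag availability varies by etcdctl version, so treat this as a sketch):

etcdctl --no-sync --endpoints http://10.8.65.107:2379 set --ttl 10 /mytest/test1 hello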

I then compared the clocks on the three machines and found they were out of sync. That looked like the root cause, so I synchronized them with ntp:

[root@node-106 ~]# ntpdate pool.ntp.org
 2 Sep 05:45:23 ntpdate[24846]: adjust time server 120.25.108.11 offset -0.000273 sec

After that, TTL expiry times were accurate again.

Recap and fix

Looking back, the root cause was clock skew between the nodes. If the problem shows up again, it can be diagnosed from the response:

[root@node-106 ~]# curl -X GET "http://10.8.65.108:2379/v2/keys/mytest/test1?recursive=false&wait=true&stream=true"
{"action":"set","node":{"key":"/mytest/test","value":"hello","expiration":"2016-09-02T09:31:18.221701998Z","ttl":17,"modifiedIndex":306840,"createdIndex":306840}}

In the response to the set action, the ttl field should match the TTL you set. If it does not, clock skew between the nodes is the most likely cause.

So the fix is to keep the three machines' clocks synchronized; the inaccurate TTL expiry then no longer occurs.
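A one-shot ntpdate only fixes the clock once. On CentOS 7 it is better to keep the clocks synchronized continuously, for example with chrony:

yum install -y chrony
systemctl enable chronyd
systemctl start chronyd
# verify the time sources
chronyc sources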

Setting up an etcd cluster on CentOS 7

I. Introduction

“A highly-available key value store for shared configuration and service discovery.”

etcd is a distributed service system developed by CoreOS, using the raft protocol internally as its consensus algorithm. As a highly available key-value store for shared configuration and service discovery, etcd has the following traits:

  • Simple: easy to install and configure, and it exposes an HTTP API that is straightforward to use
  • Secure: supports SSL certificate authentication
  • Fast: per the official numbers, a single instance handles 2k+ reads and 1k writes per second
  • Reliable: uses the raft algorithm to keep distributed data available and consistent

etcd can build a highly available cluster in one of three ways:

  • Static discovery: the etcd cluster members are known in advance, and each node's address is specified directly at startup
  • etcd dynamic discovery: an existing etcd cluster serves as the rendezvous point, and a new cluster bootstraps through service discovery against it
  • DNS dynamic discovery: node addresses are obtained through DNS queries

This article covers the first approach; the other two will be covered later. (If you just want to run it in docker, see quay.io/coreos/etcd for a cluster built on the Docker image.)

II. Environment

Three virtual machines, all running CentOS 7, with node names and IP addresses as follows:

  • node1:192.168.7.163
  • node2:192.168.7.57
  • etcd2:192.168.7.58

First add this mapping to the hosts file on all three machines; edit /etc/hosts and append:

192.168.7.163 node1
192.168.7.57 node2
192.168.7.58 etcd2

III. Install and configure etcd

# yum install etcd -y

A yum-installed etcd reads its default config from /etc/etcd/etcd.conf. The configs for the three nodes are shown below; note where they differ (anything not shown needs no change).

node1

# [member]
# node name
ETCD_NAME=node1
# data directory
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
#ETCD_WAL_DIR=""
#ETCD_SNAPSHOT_COUNT="10000"
#ETCD_HEARTBEAT_INTERVAL="100"
#ETCD_ELECTION_TIMEOUT="1000"
# address to listen on for traffic from other etcd instances (peers)
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
# address to listen on for client traffic
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379,http://0.0.0.0:4001"
#ETCD_MAX_SNAPSHOTS="5"
#ETCD_MAX_WALS="5"
#ETCD_CORS=""
#
#[cluster]
# peer address advertised to the other etcd instances
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://node1:2380"
# if you use different ETCD_NAME (e.g. test), set ETCD_INITIAL_CLUSTER value for this name, i.e. "test=http://..."
# initial list of cluster node addresses
ETCD_INITIAL_CLUSTER="node1=http://node1:2380,node2=http://node2:2380,etcd2=http://etcd2:2380"
# initial cluster state; new means a brand-new cluster
ETCD_INITIAL_CLUSTER_STATE="new"
# initial cluster token
ETCD_INITIAL_CLUSTER_TOKEN="mritd-etcd-cluster"
# client address advertised to clients
ETCD_ADVERTISE_CLIENT_URLS="http://node1:2379,http://node1:4001"

node2

# [member]
ETCD_NAME=node2
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
#ETCD_WAL_DIR=""
#ETCD_SNAPSHOT_COUNT="10000"
#ETCD_HEARTBEAT_INTERVAL="100"
#ETCD_ELECTION_TIMEOUT="1000"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379,http://0.0.0.0:4001"
#ETCD_MAX_SNAPSHOTS="5"
#ETCD_MAX_WALS="5"
#ETCD_CORS=""
#
#[cluster]

ETCD_INITIAL_ADVERTISE_PEER_URLS="http://node2:2380"
# if you use different ETCD_NAME (e.g. test), set ETCD_INITIAL_CLUSTER value for this name, i.e. "test=http://..."
ETCD_INITIAL_CLUSTER="node1=http://node1:2380,node2=http://node2:2380,etcd2=http://etcd2:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="mritd-etcd-cluster"
ETCD_ADVERTISE_CLIENT_URLS="http://node2:2379,http://node2:4001"

etcd2

# [member]
ETCD_NAME=etcd2
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
#ETCD_WAL_DIR=""
#ETCD_SNAPSHOT_COUNT="10000"
#ETCD_HEARTBEAT_INTERVAL="100"
#ETCD_ELECTION_TIMEOUT="1000"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379,http://0.0.0.0:4001"
#ETCD_MAX_SNAPSHOTS="5"
#ETCD_MAX_WALS="5"
#ETCD_CORS=""
#
#[cluster]
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://etcd2:2380"
# if you use different ETCD_NAME (e.g. test), set ETCD_INITIAL_CLUSTER value for this name, i.e. "test=http://..."
ETCD_INITIAL_CLUSTER="node1=http://node1:2380,node2=http://node2:2380,etcd2=http://etcd2:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="mritd-etcd-cluster"
ETCD_ADVERTISE_CLIENT_URLS="http://etcd2:2379,http://etcd2:4001"

With the configs in place, start the etcd service on each node:

# systemctl restart etcd
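Once all three nodes are up, check cluster health before writing any keys:

# every member should be reported healthy
etcdctl cluster-health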

IV. Test and verify

[root@localhost ~]# etcdctl set testdir/testkey0 0
0
[root@localhost ~]# etcdctl set testdir/testkey1 1
1
[root@localhost ~]# etcdctl set testdir/testkey2 2
2
[root@localhost ~]# etcdctl ls
/test
/testdir
[root@localhost ~]# etcdctl ls testdir
/testdir/testkey0
/testdir/testkey1
/testdir/testkey2
[root@localhost ~]# etcdctl get testdir/testkey2
2
[root@localhost ~]# etcdctl member list
377aa10974e8238d: name=node1 peerURLs=http://node1:2380 clientURLs=http://node1:2379,http://node1:4001 isLeader=true
9de2d4fdbbd835b6: name=etcd2 peerURLs=http://etcd2:2380 clientURLs=http://etcd2:2379,http://etcd2:4001 isLeader=false
f75ed833c7cbbe65: name=node2 peerURLs=http://node2:2380 clientURLs=http://node2:2379,http://node2:4001 isLeader=false