From 22c0f518f7801190e2bee4828491ba53e83df663 Mon Sep 17 00:00:00 2001
From: Gyuho Lee
Date: Thu, 17 May 2018 13:04:44 -0700
Subject: [PATCH] Documentation/upgrades: improve server checklists with zap logger

Signed-off-by: Gyuho Lee
---
 Documentation/upgrades/upgrade_3_4.md | 242 ++++++++++++++++++-------
 Documentation/upgrades/upgrade_3_5.md | 243 ++++++++++++++++++++------
 2 files changed, 367 insertions(+), 118 deletions(-)

diff --git a/Documentation/upgrades/upgrade_3_4.md b/Documentation/upgrades/upgrade_3_4.md
index 6e4d936c8..90f24da10 100644
--- a/Documentation/upgrades/upgrade_3_4.md
+++ b/Documentation/upgrades/upgrade_3_4.md
@@ -6,6 +6,8 @@ In the general case, upgrading from etcd 3.3 to 3.4 can be a zero-downtime, roll

Before [starting an upgrade](#upgrade-procedure), read through the rest of this guide to prepare.

+
+
### Upgrade checklists

**NOTE:** When [migrating from v2 with no v3 data](https://github.com/coreos/etcd/issues/9480), etcd server v3.2+ panics when etcd restores from existing snapshots but there is no v3 `ETCD_DATA_DIR/member/snap/db` file. This happens when the server had migrated from v2 with no previous v3 data. This also prevents accidental v3 data loss (e.g. the `db` file might have been moved). etcd requires that post-v3-migration upgrades can only happen with v3 data. Do not upgrade to newer v3 versions until the v3.0 server contains v3 data.

@@ -140,6 +142,8 @@ curl -L http://localhost:2379/v3/kv/put \

Requests to `/v3beta` endpoints will redirect to `/v3`, and `/v3beta` will be removed in the 3.5 release.

+
+
### Server upgrade checklists

#### Upgrade requirements

@@ -152,7 +156,7 @@ Also, to ensure a smooth rolling upgrade, the running cluster must be healthy. C

Before upgrading etcd, always test the services relying on etcd in a staging environment before deploying the upgrade to the production environment.

-Before beginning, [backup the etcd data](../op-guide/maintenance.md#snapshot-backup). Should something go wrong with the upgrade, it is possible to use this backup to [downgrade](#downgrade) back to existing etcd version. Please note that the `snapshot` command only backs up the v3 data. For v2 data, see [backing up v2 datastore](../v2/admin_guide.md#backing-up-the-datastore).
+Before beginning, take a [snapshot backup](../op-guide/maintenance.md#snapshot-backup). Should something go wrong with the upgrade, it is possible to use this backup to [downgrade](#downgrade) back to the existing etcd version. Please note that the `snapshot` command only backs up the v3 data. For v2 data, see [backing up v2 datastore](../v2/admin_guide.md#backing-up-the-datastore).

#### Mixed versions

@@ -170,97 +174,215 @@ For a much larger total data size, 100MB or more , this one-time process might t

If all members have been upgraded to v3.4, the cluster will be upgraded to v3.4, and downgrade from this completed state is **not possible**. If any single member is still v3.3, however, the cluster and its operations remain "v3.3", and it is possible from this mixed cluster state to return to using a v3.3 etcd binary on all members.

-Please [backup the data directory](../op-guide/maintenance.md#snapshot-backup) of all etcd members to make downgrading the cluster possible even after it has been completely upgraded.
+Please take a [snapshot backup](../op-guide/maintenance.md#snapshot-backup) to make downgrading the cluster possible even after it has been completely upgraded.

### Upgrade procedure

This example shows how to upgrade a 3-member v3.3 etcd cluster running on a local machine.

-#### 1. 
Check upgrade requirements +#### Step 1: check upgrade requirements Is the cluster healthy and running v3.3.x? -``` -$ ETCDCTL_API=3 etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379 -localhost:2379 is healthy: successfully committed proposal: took = 6.600684ms -localhost:22379 is healthy: successfully committed proposal: took = 8.540064ms -localhost:32379 is healthy: successfully committed proposal: took = 8.763432ms +```bash +etcdctl --endpoints=localhost:2379,localhost:22379,localhost:32379 endpoint health +<} +WARNING: 2018/05/17 12:45:21 grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing +21.193589 I | raft: 7339c4e5e833c029 [term: 8] received a MsgVote message with higher term from 729934363faa4a24 [term: 9] +21.193626 I | raft: 7339c4e5e833c029 became follower at term 9 +21.193651 I | raft: 7339c4e5e833c029 [logterm: 8, index: 9, vote: 0] cast MsgVote for 729934363faa4a24 [logterm: 8, index: 9] at term 9 +21.193675 I | raft: raft.node: 7339c4e5e833c029 lost leader 7339c4e5e833c029 at term 9 +21.194424 I | raft: raft.node: 7339c4e5e833c029 elected leader 729934363faa4a24 at term 9 +21.292898 I | etcdserver: 7339c4e5e833c029 finished leadership transfer from 7339c4e5e833c029 to 729934363faa4a24 (took 100.436391ms) +21.292975 I | rafthttp: stopping peer 729934363faa4a24... +21.293206 I | rafthttp: closed the TCP streaming connection with peer 729934363faa4a24 (stream MsgApp v2 writer) +21.293225 I | rafthttp: stopped streaming with peer 729934363faa4a24 (writer) +21.293437 I | rafthttp: closed the TCP streaming connection with peer 729934363faa4a24 (stream Message writer) +21.293459 I | rafthttp: stopped streaming with peer 729934363faa4a24 (writer) +21.293514 I | rafthttp: stopped HTTP pipelining with peer 729934363faa4a24 +21.293590 W | rafthttp: lost the TCP streaming connection with peer 729934363faa4a24 (stream MsgApp v2 reader) +21.293610 I | rafthttp: stopped streaming with peer 729934363faa4a24 (stream MsgApp v2 reader) +21.293680 W | rafthttp: lost the TCP streaming connection with peer 729934363faa4a24 (stream Message reader) +21.293700 I | rafthttp: stopped streaming with peer 729934363faa4a24 (stream Message reader) +21.293711 I | rafthttp: stopped peer 729934363faa4a24 +21.293720 I | rafthttp: stopping peer b548c2511513015... 
+21.293987 I | rafthttp: closed the TCP streaming connection with peer b548c2511513015 (stream MsgApp v2 writer)
+21.294063 I | rafthttp: stopped streaming with peer b548c2511513015 (writer)
+21.294467 I | rafthttp: closed the TCP streaming connection with peer b548c2511513015 (stream Message writer)
+21.294561 I | rafthttp: stopped streaming with peer b548c2511513015 (writer)
+21.294742 I | rafthttp: stopped HTTP pipelining with peer b548c2511513015
+21.294867 W | rafthttp: lost the TCP streaming connection with peer b548c2511513015 (stream MsgApp v2 reader)
+21.294892 I | rafthttp: stopped streaming with peer b548c2511513015 (stream MsgApp v2 reader)
+21.294990 W | rafthttp: lost the TCP streaming connection with peer b548c2511513015 (stream Message reader)
+21.295004 E | rafthttp: failed to read b548c2511513015 on stream Message (context canceled)
+21.295013 I | rafthttp: peer b548c2511513015 became inactive
+21.295024 I | rafthttp: stopped streaming with peer b548c2511513015 (stream Message reader)
+21.295035 I | rafthttp: stopped peer b548c2511513015
```
-It's a good idea at this point to [backup the etcd data](../op-guide/maintenance.md#snapshot-backup) to provide a downgrade path should any problems occur:
+#### Step 4: restart the etcd server with the same configuration
-```
-$ etcdctl snapshot save backup.db
+Restart the etcd server with the same configuration but with the new etcd binary.
+
+```diff
+-etcd-old --name s1 \
++etcd-new --name s1 \
+ --data-dir /tmp/etcd/s1 \
+ --listen-client-urls http://localhost:2379 \
+ --advertise-client-urls http://localhost:2379 \
+ --listen-peer-urls http://localhost:2380 \
+ --initial-advertise-peer-urls http://localhost:2380 \
+ --initial-cluster s1=http://localhost:2380,s2=http://localhost:22380,s3=http://localhost:32380 \
+ --initial-cluster-token tkn \
++ --initial-cluster-state new \
++ --logger zap \
++ --log-outputs stderr
```
-#### 3. Drop-in etcd v3.4 binary and start the new etcd process
+The new v3.4 etcd will publish its information to the cluster. At this point, the cluster still operates using the v3.3 protocol, which is the lowest common version.
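For example, etcd's standard `/version` endpoint can confirm this from a client. This check is not part of the walkthrough above; it assumes the example member listening on `localhost:2379`, and the exact patch versions in the output will vary:

```bash
# Query the upgraded member for its own binary version and the cluster-wide version.
# While any member still runs v3.3, "etcdcluster" stays at 3.3.
curl -s http://localhost:2379/version
# expected output similar to: {"etcdserver":"3.4.0","etcdcluster":"3.3.0"}
```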
-The new v3.4 etcd will publish its information to the cluster: +> `{"level":"info","ts":1526586617.1647713,"caller":"membership/cluster.go:485","msg":"set initial cluster version","cluster-id":"7dee9ba76d59ed53","local-member-id":"7339c4e5e833c029","cluster-version":"3.0"}` -``` -14:14:25.363225 I | etcdserver: published {Name:s1 ClientURLs:[http://localhost:2379]} to cluster a9ededbffcb1b1f1 -``` +> `{"level":"info","ts":1526586617.1648536,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.0"}` + +> `{"level":"info","ts":1526586617.1649303,"caller":"membership/cluster.go:473","msg":"updated cluster version","cluster-id":"7dee9ba76d59ed53","local-member-id":"7339c4e5e833c029","from":"3.0","from":"3.3"}` + +> `{"level":"info","ts":1526586617.1649797,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.3"}` + +> `{"level":"info","ts":1526586617.2107732,"caller":"etcdserver/server.go:1770","msg":"published local member to cluster through raft","local-member-id":"7339c4e5e833c029","local-member-attributes":"{Name:s1 ClientURLs:[http://localhost:2379]}","request-path":"/0/members/7339c4e5e833c029/attributes","cluster-id":"7dee9ba76d59ed53","publish-timeout":7}` Verify that each member, and then the entire cluster, becomes healthy with the new v3.4 etcd binary: -``` -$ ETCDCTL_API=3 /etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379 -localhost:22379 is healthy: successfully committed proposal: took = 5.540129ms -localhost:32379 is healthy: successfully committed proposal: took = 7.321771ms -localhost:2379 is healthy: successfully committed proposal: took = 10.629901ms +```bash +etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379 +< `{"level":"info","ts":1526586949.0920913,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.4"}` +> `{"level":"info","ts":1526586949.0921566,"caller":"etcdserver/server.go:2272","msg":"cluster version is updated","cluster-version":"3.4"}` + +Member 2: + +> `{"level":"info","ts":1526586949.092117,"caller":"membership/cluster.go:473","msg":"updated cluster version","cluster-id":"7dee9ba76d59ed53","local-member-id":"729934363faa4a24","from":"3.3","from":"3.4"}` +> `{"level":"info","ts":1526586949.0923078,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.4"}` + +Member 3: + +> `{"level":"info","ts":1526586949.0921423,"caller":"membership/cluster.go:473","msg":"updated cluster version","cluster-id":"7dee9ba76d59ed53","local-member-id":"b548c2511513015","from":"3.3","from":"3.4"}` +> `{"level":"info","ts":1526586949.0922918,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.4"}` + + +```bash +endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379 +<127.0.0.1:32380: use of closed network connection"} +{"level":"info","ts":1526587299.1778402,"caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"7339c4e5e833c029","remote-peer-id":"b548c2511513015"} +{"level":"warn","ts":1526587299.1780295,"caller":"rafthttp/stream.go:436","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"7339c4e5e833c029","remote-peer-id":"b548c2511513015","error":"read tcp 127.0.0.1:34634->127.0.0.1:32380: use of closed network connection"} 
+{"level":"info","ts":1526587299.1780987,"caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"7339c4e5e833c029","remote-peer-id":"b548c2511513015"} +{"level":"info","ts":1526587299.1781602,"caller":"rafthttp/peer.go:340","msg":"stopped remote peer","remote-peer-id":"b548c2511513015"} +{"level":"info","ts":1526587299.1781986,"caller":"rafthttp/peer.go:333","msg":"stopping remote peer","remote-peer-id":"729934363faa4a24"} +{"level":"warn","ts":1526587299.1802843,"caller":"rafthttp/stream.go:291","msg":"closed TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","remote-peer-id":"729934363faa4a24"} +{"level":"warn","ts":1526587299.1803446,"caller":"rafthttp/stream.go:301","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","remote-peer-id":"729934363faa4a24"} +{"level":"warn","ts":1526587299.1824749,"caller":"rafthttp/stream.go:291","msg":"closed TCP streaming connection with remote peer","stream-writer-type":"stream Message","remote-peer-id":"729934363faa4a24"} +{"level":"warn","ts":1526587299.18255,"caller":"rafthttp/stream.go:301","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"stream Message","remote-peer-id":"729934363faa4a24"} +{"level":"info","ts":1526587299.18261,"caller":"rafthttp/pipeline.go:86","msg":"stopped HTTP pipelining with remote peer","local-member-id":"7339c4e5e833c029","remote-peer-id":"729934363faa4a24"} +{"level":"warn","ts":1526587299.1827736,"caller":"rafthttp/stream.go:436","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"7339c4e5e833c029","remote-peer-id":"729934363faa4a24","error":"read tcp 127.0.0.1:51482->127.0.0.1:22380: use of closed network connection"} +{"level":"info","ts":1526587299.182845,"caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"7339c4e5e833c029","remote-peer-id":"729934363faa4a24"} +{"level":"warn","ts":1526587299.1830168,"caller":"rafthttp/stream.go:436","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"7339c4e5e833c029","remote-peer-id":"729934363faa4a24","error":"context canceled"} +{"level":"warn","ts":1526587299.1831107,"caller":"rafthttp/peer_status.go:65","msg":"peer became inactive","peer-id":"729934363faa4a24","error":"failed to read 729934363faa4a24 on stream Message (context canceled)"} +{"level":"info","ts":1526587299.1831737,"caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"7339c4e5e833c029","remote-peer-id":"729934363faa4a24"} +{"level":"info","ts":1526587299.1832306,"caller":"rafthttp/peer.go:340","msg":"stopped remote peer","remote-peer-id":"729934363faa4a24"} +{"level":"warn","ts":1526587299.1837125,"caller":"rafthttp/http.go:424","msg":"failed to find remote peer in cluster","local-member-id":"7339c4e5e833c029","remote-peer-id-stream-handler":"7339c4e5e833c029","remote-peer-id-from":"b548c2511513015","cluster-id":"7dee9ba76d59ed53"} +{"level":"warn","ts":1526587299.1840093,"caller":"rafthttp/http.go:424","msg":"failed to find remote peer in cluster","local-member-id":"7339c4e5e833c029","remote-peer-id-stream-handler":"7339c4e5e833c029","remote-peer-id-from":"b548c2511513015","cluster-id":"7dee9ba76d59ed53"} 
+{"level":"warn","ts":1526587299.1842315,"caller":"rafthttp/http.go:424","msg":"failed to find remote peer in cluster","local-member-id":"7339c4e5e833c029","remote-peer-id-stream-handler":"7339c4e5e833c029","remote-peer-id-from":"729934363faa4a24","cluster-id":"7dee9ba76d59ed53"}
+{"level":"warn","ts":1526587299.1844475,"caller":"rafthttp/http.go:424","msg":"failed to find remote peer in cluster","local-member-id":"7339c4e5e833c029","remote-peer-id-stream-handler":"7339c4e5e833c029","remote-peer-id-from":"729934363faa4a24","cluster-id":"7dee9ba76d59ed53"}
+{"level":"info","ts":1526587299.2056687,"caller":"embed/etcd.go:473","msg":"stopping serving peer traffic","address":"127.0.0.1:2380"}
+{"level":"info","ts":1526587299.205819,"caller":"embed/etcd.go:480","msg":"stopped serving peer traffic","address":"127.0.0.1:2380"}
+{"level":"info","ts":1526587299.2058413,"caller":"embed/etcd.go:289","msg":"closed etcd server","name":"s1","data-dir":"/tmp/etcd/s1","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://localhost:2379"]}
```
-It's a good idea at this point to [backup the etcd data](../op-guide/maintenance.md#snapshot-backup) to provide a downgrade path should any problems occur:
+#### Step 4: restart the etcd server with the same configuration
-```
-$ etcdctl snapshot save backup.db
+Restart the etcd server with the same configuration but with the new etcd binary.
+
+```diff
+-etcd-old --name s1 \
++etcd-new --name s1 \
+ --data-dir /tmp/etcd/s1 \
+ --listen-client-urls http://localhost:2379 \
+ --advertise-client-urls http://localhost:2379 \
+ --listen-peer-urls http://localhost:2380 \
+ --initial-advertise-peer-urls http://localhost:2380 \
+ --initial-cluster s1=http://localhost:2380,s2=http://localhost:22380,s3=http://localhost:32380 \
+ --initial-cluster-token tkn \
+ --initial-cluster-state new
```
-#### 3. Drop-in etcd v3.5 binary and start the new etcd process
+The new v3.5 etcd will publish its information to the cluster. At this point, the cluster still operates using the v3.4 protocol, which is the lowest common version.
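To see which members are already running the new binary during the rolling upgrade, `etcdctl endpoint status` prints each member's server version. This is an illustrative check using the example endpoints from this guide, not a step shown in the walkthrough itself:

```bash
# Show each member's running server version (plus leader and raft state).
# A mix of 3.4 and 3.5 versions is expected until every member has been restarted.
etcdctl endpoint status \
  --endpoints=localhost:2379,localhost:22379,localhost:32379 \
  --write-out=table
```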
-The new v3.5 etcd will publish its information to the cluster: +> `{"level":"info","ts":1526586617.1647713,"caller":"membership/cluster.go:485","msg":"set initial cluster version","cluster-id":"7dee9ba76d59ed53","local-member-id":"7339c4e5e833c029","cluster-version":"3.0"}` -``` -14:14:25.363225 I | etcdserver: published {Name:s1 ClientURLs:[http://localhost:2379]} to cluster a9ededbffcb1b1f1 -``` +> `{"level":"info","ts":1526586617.1648536,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.0"}` + +> `{"level":"info","ts":1526586617.1649303,"caller":"membership/cluster.go:473","msg":"updated cluster version","cluster-id":"7dee9ba76d59ed53","local-member-id":"7339c4e5e833c029","from":"3.0","from":"3.4"}` + +> `{"level":"info","ts":1526586617.1649797,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.4"}` + +> `{"level":"info","ts":1526586617.2107732,"caller":"etcdserver/server.go:1770","msg":"published local member to cluster through raft","local-member-id":"7339c4e5e833c029","local-member-attributes":"{Name:s1 ClientURLs:[http://localhost:2379]}","request-path":"/0/members/7339c4e5e833c029/attributes","cluster-id":"7dee9ba76d59ed53","publish-timeout":7}` Verify that each member, and then the entire cluster, becomes healthy with the new v3.5 etcd binary: -``` -$ ETCDCTL_API=3 /etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379 -localhost:22379 is healthy: successfully committed proposal: took = 5.540129ms -localhost:32379 is healthy: successfully committed proposal: took = 7.321771ms -localhost:2379 is healthy: successfully committed proposal: took = 10.629901ms +```bash +etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379 +< `{"level":"info","ts":1526586949.0920913,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.5"}` +> `{"level":"info","ts":1526586949.0921566,"caller":"etcdserver/server.go:2272","msg":"cluster version is updated","cluster-version":"3.5"}` + +Member 2: + +> `{"level":"info","ts":1526586949.092117,"caller":"membership/cluster.go:473","msg":"updated cluster version","cluster-id":"7dee9ba76d59ed53","local-member-id":"729934363faa4a24","from":"3.4","from":"3.5"}` +> `{"level":"info","ts":1526586949.0923078,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.5"}` + +Member 3: + +> `{"level":"info","ts":1526586949.0921423,"caller":"membership/cluster.go:473","msg":"updated cluster version","cluster-id":"7dee9ba76d59ed53","local-member-id":"b548c2511513015","from":"3.4","from":"3.5"}` +> `{"level":"info","ts":1526586949.0922918,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.5"}` + + +```bash +endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379 +<