From 069578c29c61ef6dfb8ae64551183765a40307a2 Mon Sep 17 00:00:00 2001
From: Xiang Li
Date: Mon, 8 Dec 2014 13:58:59 -0800
Subject: [PATCH] doc: add doc for member migration

---
 Documentation/0.5/admin_guide.md           | 13 +++++++++++++
 Documentation/0.5/runtime-configuration.md |  8 ++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/Documentation/0.5/admin_guide.md b/Documentation/0.5/admin_guide.md
index b7d09a620..307e3abce 100644
--- a/Documentation/0.5/admin_guide.md
+++ b/Documentation/0.5/admin_guide.md
@@ -32,6 +32,19 @@ The data directory has two sub-directories in it:
 
 If you are spinning up multiple clusters for testing it is recommended that you specify a unique initial-cluster-token for the different clusters. This can protect you from cluster corruption in case of mis-configuration because two members started with different cluster tokens will refuse members from each other.
 
+### Member Migration
+
+When a machine is scheduled for maintenance or retirement, you might want to migrate an etcd member to another machine without losing the data or changing the member ID.
+
+The data directory contains all the data needed to recover a member to its point-in-time state. To migrate a member:
+
+* Stop the member process
+* Copy the data directory of the now-idle member to the new machine
+* Update the peer URLs for that member to reflect the new machine according to the [member API][change peer url]
+* Start etcd on the new machine, using the same configuration and the copy of the data directory
+
+[change peer url]: https://github.com/coreos/etcd/blob/master/Documentation/0.5/other_apis.md#change-the-peer-urls-of-a-member
+
 ### Disaster Recovery
 
 etcd is designed to be resilient to machine failures. An etcd cluster can automatically recover from any number of temporary failures (for example, machine reboots), and a cluster of N members can tolerate up to _(N/2)-1_ permanent failures (where a member can no longer access the cluster, due to hardware failure or disk corruption). However, in extreme circumstances, a cluster might permanently lose enough members such that quorum is irrevocably lost. For example, if a three-node cluster suffered two simultaneous and unrecoverable machine failures, it would be normally impossible for the cluster to restore quorum and continue functioning.
diff --git a/Documentation/0.5/runtime-configuration.md b/Documentation/0.5/runtime-configuration.md
index 951a0fad2..51646fedc 100644
--- a/Documentation/0.5/runtime-configuration.md
+++ b/Documentation/0.5/runtime-configuration.md
@@ -6,12 +6,16 @@ etcd comes with support for incremental runtime reconfiguration, which allows us
 
 Let us walk through the four use cases for re-configuring a cluster: replacing a member, increasing or decreasing cluster size, and restarting a cluster from a majority failure.
 
-### Replace a Member
+### Replace a Non-recoverable Member
 
-The most common use case of cluster reconfiguration is to replace a member because of a permanent failure of the existing member: for example, hardware failure, loss of network address, or data directory corruption.
+The most common use case of cluster reconfiguration is to replace a member because of a permanent failure of the existing member: for example, hardware failure or data directory corruption. It is important to replace failed members as soon as the failure is detected. If the cluster falls below a simple majority of members it can no longer accept writes: for example, in a 3-member cluster the loss of two members will cause writes to fail and the cluster to stop operating.
 
+If you want to migrate a running member to another machine, please refer to the [member migration section][member migration].
+
+[member migration]: https://github.com/coreos/etcd/blob/master/Documentation/0.5/admin_guide.md#member-migration
+
 ### Increase Cluster Size
 
 To make your cluster more resilient to machine failure you can increase the size of the cluster.
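
The peer-URL update in the migration steps above goes through the members API linked from the patch (`other_apis.md#change-the-peer-urls-of-a-member`), which describes a `PUT /v2/members/<id>` request carrying the member's new `peerURLs`. A minimal Go sketch of that call is below; the member ID, client endpoint, and new peer URL are placeholder values for illustration, not taken from the patch.

```go
// Minimal sketch: change the peer URLs of a member being migrated, using the
// members API referenced in the migration steps (PUT /v2/members/<id> with a
// JSON body listing the new peerURLs). All IDs and URLs below are placeholders.
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	memberID := "272e204152"                                 // hypothetical member ID, as reported by GET /v2/members
	clientURL := "http://10.0.1.10:2379"                     // client URL of any healthy member
	newPeerURLs := `{"peerURLs": ["http://10.0.1.42:2380"]}` // peer URL of the member on its new machine

	req, err := http.NewRequest(http.MethodPut,
		clientURL+"/v2/members/"+memberID,
		bytes.NewBufferString(newPeerURLs))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// The linked other_apis.md documents an empty response body on success.
	fmt.Println("update peer URLs:", resp.Status)
}
```

This mirrors the third bullet of the migration steps: the request is sent to a healthy member while the migrated member is stopped, and the member is then started on the new machine with its copied data directory.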