diff --git a/Documentation/2.0/admin_guide.md b/Documentation/2.0/admin_guide.md
index 798d4dcc0..48009c3c3 100644
--- a/Documentation/2.0/admin_guide.md
+++ b/Documentation/2.0/admin_guide.md
@@ -27,11 +27,34 @@ The data directory has two sub-directories in it:
 
 [wal-pkg]: http://godoc.org/github.com/coreos/etcd/wal
 [snap-pkg]: http://godoc.org/github.com/coreos/etcd/snap
 
-### Cluster Lifecycle
+### Cluster Management
+
+#### Lifecycle
 
 If you are spinning up multiple clusters for testing it is recommended that you specify a unique initial-cluster-token for the different clusters. This can protect you from cluster corruption in case of mis-configuration because two members started with different cluster tokens will refuse members from each other.
 
+#### Optimal Cluster Size
+
+The recommended etcd cluster size is 3, 5 or 7, depending on how many member failures the cluster must tolerate. A 7-member cluster provides enough fault tolerance in most cases. While a larger cluster provides better fault tolerance, its write performance is lower because data needs to be replicated to more machines.
+
+#### Fault Tolerance Table
+
+It is recommended to have an odd number of members in a cluster. Growing an even-sized cluster by one member doesn't change the number needed for a majority, but the extra member raises the number of failures the cluster can tolerate. You can see this by comparing even-sized and odd-sized clusters:
+
+| Cluster Size | Majority   | Failure Tolerance |
+|--------------|------------|-------------------|
+| 1            | 1          | 0                 |
+| 3            | 2          | 1                 |
+| 4            | 3          | 1                 |
+| 5            | 3          | **2**             |
+| 6            | 4          | 2                 |
+| 7            | 4          | **3**             |
+| 8            | 5          | 3                 |
+| 9            | 5          | **4**             |
+
+As you can see, adding another member to bring the cluster up to an odd size is always worth it. With an odd number of members, a network partition that splits the cluster in two can never produce equal halves, so one side always holds a majority that can continue to operate and be the source of truth when the partition ends.
+
 ### Member Migration
 
 When there is a scheduled machine maintenance or retirement, you might want to migrate an etcd member to another machine without losing the data and changing the member ID.
diff --git a/Documentation/optimal-cluster-size.md b/Documentation/optimal-cluster-size.md
deleted file mode 100644
index 2aa7c95bb..000000000
--- a/Documentation/optimal-cluster-size.md
+++ /dev/null
@@ -1,38 +0,0 @@
-# Optimal etcd Cluster Size
-
-etcd's Raft consensus algorithm is most efficient in small clusters between 3 and 9 peers. For clusters larger than 9, etcd will select a subset of instances to participate in the algorithm in order to keep it efficient. The end of this document briefly explores how etcd works internally and why these choices have been made.
-
-## Cluster Management
-
-You can manage the active cluster size through the [cluster config API](https://github.com/coreos/etcd/blob/master/Documentation/api.md#cluster-config). `activeSize` represents the etcd peers allowed to actively participate in the consensus algorithm.
-
-If the total number of etcd instances exceeds this number, additional peers are started as [standbys](https://github.com/coreos/etcd/blob/master/Documentation/design/standbys.md), which can be promoted to active participation if one of the existing active instances has failed or been removed.
-
-## Internals of etcd
-
-### Writing to etcd
-
-Writes to an etcd peer are always redirected to the leader of the cluster and distributed to all of the peers immediately. A write is only considered successful when a majority of the peers acknowledge the write.
-
-For example, in a cluster with 5 peers, a write operation is only as fast as the 3rd fastest machine. This is the main reason for keeping the number of active peers below 9. In practice, you only need to worry about write performance in high latency environments such as a cluster spanning multiple data centers.
-
-### Leader Election
-
-The leader election process is similar to writing a key — a majority of the active peers must acknowledge the new leader before cluster operations can continue. The longer each peer takes to elect a new leader means you have to wait longer before you can write to the cluster again. In low latency environments this process takes milliseconds.
-
-### Odd Active Cluster Size
-
-The other important cluster optimization is to always have an odd active cluster size (i.e. `activeSize`). Adding an odd node to the number of peers doesn't change the size of the majority and therefore doesn't increase the total latency of the majority as described above. But, you gain a higher tolerance for peer failure by adding the extra machine. You can see this in practice when comparing two even and odd sized clusters:
-
-| Active Peers | Majority   | Failure Tolerance |
-|--------------|------------|-------------------|
-| 1 peers      | 1 peers    | None              |
-| 3 peers      | 2 peers    | 1 peer            |
-| 4 peers      | 3 peers    | 1 peer            |
-| 5 peers      | 3 peers    | **2 peers**       |
-| 6 peers      | 4 peers    | 2 peers           |
-| 7 peers      | 4 peers    | **3 peers**       |
-| 8 peers      | 5 peers    | 3 peers           |
-| 9 peers      | 5 peers    | **4 peers**       |
-
-As you can see, adding another peer to bring the number of active peers up to an odd size is always worth it. During a network partition, an odd number of active peers also guarantees that there will almost always be a majority of the cluster that can continue to operate and be the source of truth when the partition ends.
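
The Majority and Failure Tolerance columns in both tables of this diff follow from simple integer arithmetic: a majority is one more than half the cluster size (rounded down), and the failure tolerance is whatever is left over. The Go sketch below reproduces those columns. The function names `majority` and `failureTolerance` are hypothetical helpers chosen for illustration and are not part of etcd.

```go
package main

import "fmt"

// majority returns the smallest number of members that forms a quorum
// in a cluster of n members: floor(n/2) + 1.
// Hypothetical helper for illustration; not an etcd API.
func majority(n int) int {
	return n/2 + 1
}

// failureTolerance returns how many members can fail while the
// remaining members still form a majority.
// Hypothetical helper for illustration; not an etcd API.
func failureTolerance(n int) int {
	return n - majority(n)
}

func main() {
	fmt.Println("size majority tolerance")
	// Same cluster sizes as the rows of the fault tolerance table.
	for _, n := range []int{1, 3, 4, 5, 6, 7, 8, 9} {
		fmt.Printf("%4d %8d %9d\n", n, majority(n), failureTolerance(n))
	}
}
```

Comparing the rows for 4 and 5 members shows the point the documentation makes: both need 3 members for a majority, but only the 5-member cluster survives two failures.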