
A down etcd member is an important failure state. Normal admin operations such as rotation may cause transient outages, but whenever any member is down the cluster is operating in a degraded fashion. Add an alert that records when any members are down so that administrators know whether the next failure would be fatal. The rule is more complicated than `up{...} == 0` because not every etcd failure mode leaves an `up{...}` entry for each member. For instance, a Kubernetes service in front of an etcd cluster might have only 2 endpoints recorded in `up` because the third pod was evicted by the kubelet: the cluster is degraded, but `count(up{...})` would not return the full cluster size. Instead, use network peer send failures as a failure detector and attempt to return the max of down services or failing peers. This may undercount the total number of failures, but it will at least alert that a member is down.
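The "max of down services or failing peers" idea can be illustrated with a small Python sketch. This is not the alert rule itself, only a model of its logic under stated assumptions: the `PeerSendStats` type, the `members_down` helper, the member names, and the `0.01` failures-per-second threshold are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PeerSendStats:
    """Hypothetical sample: rate of failed sends from one member to one peer."""
    sender: str
    to: str
    failure_rate: float  # failed sends per second (assumed unit)


def members_down(up, peer_stats, failure_threshold=0.01):
    """Estimate how many members are down.

    up: dict of member name -> bool; members may be missing entirely
        (for example, an evicted pod has no `up` entry at all).
    peer_stats: iterable of PeerSendStats reported by the reachable members.
    failure_threshold: assumed cutoff above which a peer counts as failing.
    """
    # Signal 1: members that are scraped but report as down.
    down_by_up = sum(1 for healthy in up.values() if not healthy)
    # Signal 2: distinct peers that other members fail to send to.
    failing_peers = {s.to for s in peer_stats if s.failure_rate > failure_threshold}
    # Take the max rather than the sum: both signals may describe the same
    # member, so this can undercount but never double-counts.
    return max(down_by_up, len(failing_peers))


# The third member was evicted, so it has no `up` entry at all.
up = {"etcd-0": True, "etcd-1": True}
stats = [
    PeerSendStats("etcd-0", "etcd-2", 0.5),
    PeerSendStats("etcd-1", "etcd-2", 0.4),
    PeerSendStats("etcd-0", "etcd-1", 0.0),
]
print(members_down(up, stats))  # prints 1
```

In this example a plain `up == 0` check would report nothing, because no scraped member is down, while the peer send failures still reveal that one member is unreachable.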
Documentation
etcd is a distributed key-value store designed to reliably and quickly preserve and provide access to critical data. It enables reliable distributed coordination through distributed locking, leader elections, and write barriers. An etcd cluster is intended for high availability and permanent data storage and retrieval.
Getting started
New etcd users and developers should get started by downloading and building etcd. After getting etcd, follow this quick demo to see the basics of creating and working with an etcd cluster.
Developing with etcd
The easiest way to get started using etcd as a distributed key-value store is to set up a local cluster.
- Setting up local clusters
- Interacting with etcd
- gRPC etcd core and etcd concurrency API references
- HTTP JSON API through the gRPC gateway
- gRPC naming and discovery
- Client and proxy namespacing
- Embedding etcd
- Experimental features and APIs
- System limits
Operating etcd clusters
Administrators who need a fault-tolerant etcd cluster for either development or production should begin with a cluster on multiple machines.
Setting up etcd
System configuration
Platform guides
Security
Maintenance and troubleshooting
Learning
To learn more about the concepts and internals behind etcd, read the following pages: