Mirroristas/etcd

mirror of https://github.com/etcd-io/etcd.git synced 2024-09-27 06:25:44 +00:00

Commit Graph

Author

SHA1

Message

Date

Clayton Coleman

465592a718

Documentation/etcd-mixin: Add an alert for down etcd members

An etcd member being down is an important failure state - while
normal admin operations such as rotation may cause transient outages,
when any member is down the cluster is operating in a degraded
fashion. Add an alert that records when any members are down
so that administrators know whether the next failure is fatal.

The rule is more complicated than `up{...} == 0` because not all
failure modes for etcd may have an `up{...}` entry for each member.
For instance, a Kubernetes service in front of an etcd cluster
might only have 2 endpoints recorded in `up` because the third
pod is evicted by the kubelet - the cluster is degraded but
`count(up{...})` would not return the full quorum size. Instead,
use network peer send failures as a failure detector and attempt
to return the max of down services or failing peers. We may
undercount the number of total failures, but we will at least
alert that a member is down.

2019-07-30 14:39:50 -04:00
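
The detection idea in this commit message can be sketched in a few lines of Python. This is an illustration only, not the PromQL rule that the commit adds: the function name, the input dictionaries, and the `threshold` value are all hypothetical. It takes the larger of two counts, the members whose scrape reports down and the peers that other members fail to send to, so that a member with no `up` entry is still noticed.

```python
def estimate_down_members(up, peer_send_failure_rate, threshold=0.01):
    """Estimate how many etcd members are down (hypothetical helper).

    up: maps member name -> 1 (scrape succeeded) or 0 (scrape failed).
        An evicted member may have no entry at all.
    peer_send_failure_rate: maps target peer name -> rate of failed
        network sends to that peer, as seen by the other members.
    threshold: illustrative cutoff above which a peer counts as failing.
    """
    down_by_scrape = sum(1 for value in up.values() if value == 0)
    failing_peers = sum(
        1 for rate in peer_send_failure_rate.values() if rate > threshold
    )
    # Either signal alone may undercount, so report the larger of the two.
    return max(down_by_scrape, failing_peers)


# The third pod was evicted: it has no `up` entry, but peers fail to reach it.
up = {"etcd-0": 1, "etcd-1": 1}
failures = {"etcd-2": 0.5}
down = estimate_down_members(up, failures)
```

In this scenario counting `up == 0` alone would report zero down members, while the peer-failure signal still reports one. As the message notes, the combined value may still undercount when both signals miss a member.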

Christian Beneke

c75ba98f81

Documentation/etcd-mixin: Fix EtcdInsufficientMembers alerting

Currently the EtcdInsufficientMembers alert fires when more than (X/2)-1
instances are unavailable. This fixes it to fire at the correct limit of (X-1)/2
unavailable instances, and $value now contains the number of available instances
instead of unavailable ones. Added unit test for EtcdInsufficientMembers alert.

2018-10-15 19:23:43 +02:00
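
The arithmetic behind this fix can be checked with a short Python sketch. It assumes the message means "fire once more than (X-1)/2 of X members are unavailable", which is the point at which a majority quorum is lost. The function names are hypothetical and the code is not the alert expression itself.

```python
def old_rule_fires(total, unavailable):
    # Threshold described as the previous behaviour: more than (X/2)-1 down.
    return unavailable > total / 2 - 1


def fixed_rule_fires(total, unavailable):
    # Threshold described as the fix: more than (X-1)/2 down.
    return unavailable > (total - 1) / 2


def has_quorum(total, unavailable):
    # A majority quorum needs floor(X/2) + 1 available members.
    return (total - unavailable) >= total // 2 + 1


# With 3 members and 1 down, 2 members remain and quorum still holds.
premature = old_rule_fires(3, 1) and has_quorum(3, 1)
on_time = not fixed_rule_fires(3, 1) and fixed_rule_fires(3, 2)
```

Under this reading the old threshold fires one failure too early for odd cluster sizes such as 3 or 5, while the fixed threshold fires exactly when quorum is lost for every cluster size.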

2 Commits