11 Commits

Author SHA1 Message Date
Clayton Coleman
465592a718 Documentation/etcd-mixin: Add an alert for down etcd members
An etcd member being down is an important failure state - while
normal admin operations may cause transient outages to rotate,
when any member is down the cluster is operating in a degraded
fashion. Add an alert that records when any members are down
so that administrators know whether the next failure is fatal.

The rule is more complicated than `up{...} == 0` because not all
failure modes for etcd may have an `up{...}` entry for each member.
For instance, a Kubernetes service in front of an etcd cluster
might only have 2 endpoints recorded in `up` because the third
pod is evicted by the kubelet - the cluster is degraded but
`count(up{...})` would not return the full quorum size. Instead,
use network peer send failures as a failure detector and attempt
to return the max of down services or failing peers. We may
undercount the number of total failures, but we will at least
alert that a member is down.
2019-07-30 14:39:50 -04:00
paulfantom
886d30d223 Documentation: provide better user experience with autorefreshing grafana dashboard 2019-05-08 06:58:28 -04:00
Povilas Versockas
eb8e94c4ed etcd-mixin: Improve etcdHighNumberOfLeaderChanges,etcdHighNumberOfFailedProposals message
Currently alert messages state that we detect issue
within the last 1 hour, although we check
for last 15min and wait for 15min for this alert to keep firing.
This fix changes the message to be 30minutes.
2019-02-04 09:28:23 +02:00
Dmitry Verkhoturov
0929080834 doc: exclude 404 error because kubelet generating false positive 2018-12-17 11:57:12 +03:00
Dmitry Verkhoturov
830d064903 doc: convert etcd to lower-case everywhere 2018-12-17 11:57:12 +03:00
Dmitry Verkhoturov
358cc1a8fa doc: sync prometheus rules with prometheus-operator version
(and remove non-etcd specific FdExhaustionClose)
https://github.com/coreos/prometheus-operator/blob/master/helm/exporter-kube-etcd/templates/etcd3.rules.yaml
sync etcd alert rules with libsonnet

Signed-off-by: Dmitry Verkhoturov <paskal.07@gmail.com>
2018-12-17 11:57:12 +03:00
Christian Beneke
c75ba98f81 Documentation/etcd-mixin: Fix EtcdInsufficientMembers alerting
Currently the EtcdInsufficientMembers alert fires, when more than (X/2)-1
instances are unavailable. This fixes it to fire at the correct limit of (X-1)/2
unavailable instances and $value now contains the number of available instances
instead of unavailable ones. Added unit test for EtcdInsufficientMembers alert.
2018-10-15 19:23:43 +02:00
Joonyoung Park
bd74c10fdb Documentation/etcd-mixin: fix typo in README.md
Promethues -> Prometheus
2018-07-19 19:10:46 +09:00
Joshua Olson
3826107af6 Documentation: removing alerts that were specific to etcd v2 2018-07-18 12:31:46 -04:00
Frederic Branczyk
778bfe1c82
Documentation: Add Grafana dashboard to etcd monitoring mixin 2018-05-30 13:42:36 +02:00
Tom Wilkie
13d4e1509b Documentation: add Prometheus monitoring-mixin
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2018-05-29 09:52:40 -07:00