Commit Graph

17 Commits

Author SHA1 Message Date
Björn Rabenstein
c9a5889915 Documentation/etcd-mixin: Reformulate alerting rules to use without rather than by (#12122)
* etcd-mixin: Reformulate alerting rules to use `without` rather than `by`

With aggregations using `by`, all additional target labels that a user
might have configured are aggregated away. However, those target
labels are useful, e.g. for alert routing. With this commit, nothing
should change for the vanilla job/instance target labels, but users
with additional target labels can now still make use of them.

Signed-off-by: beorn7 <beorn@grafana.com>

* etcd-mixin: Parametrize instance labels to aggregate away

Signed-off-by: beorn7 <beorn@grafana.com>
2020-07-23 16:02:26 -07:00
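For illustration, a hedged sketch of the difference (the recording-rule names and the `cluster` label below are examples, not the mixin's actual expressions): `by` keeps only the listed labels, while `without` drops only the listed labels and preserves everything else.

```yaml
# Illustrative recording rules, not taken from the mixin.
groups:
- name: example
  rules:
  # `by (job)` keeps only `job`; an extra target label such as `cluster`
  # is aggregated away and unavailable for alert routing.
  - record: job:etcd_network_peer_sent_failures:rate5m_by_job
    expr: sum by (job) (rate(etcd_network_peer_sent_failures_total[5m]))
  # `without (instance)` drops only `instance`; `job`, `cluster`, and any
  # other target labels survive on the result.
  - record: job:etcd_network_peer_sent_failures:rate5m
    expr: sum without (instance) (rate(etcd_network_peer_sent_failures_total[5m]))
```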
Dan Mace
2aa5684ada Documentation: Tweak etcdMembersDown to reduce false positives
Before this change, during a reboot in which etcd recovers quickly (e.g. 1 min),
the etcdMembersDown alert tends to fire even when etcd is fully healthy because
the averaging function can take more than 3 minutes to average back down below
the 0.01 threshold.

This change reduces the chance of such a false positive by using a
shorter (1 min) failure rate window, which averages back down below the
threshold far more quickly (within 1 min). The `for` clause of the alert should
ensure that the alert still fires if the poor conditions are sustained for an
unreasonable overall time (3 min).
2020-07-13 08:58:21 -04:00
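A sketch of the shape this leaves the rule in (selector, window, and threshold are illustrative, and the real expression also folds in an `up == 0` check):

```yaml
- alert: etcdMembersDown
  # Peer send failures averaged over a short 1m window fall back under the
  # 0.01 threshold quickly once a rebooted member recovers, while `for: 3m`
  # still requires the bad condition to be sustained before firing.
  expr: |
    count without (To) (
      sum without (instance) (
        rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[1m])
      ) > 0.01
    ) > 0
  for: 3m
  labels:
    severity: critical
```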
W. Trevor King
0c5cffc60b Documentation/etcd-mixin: Raise etcdHighNumberOfLeaderChanges threshold to 4
A cluster with three members could see three leader changes during a
healthy rolling reboot, and we don't want to alert on that.  Growing
to 4 reduces false-alarms for clusters with three or fewer members,
and that's probably most clusters.  It will also slightly increase the
risk of false-negatives, but if the cluster is struggling with high
latency, it seems likely that it would quickly pass the new threshold
too.

The hard-coded threshold means that we are still likely to get
false-positives during rolling reboots of clusters with four or more
members.  Ideally we'd scale this with the cluster size, or something,
but I'm not sure how to do that.  Three members is the minimum size
for high availability, so reducing false positives for that case seems
worth addressing even if we leave larger clusters largely unchanged.

Also manually bring etcd3_alert.rules up to date, since it seems to
have been passed over by 16fc8a2b4b (Documentation/op-guide:
Re-generate alert rules and dashboard from mixin, 2020-04-07, #11768).
2020-06-25 15:38:15 -07:00
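For context, the rule in question is roughly of this shape (a hedged sketch; the real rule also has to cope with counter resets and absent series):

```yaml
- alert: etcdHighNumberOfLeaderChanges
  # A healthy rolling reboot of a 3-member cluster can legitimately produce
  # 3 leader changes within 15m, so only 4 or more changes should alert.
  expr: |
    increase(
      max without (instance) (
        etcd_server_leader_changes_seen_total{job=~".*etcd.*"}
      )[15m:1m]
    ) >= 4
  for: 5m
  labels:
    severity: warning
```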
Frederic Branczyk
2c4877064e Documentation/etcd-mixin: Use etcd_mvcc_db_total_size_in_bytes metric 2020-04-07 18:14:23 +02:00
Frederic Branczyk
68c5f6066f Documentation/etcd-mixin: Set unique UID for Grafana dashboard 2020-04-07 18:13:41 +02:00
Clayton Coleman
322c38e169 Documentation/etcd-mixin: Fix etcdHighNumberOfLeaderChanges (#11448)
The `etcdHighNumberOfLeaderChanges` alert had a copy and paste
error when it was converted from docs to mixin in 10244 - we moved
from "increase over 15m > 3" to "rate over 15m > 3" which is not
the same (rate is measured per second, so it should have been
"rate over 15m > (3 / 60 / 15)").  As part of fixing that, we
need to capture when prometheus starts or when new etcd clusters
are captured with a high leader change - i.e. if you start a new
etcd cluster and at the moment prometheus first scrapes you are
already at 5 leader changes, we should fire on that transition.

This alert is also now more responsive, so if you get a quick
burst of 3 leader changes we'll alert within 5m rather than 15m.
2019-12-13 16:00:11 -08:00
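In other words (illustrative expressions, not the exact rule text):

```yaml
# "more than 3 changes in 15 minutes" expressed with increase():
#   increase(etcd_server_leader_changes_seen_total[15m]) > 3
# rate() is per second, so the same threshold expressed as a rate is
# 3 changes / (15 * 60) seconds:
#   rate(etcd_server_leader_changes_seen_total[15m]) > 3 / 60 / 15
# The copy-and-paste error kept the "> 3", which as a per-second rate would
# only fire after roughly 2700 leader changes in 15 minutes.
```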
Clayton Coleman
465592a718 Documentation/etcd-mixin: Add an alert for down etcd members
An etcd member being down is an important failure state - while
normal admin operations may cause transient outages as members
rotate, whenever any member is down the cluster is operating in a
degraded fashion. Add an alert that records when any members are down
so that administrators know whether the next failure is fatal.

The rule is more complicated than `up{...} == 0` because not all
failure modes for etcd may have an `up{...}` entry for each member.
For instance, a Kubernetes service in front of an etcd cluster
might only have 2 endpoints recorded in `up` because the third
pod is evicted by the kubelet - the cluster is degraded but
`count(up{...})` would not return the full quorum size. Instead,
use network peer send failures as a failure detector and attempt
to return the max of down services or failing peers. We may
undercount the number of total failures, but we will at least
alert that a member is down.
2019-07-30 14:39:50 -04:00
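A hedged sketch of the resulting expression (label matchers, window, and threshold are illustrative):

```yaml
- alert: etcdMembersDown
  # Two signals combined: targets that are scraped but report down, and
  # members whose peers see a sustained send-failure rate (which also covers
  # members that have no `up` series at all, e.g. an evicted pod behind a
  # Kubernetes Service). Either signal alone is enough to fire.
  expr: |
    max without (endpoint) (
      sum without (instance) (up{job=~".*etcd.*"} == bool 0)
    or
      count without (To) (
        sum without (instance) (
          rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[3m])
        ) > 0.01
      )
    ) > 0
  for: 3m
  labels:
    severity: critical
```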
paulfantom
886d30d223 Documentation: provide better user experience with autorefreshing grafana dashboard 2019-05-08 06:58:28 -04:00
Povilas Versockas
eb8e94c4ed etcd-mixin: Improve etcdHighNumberOfLeaderChanges, etcdHighNumberOfFailedProposals messages
Currently the alert messages state that we detect the issue within
the last 1 hour, although the expression only looks at the last 15 min
and the alert then has to keep firing for another 15 min.
This fix changes the messages to say 30 minutes.
2019-02-04 09:28:23 +02:00
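Illustrative sketch of the timing being described (expression and threshold are examples, not the exact rule):

```yaml
- alert: etcdHighNumberOfFailedProposals
  # The expression looks back 15m and `for` holds the alert pending for
  # another 15m, so the message should talk about roughly 30 minutes.
  expr: rate(etcd_server_proposals_failed_total{job=~".*etcd.*"}[15m]) > 5
  for: 15m
  annotations:
    message: 'etcd instance {{ $labels.instance }} has seen {{ $value }} proposal failures within the last 30 minutes.'
```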
Dmitry Verkhoturov
0929080834 doc: exclude 404 errors because the kubelet generates false positives 2018-12-17 11:57:12 +03:00
Dmitry Verkhoturov
830d064903 doc: convert etcd to lower-case everywhere 2018-12-17 11:57:12 +03:00
Dmitry Verkhoturov
358cc1a8fa doc: sync prometheus rules with prometheus-operator version
(and remove the non-etcd-specific FdExhaustionClose rule)
https://github.com/coreos/prometheus-operator/blob/master/helm/exporter-kube-etcd/templates/etcd3.rules.yaml
Sync the etcd alert rules with libsonnet.

Signed-off-by: Dmitry Verkhoturov <paskal.07@gmail.com>
2018-12-17 11:57:12 +03:00
Christian Beneke
c75ba98f81 Documentation/etcd-mixin: Fix EtcdInsufficientMembers alerting
Currently the EtcdInsufficientMembers alert fires when more than (X/2)-1
instances are unavailable. This fixes it to fire at the correct limit of more
than (X-1)/2 unavailable instances, and $value now contains the number of
available instances instead of unavailable ones. Also adds a unit test for the
EtcdInsufficientMembers alert.
2018-10-15 19:23:43 +02:00
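The quorum arithmetic, as a hedged sketch (selectors are illustrative):

```yaml
# A cluster of X members needs floor(X/2) + 1 of them up to keep quorum,
# so the alert should fire once fewer than (X + 1) / 2 members are available,
# i.e. once more than (X - 1) / 2 members are unavailable.
- alert: etcdInsufficientMembers
  expr: |
    sum by (job) (up{job=~".*etcd.*"} == bool 1)
      < (count by (job) (up{job=~".*etcd.*"}) + 1) / 2
  for: 3m
  labels:
    severity: critical
# Example, X = 3: fires when fewer than (3 + 1) / 2 = 2 members are up, i.e.
# only once 2 of the 3 members are down - not already at the first failure.
```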
Joonyoung Park
bd74c10fdb Documentation/etcd-mixin: fix typo in README.md
Promethues -> Prometheus
2018-07-19 19:10:46 +09:00
Joshua Olson
3826107af6 Documentation: removing alerts that were specific to etcd v2 2018-07-18 12:31:46 -04:00
Frederic Branczyk
778bfe1c82 Documentation: Add Grafana dashboard to etcd monitoring mixin 2018-05-30 13:42:36 +02:00
Tom Wilkie
13d4e1509b Documentation: add Prometheus monitoring-mixin
Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>
2018-05-29 09:52:40 -07:00