Mirror of https://github.com/etcd-io/etcd.git (synced 2024-09-27 06:25:44 +00:00)
Before this change, during a reboot in which etcd recovers quickly (e.g. within 1 min), the etcdMembersDown alert tended to fire even when etcd was fully healthy, because the averaging function can take more than 3 minutes to fall back below the 0.01 threshold. This change reduces the chance of such a false positive by using a shorter (1 min) failure rate window, which falls back below the threshold far more quickly (within 1 min). The `for` clause of the alert should ensure that the alert still fires if the poor conditions are sustained for an unreasonably long overall time (3 min).
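The effect of the window length can be illustrated with a rough sketch. The Python below is a simplified model and not PromQL's `rate()`: it assumes a hypothetical 60 s outage that produces one failure per second and no failures afterwards, and it computes the rate as failures in the window divided by the window length. Only the 0.01 threshold and the 1 min and 3 min durations come from the description above; the function name `seconds_until_below` and its parameters are illustrative.

```python
def seconds_until_below(window_s, outage_s=60, failures_per_s=1.0, threshold=0.01):
    """Return seconds after recovery until the windowed failure rate is below threshold.

    Simplified model (an assumption, not Prometheus behaviour): failures arrive
    at a constant rate during the outage and stop completely afterwards; the
    rate at time t is (failures in the last window_s seconds) / window_s.
    """
    # Simulate long enough for the whole outage to slide out of the window.
    horizon = outage_s + window_s + 1
    # Per-second failure counts: constant during the outage, zero afterwards.
    failures = [failures_per_s if t < outage_s else 0.0 for t in range(horizon)]
    for t in range(outage_s, horizon):
        start = max(0, t - window_s)
        rate = sum(failures[start:t]) / window_s
        if rate < threshold:
            return t - outage_s
    return None


if __name__ == "__main__":
    for window_s in (60, 180):
        print(window_s, seconds_until_below(window_s))
```

Under these assumptions the 60 s window drops below the threshold 60 s after recovery, while the 180 s window needs 179 s. Real recovery times depend on the actual failure rate and on scrape timing.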
Prometheus Monitoring Mixin for etcd
NOTE: This project is in the alpha stage. Flags, configuration, behaviour and design may change significantly in future releases.
A set of customisable Prometheus alerts for etcd.
Instructions for use are the same as the kubernetes-mixin.
Background
- For more information about monitoring mixins, see this design doc.
Testing alerts
Make sure jsonnet, gojsontoyaml and promtool are installed.
First compile the mixin to a YAML file, which promtool will read:
jsonnet -e '(import "mixin.libsonnet").prometheusAlerts' | gojsontoyaml > mixin.yaml
Then run the unit test:
promtool test rules test.yaml