Documentation/etcd-mixin: Raise etcdHighNumberOfLeaderChanges threshold to 4

A cluster with three members could see three leader changes during a
healthy rolling reboot (in the worst case, each reboot takes out the
current leader and triggers a fresh election), and we don't want to
alert on that.  Raising the threshold to 4 reduces false alarms for
clusters with three or fewer members, which is probably most clusters.
It will also slightly increase the risk of false negatives, but a
cluster that is genuinely struggling with high latency seems likely to
pass the new threshold quickly anyway.

The hard-coded threshold means that we are still likely to get false
positives during rolling reboots of clusters with four or more
members.  Ideally we'd scale the threshold with the cluster size, but
I'm not sure how to do that.  Three members is the minimum size for
high availability, so reducing false positives for that case seems
worth addressing even if we leave larger clusters largely unchanged.
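
As a rough, untested sketch of what scaling with cluster size might
look like (this commit does not implement it), the threshold could be
compared against the number of members Prometheus currently sees,
e.g. by counting etcd_server_has_leader per job as a stand-in for
cluster size:

  # hypothetical: fire only when leader changes exceed the live member count
  increase((max by (job) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m])
    > count by (job) (etcd_server_has_leader{job=~".*etcd.*"})

That still wouldn't distinguish a rolling reboot from genuine
instability, but it would at least track the worst-case number of
elections a healthy rolling reboot can cause.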

Also manually bring etcd3_alert.rules up to speed, since it seems to
have been passed over by 16fc8a2b4b (Documentation/op-guide:
Re-generate alert rules and dashboard from mixin, 2020-04-07, #11768).
W. Trevor King, 2020-06-25 14:03:23 -07:00
commit 0c5cffc60b, parent 2b79442d8e
4 changed files with 5 additions and 5 deletions

@@ -57,7 +57,7 @@
{
alert: 'etcdHighNumberOfLeaderChanges',
expr: |||
-increase((max by (job) (etcd_server_leader_changes_seen_total{%(etcd_selector)s}) or 0*absent(etcd_server_leader_changes_seen_total{%(etcd_selector)s}))[15m:1m]) >= 3
+increase((max by (job) (etcd_server_leader_changes_seen_total{%(etcd_selector)s}) or 0*absent(etcd_server_leader_changes_seen_total{%(etcd_selector)s}))[15m:1m]) >= 4
||| % $._config,
'for': '5m',
labels: {

@@ -99,7 +99,7 @@ tests:
job: etcd
severity: warning
exp_annotations:
-message: 'etcd cluster "etcd": 3 leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated.'
+message: 'etcd cluster "etcd": 4 leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated.'
- interval: 1m
input_series:
- series: 'etcd_server_leader_changes_seen_total{job="etcd",instance="10.10.10.0"}'

@@ -29,13 +29,13 @@ ANNOTATIONS {
# alert if there are lots of leader changes
ALERT HighNumberOfLeaderChanges
-IF increase(etcd_server_leader_changes_seen_total{job="etcd"}[1h]) > 3
+IF increase((max by (job) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m]) >= 4
LABELS {
severity = "warning"
}
ANNOTATIONS {
summary = "a high number of leader changes within the etcd cluster are happening",
-description = "etcd instance {{ $labels.instance }} has seen {{ $value }} leader changes within the last hour",
+description = "etcd cluster \"{{ $labels.job }}\": {{ $value }} leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated.",
}
# gRPC request alerts

@@ -41,7 +41,7 @@ groups:
the last 15 minutes. Frequent elections may be a sign of insufficient resources,
high network latency, or disruptions by other components and should be investigated.'
expr: |
-increase((max by (job) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m]) >= 3
+increase((max by (job) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m]) >= 4
for: 5m
labels:
severity: warning