The `etcdHighNumberOfLeaderChanges` alert picked up a copy-and-paste
error when it was converted from docs to mixin in 10244: we moved
from "increase over 15m > 3" to "rate over 15m > 3", which is not
the same. Rate is measured per second, so the equivalent expression
would have been "rate over 15m > (3 / 60 / 15)". As part of fixing
that, we need to handle the cases where Prometheus starts, or where
a new etcd cluster is first scraped, with the leader change count
already high: if you start a new etcd cluster and it is already at
5 leader changes when Prometheus first scrapes it, we should fire on
that transition. The alert is also more responsive now, so a quick
burst of 3 leader changes alerts within 5m rather than 15m.
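
To make the unit mismatch concrete, here is a small Python sketch of the
arithmetic only. It is not the alert rule itself: the helper names are
hypothetical, and it ignores the extrapolation Prometheus applies at the
edges of a range window.

```python
# Length of the alert's range window: 15 minutes, in seconds.
WINDOW_SECONDS = 15 * 60


def increase_to_rate(increase, window_seconds=WINDOW_SECONDS):
    """Convert a total increase over the window into a per-second rate."""
    return increase / window_seconds


def rate_to_increase(rate, window_seconds=WINDOW_SECONDS):
    """Convert a per-second rate into the total increase over the window."""
    return rate * window_seconds


# Threshold that a rate-based rule would need in order to match
# "increase over 15m > 3": 3 changes spread over 900 seconds.
equivalent_rate_threshold = increase_to_rate(3)

# What "rate over 15m > 3" actually demands: 3 leader changes per second,
# sustained for the whole window, i.e. 3 * 900 = 2700 changes.
implied_increase_threshold = rate_to_increase(3)
```

Under these assumptions the mistaken rule would only fire after roughly
2700 leader changes in 15 minutes, so in practice it never fired.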

An etcd member being down is an important failure state. While
normal admin operations such as rotation may cause transient outages,
whenever a member is down the cluster is operating in a degraded
fashion. Add an alert that reports when any members are down
so that administrators know whether the next failure is fatal.
The rule is more complicated than `up{...} == 0` because not all
failure modes for etcd may have an `up{...}` entry for each member.
For instance, a Kubernetes service in front of an etcd cluster
might only have 2 endpoints recorded in `up` because the third
pod is evicted by the kubelet - the cluster is degraded but
`count(up{...})` would not return the full cluster size. Instead,
use network peer send failures as a failure detector and attempt
to return the max of down services or failing peers. We may
undercount the number of total failures, but we will at least
alert that a member is down.
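
The combination described above can be sketched in Python as follows.
This only models the idea of taking the max of two failure signals; the
function name, the input shapes, and the failure-rate threshold are
assumptions made for illustration and are not taken from the rule.

```python
# Hypothetical cut-off above which a peer's send failures count as "failing".
FAILURE_RATE_THRESHOLD = 0.01


def members_down(up_by_instance, send_failure_rate_by_target):
    """Estimate how many etcd members are down.

    up_by_instance: maps a scraped instance to its `up` value (1 or 0).
        Members with no scrape target at all are simply missing.
    send_failure_rate_by_target: maps a peer (as seen by the other members)
        to the rate of failed sends addressed to it.
    """
    # Signal 1: scrape targets that exist but report down.
    down_by_up = sum(1 for value in up_by_instance.values() if value == 0)
    # Signal 2: peers that other members cannot send to.
    failing_peers = sum(
        1
        for rate in send_failure_rate_by_target.values()
        if rate > FAILURE_RATE_THRESHOLD
    )
    # The signals may overlap, so take the max rather than the sum.
    return max(down_by_up, failing_peers)


# Evicted pod: only two `up` series exist and both are healthy, but the
# surviving members fail to reach the third peer.
evicted = members_down({"etcd-0": 1, "etcd-1": 1}, {"etcd-2": 0.5})

# Undercount: one member is down per `up`, a different one is unreachable.
undercount = members_down({"etcd-0": 0, "etcd-1": 1}, {"etcd-2": 0.5})
```

In the first case the sketch reports one member down even though `up`
shows nothing wrong. In the second it also reports one, although two
members are affected, which is the undercounting trade-off noted above.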

Currently the alert messages state that the issue was detected
within the last hour, although we evaluate the last 15 minutes and
then wait another 15 minutes while the alert keeps firing.
This fix changes the message to say 30 minutes.

Currently the EtcdInsufficientMembers alert fires when more than (X/2)-1
instances are unavailable. This fixes it to fire at the correct limit of (X-1)/2
unavailable instances, and $value now contains the number of available instances
instead of unavailable ones. Also adds a unit test for the EtcdInsufficientMembers alert.
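
As a rough check of the two limits, the Python sketch below compares them
against the usual majority-quorum definition. It assumes both limits are
read as "fires when more than this many instances are unavailable" and
uses real-valued division; the function names are hypothetical.

```python
def old_rule_fires(total, unavailable):
    """Previous limit: more than (X/2)-1 instances unavailable."""
    return unavailable > (total / 2) - 1


def new_rule_fires(total, unavailable):
    """Corrected limit: more than (X-1)/2 instances unavailable."""
    return unavailable > (total - 1) / 2


def quorum_lost(total, unavailable):
    """True when fewer than a strict majority of members remain available."""
    available = total - unavailable
    return available < total // 2 + 1


# Collect every (cluster size, unavailable count) pair where a rule
# disagrees with the quorum definition, for cluster sizes 1 through 7.
old_mismatches = [
    (total, down)
    for total in range(1, 8)
    for down in range(total + 1)
    if old_rule_fires(total, down) != quorum_lost(total, down)
]
new_mismatches = [
    (total, down)
    for total in range(1, 8)
    for down in range(total + 1)
    if new_rule_fires(total, down) != quorum_lost(total, down)
]
```

Under this reading the corrected limit agrees with quorum loss for every
size checked, while the previous limit fires one failure early on
odd-sized clusters, for example with 1 of 3 members unavailable.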