* etcd-mixin: Reformulate alerting rules to use `without` rather than `by`
With aggregations using `by`, any additional target labels a user
might have configured are aggregated away. However, those target
labels are useful for, e.g., alert routing. With this commit, nothing
changes for the vanilla job/instance target labels, but users with
additional target labels can now still make use of them.
Signed-off-by: beorn7 <beorn@grafana.com>
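For illustration, a hedged sketch of the change on one rule (the alert name and expression here are simplified stand-ins, not the mixin's exact contents):

```yaml
# Before: `by` keeps only the listed labels, so any extra target
# labels (e.g. a user-configured `cluster` label) are dropped from
# the fired alert and cannot be used for routing.
- alert: etcdInsufficientMembers
  expr: count(up{job=~".*etcd.*"} == bool 0) by (job) > 0

# After: `without` removes only the instance-level labels, so all
# other target labels survive into the fired alert.
- alert: etcdInsufficientMembers
  expr: count(up{job=~".*etcd.*"} == bool 0) without (instance) > 0
```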
* etcd-mixin: Parametrize instance labels to aggregate away
Signed-off-by: beorn7 <beorn@grafana.com>
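A minimal jsonnet sketch of what such a parametrization can look like (the config key name below is an assumption for illustration, not necessarily the one the mixin uses):

```jsonnet
{
  _config+:: {
    // Instance-level labels that alert expressions aggregate away;
    // overridable for setups whose instance labels are named differently.
    etcd_instance_labels: 'instance',  // assumed key name
  },

  prometheusAlerts+:: {
    // Expressions interpolate the parameter instead of hard-coding
    // `without (instance)`.
    example_expr:: 'sum without (%(etcd_instance_labels)s) (up == bool 0)' % $._config,
  },
}
```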
* raft: check conf change before campaign
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
* raft: extract hup function
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
* raft: check pending conf change for transferleader
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
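A simplified, self-contained sketch of the gating these three commits describe (not etcd's actual raft code; the helper names and entry types are stand-ins):

```go
package main

import "fmt"

// Stand-ins for the raftpb entry types involved in the check.
type EntryType int

const (
	EntryNormal EntryType = iota
	EntryConfChange
)

type Entry struct{ Type EntryType }

// numOfPendingConf counts conf-change entries in a slice of log entries.
func numOfPendingConf(ents []Entry) int {
	n := 0
	for _, e := range ents {
		if e.Type == EntryConfChange {
			n++
		}
	}
	return n
}

// hup is the extracted campaign entry point: it inspects the
// committed-but-unapplied window of the log and refuses to campaign
// (the same check guards leadership transfer) while a conf change is
// still pending, because winning an election before applying it could
// act on a stale membership configuration.
func hup(log []Entry, applied, committed int) bool {
	pending := log[applied:committed]
	if n := numOfPendingConf(pending); n != 0 {
		fmt.Printf("cannot campaign: %d pending conf change(s)\n", n)
		return false
	}
	return true
}

func main() {
	log := []Entry{{EntryNormal}, {EntryConfChange}}
	fmt.Println(hup(log, 1, 2)) // false: one conf change committed but unapplied
}
```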
Tested the conditions that cause the following panic:
panic: invalid Go type int for field k8s_io.kubernetes.vendor.go_etcd_io.etcd.etcdserver.etcdserverpb.loggablePutRequest.value_size
Signed-off-by: Ed Bartosh <eduard.bartosh@intel.com>
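For context, a hedged reduction of the type behind that panic (field set trimmed; the full struct lives in etcdserver): the reflection-based protobuf marshaler has no wire mapping for Go's platform-sized `int`, so a tagged `int` field panics at marshal time, while `int64` maps cleanly to a varint.

```go
package main

import "fmt"

// loggablePutRequest is a simplified stand-in for the etcdserver type named
// in the panic. With `ValueSize int` the reflection-based protobuf marshaler
// panics ("invalid Go type int for field ...value_size") because plain `int`
// has no defined wire type; int64 does.
type loggablePutRequest struct {
	ValueSize int64 `protobuf:"varint,2,opt,name=value_size,proto3"`
}

func main() {
	fmt.Printf("%+v\n", loggablePutRequest{ValueSize: 42})
}
```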
Before this change, during a reboot in which etcd recovers quickly
(e.g. within 1 min), the etcdMembersDown alert tends to fire even when
etcd is fully healthy, because the averaging function can take more
than 3 minutes to average back down below the 0.01 threshold.

This change reduces the likelihood of such a false positive by
considering a shorter (1 min) failure-rate window, which tends to
average back down below the threshold far more quickly (within 1 min).
The `for` clause of the alert should ensure that the alert still fires
if the poor conditions are sustained for an unreasonable overall time
(3 min).
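Illustratively, the relevant fragment before and after (simplified from the description above; the mixin's real rule is more involved):

```yaml
# Before: a 3m rate window keeps the failure rate above the 0.01
# threshold long after a fast recovery, so the alert fires anyway.
expr: rate(etcd_network_peer_sent_failures_total[3m]) > 0.01

# After: a 1m window averages back down within about a minute, while
# `for: 3m` still fires the alert if the failures are sustained.
expr: rate(etcd_network_peer_sent_failures_total[1m]) > 0.01
for: 3m
```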
A cluster with three members could see three leader changes during a
healthy rolling reboot, and we don't want to alert on that. Raising
the threshold to 4 reduces false alarms for clusters with three or
fewer members, which probably covers most clusters. It also slightly
increases the risk of false negatives, but if the cluster is
struggling with high latency, it seems likely that it would quickly
pass the new threshold too.

The hard-coded threshold means that we are still likely to get false
positives during rolling reboots of clusters with four or more
members. Ideally we'd scale the threshold with the cluster size, but
I'm not sure how to do that. Three members is the minimum size for
high availability, so reducing false positives for that case seems
worth addressing even if we leave larger clusters largely unchanged.
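A hedged sketch of the bumped rule (simplified; the mixin's actual expression aggregates differently):

```yaml
- alert: etcdHighNumberOfLeaderChanges
  # Was `>= 3`, which a healthy rolling reboot of a three-member
  # cluster can trip; 4 leaves headroom for that case.
  expr: increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]) >= 4
```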
Also manually bring etcd3_alert.rules up to speed, since it seems to
have been passed over by 16fc8a2b4b (Documentation/op-guide:
Re-generate alert rules and dashboard from mixin, 2020-04-07, #11768).
* backend: Create base type for readTx and concurrentReadTx
* backend: Implemented review comments to rename rTx to baseReadTx and remove TODO
* backend: Resolved comments around baseReadTx
* backend: Implemented a review comment
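A compact sketch of the resulting shape (simplified; etcd's real baseReadTx also tracks the bolt transaction, bucket cache, and so on): shared buffer-lookup state moves into baseReadTx, which both read transaction flavors embed.

```go
package main

import (
	"fmt"
	"sync"
)

// baseReadTx holds the state and range logic shared by both read
// transaction flavors.
type baseReadTx struct {
	mu  sync.RWMutex
	buf map[string]string // stand-in for the tx read buffer
}

func (b *baseReadTx) unsafeRange(key string) (string, bool) {
	v, ok := b.buf[key]
	return v, ok
}

// readTx shares its buffer with writers, so every access takes the lock.
type readTx struct{ baseReadTx }

func (t *readTx) Range(key string) (string, bool) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	return t.unsafeRange(key)
}

// concurrentReadTx gets its own copy of the buffer at creation time,
// so reads proceed without holding the shared lock.
type concurrentReadTx struct{ baseReadTx }

func (t *concurrentReadTx) Range(key string) (string, bool) {
	return t.unsafeRange(key)
}

func main() {
	rt := &readTx{baseReadTx{buf: map[string]string{"k": "v"}}}
	crt := &concurrentReadTx{baseReadTx{buf: map[string]string{"k": "v"}}}
	fmt.Println(rt.Range("k"))
	fmt.Println(crt.Range("k"))
}
```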