Motivation:
- ServerConfig is part of the 'embed' public API, while etcdserver is more 'internal'.
- EtcdServer is already too big, and the config is a widely used leaf dependency
  that both halves would need if we were to split etcdserver
  (e.g. into pre- and post-apply parts).
Thanks to this, unix sockets should no longer be created by integration
tests in the source code directory, where they could trigger IDE reloads
and unnecessary load (and mess).
Thanks to this the logs:
- are automatically printed if the test fails.
- are in a consistent format.
- are annotated with 'member' information identifying the cluster member
  emitting them (see the sketch after this list).
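A minimal sketch of how such a per-test, per-member logger could be wired
up with zap's zaptest package; the newMemberLogger helper and member names
are illustrative assumptions, not etcd's exact test wiring:
```
package integration

import (
	"testing"

	"go.uber.org/zap"
	"go.uber.org/zap/zaptest"
)

// newMemberLogger builds a logger owned by the test: zaptest routes entries
// through t.Log, so Go's test framework only prints them when the test fails
// (or with -v), and Named tags every entry with the member's name.
func newMemberLogger(t *testing.T, memberName string) *zap.Logger {
	return zaptest.NewLogger(t).Named(memberName)
}

func TestMemberLogging(t *testing.T) {
	lg := newMemberLogger(t, "m0")
	lg.Info("member started") // shown only on failure or in verbose mode
}
```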
Side changes:
- Set a proper default for DefaultWarningApplyDuration (used to be '0')
- Name the members based on their position in the list (as opposed to
random names)
The flag protects etcd memory from being swapped out to disk.
This can happen in memory-constrained systems where the mmapped bbolt
area is a natural candidate for swapping out.
This flag should provide better tail latency at the cost of higher RSS
memory usage. If the experiment is successful, the logic should be moved
into the bbolt layer, where we can protect specific bbolt instances
(e.g. avoid protecting both during defragmentation).
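A minimal sketch of the underlying mechanism, assuming Linux and the
golang.org/x/sys/unix package; how and where etcd applies the lock is an
implementation detail of the experimental flag:
```
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// Pin all current and future pages of this process into RAM so the
	// kernel cannot swap them out (this covers the mmapped bbolt region).
	// Requires CAP_IPC_LOCK or a sufficiently high RLIMIT_MEMLOCK.
	if err := unix.Mlockall(unix.MCL_CURRENT | unix.MCL_FUTURE); err != nil {
		log.Fatalf("mlockall failed: %v", err)
	}
	// ... start the server; RSS stays resident, trading memory for latency ...
}
```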
There are situations where we don't wish to fsync but we do want to
write the data.
Typically this occurs in clusters where fsync latency (often the
result of firmware) transiently spikes. For Kubernetes clusters this
causes (many) elections, which have knock-on effects such that the API
server will transiently fail, causing other components to fail in turn.
By writing the data (buffered and asynchronously flushed, so in most
situations the write is fast) and avoiding the fsync, we no longer
trigger this situation and still opportunistically write out the data.
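A minimal sketch of the trade-off, using a hypothetical appendEntry helper
and unsafeNoFsync flag; etcd's real WAL code is more involved:
```
package main

import (
	"bufio"
	"os"
)

// appendEntry writes data to the log file. With unsafeNoFsync the bytes are
// still handed to the kernel (buffered write + flush) and will be written
// out asynchronously, but we skip the fsync that waits on the disk/firmware.
func appendEntry(f *os.File, data []byte, unsafeNoFsync bool) error {
	w := bufio.NewWriter(f)
	if _, err := w.Write(data); err != nil {
		return err
	}
	if err := w.Flush(); err != nil {
		return err
	}
	if unsafeNoFsync {
		return nil // opportunistic: data is written, durability is best-effort
	}
	return f.Sync() // durable: block until the data reaches stable storage
}

func main() {
	f, err := os.CreateTemp("", "wal-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	if err := appendEntry(f, []byte("entry"), true); err != nil {
		panic(err)
	}
}
```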
Anecdotally:
Because the fsync is missing, there is the argument that certain
types of failure events will cause data corruption or loss; in
testing this wasn't seen. If this were to occur, the expectation is
that the member can be re-added to the cluster or, worst case, restored
from a robust persisted snapshot.
The etcd members are deployed across isolated racks with different
power feeds. An instantaneous failure of all of them simultaneously
is unlikely.
Testing was usually of the form:
* create (Kubernetes) etcd write churn by creating ReplicaSets of
some thousands of pods
* break/fail the leader
Failure testing included:
* hard node power-off events
* disk removal
* orderly reboots/shutdown
In all cases when the node recovered it was able to rejoin the
cluster and synchronize.
The purpose of this change is to learn more about flake cases like:
https://travis-ci.com/github/etcd-io/etcd/jobs/488324449
```
% (cd tests && 'env' 'go' 'test' '-timeout=30m' '--race=false' '--cpu=2' './integration/...')
stderr: go: downloading github.com/golang/groupcache v0.0.0-20200121045136-8c9f03a8e57e
ok go.etcd.io/etcd/tests/v3/integration 197.295s
ok go.etcd.io/etcd/tests/v3/integration/client 0.089s
ok go.etcd.io/etcd/tests/v3/integration/client/examples 0.038s
ok go.etcd.io/etcd/tests/v3/integration/clientv3 70.365s
ok go.etcd.io/etcd/tests/v3/integration/clientv3/concurrency 3.169s
ok go.etcd.io/etcd/tests/v3/integration/clientv3/connectivity 100.535s
ok go.etcd.io/etcd/tests/v3/integration/clientv3/examples 1.341s
ok go.etcd.io/etcd/tests/v3/integration/clientv3/experimental/recipes 3.277s
No output has been received in the last 10m0s,
```