When we can't reach quorum, we were waiting forever and never sending
the systemd notify message. As a result, systemd would eventually time out
and restart the etcd process which likely would make the unhealthy cluster
in an even worse state
Improves #13785
Signed-off-by: Nicolai Moore <niconorsk@gmail.com>
This should aid in debugging test flakes, especially in tests where the process is restarted very often and thus changes its pid.
Now it's a lot easier to grep for different members, also when different tests fail at the same time.
The test TestDowngradeUpgradeClusterOf3 as mentioned in #13167 is a good example for that.
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
When etcdserver receives a LeaseRenew request, it may be still in
progress of processing the LeaseGrantRequest on exact the same
leaseID. Accordingly it may return a TTL=0 to client due to the
leaseID not found error. So the leader should wait for the appliedID
to be available before processing client requests.
Currently there are a handful of tests within etcd that silently fail
because LeakDetection will skip the test before it manages to hit this
check.
Here we move the check to the beginning of the process to highlight
these cases earlier, and to avoid them accidentally presenting as leaks.
The code now ensures that each of the test is running in its own directory as opposed to shared os.tempdir.
```
$ (cd tests && env go test -timeout=15m --race go.etcd.io/etcd/tests/v3/integration/clientv3/examples -run ExampleAuth)
2022/04/03 10:24:59 Running tests (examples): ...
2022/04/03 10:24:59 the function can be called only in the test context. Was integration.BeforeTest() called ?
2022/04/03 10:24:59 2022-04-03T10:24:59.462+0200 INFO m0 LISTEN GRPC {"member": "m0", "grpcAddr": "localhost:m0", "m.Name": "m0"}
```
Nearly none of the tests was checking the value... just assuming WaitLeader success.
```
maintenance_test.go:277: Waiting for leader...
logger.go:130: 2022-04-03T08:01:09.914+0200 INFO m0 cluster version differs from storage version. {"member": "m0", "cluster-version": "3.6.0", "storage-version": "3.5.0"}
logger.go:130: 2022-04-03T08:01:09.915+0200 WARN m0 leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk {"member": "m0", "to": "2acc3d3b521981", "heartbeat-interval": "10ms", "expected-duration": "20ms", "exceeded-duration": "103.756219ms"}
logger.go:130: 2022-04-03T08:01:09.916+0200 INFO m0 updated storage version {"member": "m0", "new-storage-version": "3.6.0"}
...
logger.go:130: 2022-04-03T08:01:09.926+0200 INFO grpc [[roundrobin] roundrobinPicker: Build called with info: {map[0xc002630ac0:{{unix:localhost:m0 localhost <nil> 0 <nil>}} 0xc002630af0:{{unix:localhost:m1 localhost <nil> 0 <nil>}} 0xc002630b20:{{unix:localhost:m2 localhost <nil> 0 <nil>}}]}]
logger.go:130: 2022-04-03T08:01:09.926+0200 WARN m0 apply request took too long {"member": "m0", "took": "114.661766ms", "expected-duration": "100ms", "prefix": "", "request": "header:<ID:12658633312866157316 > cluster_version_set:<ver:\"3.6.0\" > ", "response": ""}
logger.go:130: 2022-04-03T08:01:09.927+0200 INFO m0 cluster version is updated {"member": "m0", "cluster-version": "3.6"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m2.raft 9f96af25a04e2ec3 [logterm: 2, index: 8, vote: 9903a56eaf96afac] ignored MsgVote from 2acc3d3b521981 [logterm: 2, index: 8] at term 2: lease is not expired (remaining ticks: 10) {"member": "m2"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft 9903a56eaf96afac [logterm: 2, index: 8, vote: 9903a56eaf96afac] ignored MsgVote from 2acc3d3b521981 [logterm: 2, index: 8] at term 2: lease is not expired (remaining ticks: 5) {"member": "m0"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft 9903a56eaf96afac [term: 2] received a MsgAppResp message with higher term from 2acc3d3b521981 [term: 3] {"member": "m0"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft 9903a56eaf96afac became follower at term 3 {"member": "m0"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft raft.node: 9903a56eaf96afac lost leader 9903a56eaf96afac at term 3 {"member": "m0"}
maintenance_test.go:279: Leader established.
```
Tmp
```
% (cd client/v3 && env go test -short -timeout=3m --race ./...)
--- FAIL: TestAuthTokenBundleNoOverwrite (0.00s)
client_test.go:210: listen unix /var/folders/t1/3m8z9xz93t9c3vpt7zyzjm6w00374n/T/TestAuthTokenBundleNoOverwrite3197524989/001/etcd-auth-test:0: bind: invalid argument
FAIL
FAIL go.etcd.io/etcd/client/v3 4.270s
```
The reason was that the path exceeded 108 chars (that is too much for socket).
In the mitigation we first change chroot (working directory) to the tempDir... such the path is 'local'.