Previously the SetConsistentIndex() is called during the apply workflow,
but it's outside the db transaction. If a commit happens between SetConsistentIndex
and the following apply workflow, and etcd crashes for whatever reason right
after the commit, then etcd commits an incomplete transaction to db.
Eventually etcd runs into the data inconsistency issue.
In this commit, we move the SetConsistentIndex into a txPostLockHook, so
it will be executed inside the transaction lock.
We don't consistently reach the same etcd server during the lifetime of
a test and in some cases, this means that this test will flake if an
etcd server was slow to update its state and the test hits the outdated
server.
Here we switch to using an `Eventually` case which will wait upto a
second for the expected result before failing - with a 10ms gap between
invocations.
```
[tests(dani/leasefix)] $ gotestsum -- ./common -tags integration -count 100 -timeout 15m -run TestLeaseGrantAndList
✓ common (2m26.71s)
DONE 1600 tests in 147.258s
```
I think strong (not-equal) relationship was too restrictive when expressed with 1s granularity.
```
logger.go:130: 2022-04-03T22:15:15.242+0200 WARN m1 leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk {"member": "m1", "to": "cb785755eb80ac1", "heartbeat-interval": "10ms", "expected-duration": "20ms", "exceeded-duration": "24.666613ms"}
logger.go:130: 2022-04-03T22:15:15.262+0200 INFO m-1 published local member to cluster through raft {"member": "m-1", "local-member-id": "e2dd9f523aa7be87", "local-member-attributes": "{Name:m-1 ClientURLs:[unix://127.0.0.1:2196386040]}", "cluster-id": "b4b8e7e41c23c8b5", "publish-timeout": "5.2s"}
v3_lease_test.go:415: Expected lease ttl (4m58s) to be greather than (4m58s)
```
The code now ensures that each of the test is running in its own directory as opposed to shared os.tempdir.
```
$ (cd tests && env go test -timeout=15m --race go.etcd.io/etcd/tests/v3/integration/clientv3/examples -run ExampleAuth)
2022/04/03 10:24:59 Running tests (examples): ...
2022/04/03 10:24:59 the function can be called only in the test context. Was integration.BeforeTest() called ?
2022/04/03 10:24:59 2022-04-03T10:24:59.462+0200 INFO m0 LISTEN GRPC {"member": "m0", "grpcAddr": "localhost:m0", "m.Name": "m0"}
```
Nearly none of the tests was checking the value... just assuming WaitLeader success.
```
maintenance_test.go:277: Waiting for leader...
logger.go:130: 2022-04-03T08:01:09.914+0200 INFO m0 cluster version differs from storage version. {"member": "m0", "cluster-version": "3.6.0", "storage-version": "3.5.0"}
logger.go:130: 2022-04-03T08:01:09.915+0200 WARN m0 leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk {"member": "m0", "to": "2acc3d3b521981", "heartbeat-interval": "10ms", "expected-duration": "20ms", "exceeded-duration": "103.756219ms"}
logger.go:130: 2022-04-03T08:01:09.916+0200 INFO m0 updated storage version {"member": "m0", "new-storage-version": "3.6.0"}
...
logger.go:130: 2022-04-03T08:01:09.926+0200 INFO grpc [[roundrobin] roundrobinPicker: Build called with info: {map[0xc002630ac0:{{unix:localhost:m0 localhost <nil> 0 <nil>}} 0xc002630af0:{{unix:localhost:m1 localhost <nil> 0 <nil>}} 0xc002630b20:{{unix:localhost:m2 localhost <nil> 0 <nil>}}]}]
logger.go:130: 2022-04-03T08:01:09.926+0200 WARN m0 apply request took too long {"member": "m0", "took": "114.661766ms", "expected-duration": "100ms", "prefix": "", "request": "header:<ID:12658633312866157316 > cluster_version_set:<ver:\"3.6.0\" > ", "response": ""}
logger.go:130: 2022-04-03T08:01:09.927+0200 INFO m0 cluster version is updated {"member": "m0", "cluster-version": "3.6"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m2.raft 9f96af25a04e2ec3 [logterm: 2, index: 8, vote: 9903a56eaf96afac] ignored MsgVote from 2acc3d3b521981 [logterm: 2, index: 8] at term 2: lease is not expired (remaining ticks: 10) {"member": "m2"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft 9903a56eaf96afac [logterm: 2, index: 8, vote: 9903a56eaf96afac] ignored MsgVote from 2acc3d3b521981 [logterm: 2, index: 8] at term 2: lease is not expired (remaining ticks: 5) {"member": "m0"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft 9903a56eaf96afac [term: 2] received a MsgAppResp message with higher term from 2acc3d3b521981 [term: 3] {"member": "m0"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft 9903a56eaf96afac became follower at term 3 {"member": "m0"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft raft.node: 9903a56eaf96afac lost leader 9903a56eaf96afac at term 3 {"member": "m0"}
maintenance_test.go:279: Leader established.
```
Tmp
```
% (cd client/v3 && env go test -short -timeout=3m --race ./...)
--- FAIL: TestAuthTokenBundleNoOverwrite (0.00s)
client_test.go:210: listen unix /var/folders/t1/3m8z9xz93t9c3vpt7zyzjm6w00374n/T/TestAuthTokenBundleNoOverwrite3197524989/001/etcd-auth-test:0: bind: invalid argument
FAIL
FAIL go.etcd.io/etcd/client/v3 4.270s
```
The reason was that the path exceeded 108 chars (that is too much for socket).
In the mitigation we first change chroot (working directory) to the tempDir... such the path is 'local'.
We've had these since in one form or another since 23a302364c
(Makefile: initial commit, 2017-09-29), but in at least some cases the
underlying shell does not pipefail, a test failure gets swallowed, and
the make call exits zero despite failing the tests [1]:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_etcd/109/pull-ci-openshift-etcd-openshift-4.11-unit/1509260812278042624/artifacts/test/build-log.txt
TEST_OPTS: PASSES='unit'
log-file: test-MTY0ODY3MTA1MQo.log
PASSES='unit' ./test.sh 2>&1 | tee test-MTY0ODY3MTA1MQo.log
% env GO111MODULE=off go get github.com/myitcv/gobin
Running with --race
Starting at: Wed Mar 30 20:10:52 UTC 2022
'unit' started at Wed Mar 30 20:10:52 UTC 2022
% (cd api && env go test -short -timeout=3m --race ./...)
stderr: authpb/auth.pb.go:12:2: open /go/pkg/mod/github.com/gogo/protobuf@v1.3.2/gogoproto: permission denied
stderr: authpb/auth.pb.go:13:2: open /go/pkg/mod/github.com/golang/protobuf@v1.5.2/proto: permission denied
stderr: etcdserverpb/rpc.pb.go:17:2: open /go/pkg/mod/google.golang.org/genproto@v0.0.0-20210602131652-f16073e35f0c/googleapis/api/annotations: permission denied
stderr: etcdserverpb/rpc.pb.go:18:2: open /go/pkg/mod/google.golang.org/grpc@v1.38.0: permission denied
stderr: etcdserverpb/rpc.pb.go:19:2: open /go/pkg/mod/google.golang.org/grpc@v1.38.0/codes: permission denied
stderr: etcdserverpb/rpc.pb.go:20:2: open /go/pkg/mod/google.golang.org/grpc@v1.38.0/status: permission denied
stderr: etcdserverpb/gw/rpc.pb.gw.go:17:2: open /go/pkg/mod/github.com/golang/protobuf@v1.5.2/descriptor: permission denied
stderr: etcdserverpb/gw/rpc.pb.gw.go:19:2: open /go/pkg/mod/github.com/grpc-ecosystem/grpc-gateway@v1.16.0/runtime: permission denied
stderr: etcdserverpb/gw/rpc.pb.gw.go:20:2: open /go/pkg/mod/github.com/grpc-ecosystem/grpc-gateway@v1.16.0/utilities: permission denied
FAIL: (code:1):
% (cd api && env go test -short -timeout=3m --race ./...)
stderr: etcdserverpb/gw/rpc.pb.gw.go:23:2: open /go/pkg/mod/google.golang.org/grpc@v1.38.0/grpclog: permission denied
stderr: version/version.go:23:2: open /go/pkg/mod/github.com/coreos/go-semver@v0.3.0/semver: permission denied
FAIL: 'unit' failed at Wed Mar 30 20:10:52 UTC 2022
! egrep "(--- FAIL:|DATA RACE|panic: test timed out|appears to have leaked)" -B50 -A10 test-MTY0ODY3MTA1MQo.log
We can't drop the log aggregation, because the log files are used for
the panic/race grepping. But I'm dropping the tee (so no more
synchronous updates, but we no longer have to worry about pipefail
handling). And then if the test script fails, I'm dumping the log
file to stdout and exiting 1, so the overall run fails.
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_etcd/109/pull-ci-openshift-etcd-openshift-4.11-unit/1509260812278042624
This change is to ensure that all members returned during the client's
AutoSync are started and are not learners, which are not valid
etcd members to make requests to.