humanize.Time() drops precision resulting in some events reporting they took
"now" time to complete. Using time.Duration.String() results in accurate
duration being reported.
Fixes#13905
Reason to store CI and term in backend was to make db fully independent
snapshot, it was never meant to interfere with apply logic. Skip of CI
was introduced for v2->v3 migration where we wanted to prevent it from
decreasing when replaying wal in
https://github.com/etcd-io/etcd/pull/5391. By mistake it was added to
apply flow during refactor in
https://github.com/etcd-io/etcd/pull/12855#commitcomment-70713670.
Consistency index and term should only be negotiated and used by raft to make
decisions. Their values should only driven by raft state machine and
backend should only be responsible for storing them.
Currently there are a handful of tests within etcd that silently fail
because LeakDetection will skip the test before it manages to hit this
check.
Here we move the check to the beginning of the process to highlight
these cases earlier, and to avoid them accidentally presenting as leaks.
Removed the fields consistentIdx and consistentTerm from struct EtcdServer,
and added applyingIndex and applyingTerm into struct consistentIndex in
package cindex. We may remove the two fields completely if we decide to
remove the OnPreCommitUnsafe, and it will depend on the performance test
result.
Previously the SetConsistentIndex() is called during the apply workflow,
but it's outside the db transaction. If a commit happens between SetConsistentIndex
and the following apply workflow, and etcd crashes for whatever reason right
after the commit, then etcd commits an incomplete transaction to db.
Eventually etcd runs into the data inconsistency issue.
In this commit, we move the SetConsistentIndex into a txPostLockHook, so
it will be executed inside the transaction lock.
We don't consistently reach the same etcd server during the lifetime of
a test and in some cases, this means that this test will flake if an
etcd server was slow to update its state and the test hits the outdated
server.
Here we switch to using an `Eventually` case which will wait upto a
second for the expected result before failing - with a 10ms gap between
invocations.
```
[tests(dani/leasefix)] $ gotestsum -- ./common -tags integration -count 100 -timeout 15m -run TestLeaseGrantAndList
✓ common (2m26.71s)
DONE 1600 tests in 147.258s
```
I think strong (not-equal) relationship was too restrictive when expressed with 1s granularity.
```
logger.go:130: 2022-04-03T22:15:15.242+0200 WARN m1 leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk {"member": "m1", "to": "cb785755eb80ac1", "heartbeat-interval": "10ms", "expected-duration": "20ms", "exceeded-duration": "24.666613ms"}
logger.go:130: 2022-04-03T22:15:15.262+0200 INFO m-1 published local member to cluster through raft {"member": "m-1", "local-member-id": "e2dd9f523aa7be87", "local-member-attributes": "{Name:m-1 ClientURLs:[unix://127.0.0.1:2196386040]}", "cluster-id": "b4b8e7e41c23c8b5", "publish-timeout": "5.2s"}
v3_lease_test.go:415: Expected lease ttl (4m58s) to be greather than (4m58s)
```
The code now ensures that each of the test is running in its own directory as opposed to shared os.tempdir.
```
$ (cd tests && env go test -timeout=15m --race go.etcd.io/etcd/tests/v3/integration/clientv3/examples -run ExampleAuth)
2022/04/03 10:24:59 Running tests (examples): ...
2022/04/03 10:24:59 the function can be called only in the test context. Was integration.BeforeTest() called ?
2022/04/03 10:24:59 2022-04-03T10:24:59.462+0200 INFO m0 LISTEN GRPC {"member": "m0", "grpcAddr": "localhost:m0", "m.Name": "m0"}
```
Nearly none of the tests was checking the value... just assuming WaitLeader success.
```
maintenance_test.go:277: Waiting for leader...
logger.go:130: 2022-04-03T08:01:09.914+0200 INFO m0 cluster version differs from storage version. {"member": "m0", "cluster-version": "3.6.0", "storage-version": "3.5.0"}
logger.go:130: 2022-04-03T08:01:09.915+0200 WARN m0 leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk {"member": "m0", "to": "2acc3d3b521981", "heartbeat-interval": "10ms", "expected-duration": "20ms", "exceeded-duration": "103.756219ms"}
logger.go:130: 2022-04-03T08:01:09.916+0200 INFO m0 updated storage version {"member": "m0", "new-storage-version": "3.6.0"}
...
logger.go:130: 2022-04-03T08:01:09.926+0200 INFO grpc [[roundrobin] roundrobinPicker: Build called with info: {map[0xc002630ac0:{{unix:localhost:m0 localhost <nil> 0 <nil>}} 0xc002630af0:{{unix:localhost:m1 localhost <nil> 0 <nil>}} 0xc002630b20:{{unix:localhost:m2 localhost <nil> 0 <nil>}}]}]
logger.go:130: 2022-04-03T08:01:09.926+0200 WARN m0 apply request took too long {"member": "m0", "took": "114.661766ms", "expected-duration": "100ms", "prefix": "", "request": "header:<ID:12658633312866157316 > cluster_version_set:<ver:\"3.6.0\" > ", "response": ""}
logger.go:130: 2022-04-03T08:01:09.927+0200 INFO m0 cluster version is updated {"member": "m0", "cluster-version": "3.6"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m2.raft 9f96af25a04e2ec3 [logterm: 2, index: 8, vote: 9903a56eaf96afac] ignored MsgVote from 2acc3d3b521981 [logterm: 2, index: 8] at term 2: lease is not expired (remaining ticks: 10) {"member": "m2"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft 9903a56eaf96afac [logterm: 2, index: 8, vote: 9903a56eaf96afac] ignored MsgVote from 2acc3d3b521981 [logterm: 2, index: 8] at term 2: lease is not expired (remaining ticks: 5) {"member": "m0"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft 9903a56eaf96afac [term: 2] received a MsgAppResp message with higher term from 2acc3d3b521981 [term: 3] {"member": "m0"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft 9903a56eaf96afac became follower at term 3 {"member": "m0"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft raft.node: 9903a56eaf96afac lost leader 9903a56eaf96afac at term 3 {"member": "m0"}
maintenance_test.go:279: Leader established.
```
Tmp
```
% (cd client/v3 && env go test -short -timeout=3m --race ./...)
--- FAIL: TestAuthTokenBundleNoOverwrite (0.00s)
client_test.go:210: listen unix /var/folders/t1/3m8z9xz93t9c3vpt7zyzjm6w00374n/T/TestAuthTokenBundleNoOverwrite3197524989/001/etcd-auth-test:0: bind: invalid argument
FAIL
FAIL go.etcd.io/etcd/client/v3 4.270s
```
The reason was that the path exceeded 108 chars (that is too much for socket).
In the mitigation we first change chroot (working directory) to the tempDir... such the path is 'local'.
We've had these since in one form or another since 23a302364c
(Makefile: initial commit, 2017-09-29), but in at least some cases the
underlying shell does not pipefail, a test failure gets swallowed, and
the make call exits zero despite failing the tests [1]:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_etcd/109/pull-ci-openshift-etcd-openshift-4.11-unit/1509260812278042624/artifacts/test/build-log.txt
TEST_OPTS: PASSES='unit'
log-file: test-MTY0ODY3MTA1MQo.log
PASSES='unit' ./test.sh 2>&1 | tee test-MTY0ODY3MTA1MQo.log
% env GO111MODULE=off go get github.com/myitcv/gobin
Running with --race
Starting at: Wed Mar 30 20:10:52 UTC 2022
'unit' started at Wed Mar 30 20:10:52 UTC 2022
% (cd api && env go test -short -timeout=3m --race ./...)
stderr: authpb/auth.pb.go:12:2: open /go/pkg/mod/github.com/gogo/protobuf@v1.3.2/gogoproto: permission denied
stderr: authpb/auth.pb.go:13:2: open /go/pkg/mod/github.com/golang/protobuf@v1.5.2/proto: permission denied
stderr: etcdserverpb/rpc.pb.go:17:2: open /go/pkg/mod/google.golang.org/genproto@v0.0.0-20210602131652-f16073e35f0c/googleapis/api/annotations: permission denied
stderr: etcdserverpb/rpc.pb.go:18:2: open /go/pkg/mod/google.golang.org/grpc@v1.38.0: permission denied
stderr: etcdserverpb/rpc.pb.go:19:2: open /go/pkg/mod/google.golang.org/grpc@v1.38.0/codes: permission denied
stderr: etcdserverpb/rpc.pb.go:20:2: open /go/pkg/mod/google.golang.org/grpc@v1.38.0/status: permission denied
stderr: etcdserverpb/gw/rpc.pb.gw.go:17:2: open /go/pkg/mod/github.com/golang/protobuf@v1.5.2/descriptor: permission denied
stderr: etcdserverpb/gw/rpc.pb.gw.go:19:2: open /go/pkg/mod/github.com/grpc-ecosystem/grpc-gateway@v1.16.0/runtime: permission denied
stderr: etcdserverpb/gw/rpc.pb.gw.go:20:2: open /go/pkg/mod/github.com/grpc-ecosystem/grpc-gateway@v1.16.0/utilities: permission denied
FAIL: (code:1):
% (cd api && env go test -short -timeout=3m --race ./...)
stderr: etcdserverpb/gw/rpc.pb.gw.go:23:2: open /go/pkg/mod/google.golang.org/grpc@v1.38.0/grpclog: permission denied
stderr: version/version.go:23:2: open /go/pkg/mod/github.com/coreos/go-semver@v0.3.0/semver: permission denied
FAIL: 'unit' failed at Wed Mar 30 20:10:52 UTC 2022
! egrep "(--- FAIL:|DATA RACE|panic: test timed out|appears to have leaked)" -B50 -A10 test-MTY0ODY3MTA1MQo.log
We can't drop the log aggregation, because the log files are used for
the panic/race grepping. But I'm dropping the tee (so no more
synchronous updates, but we no longer have to worry about pipefail
handling). And then if the test script fails, I'm dumping the log
file to stdout and exiting 1, so the overall run fails.
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_etcd/109/pull-ci-openshift-etcd-openshift-4.11-unit/1509260812278042624