The difference in load configuration between the watch delay tests shows how
large the impact is. Even with the random write scheduler, grpc served under the
http server can only handle 500 KB with a 2 second delay. On the other hand, a
separate grpc server easily hits 10, 100 or even 1000 MB within 100 milliseconds.
The priority write scheduler that was used in most previous releases
is far worse than the random one.
Tests are configured to only 5 MB to avoid flakes and taking too long to fill
etcd.
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Two goroutines access the `gs` grpc server variable. Before the
insecure `gs` server starts, `gs` can be reassigned to the secure server, and
then the client fails to connect to etcd with an insecure request. This
is a data race. We should pass the server as an argument to the new goroutine.
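
A minimal sketch of the pattern, not the actual etcd code: the `server` type and `startBoth` function are hypothetical stand-ins. Each goroutine receives the current value of `gs` as an argument, so reassigning `gs` afterwards cannot change which server the earlier goroutine works with.

```go
package main

import "sync"

// server is a hypothetical stand-in for the grpc server stored in `gs`.
type server struct{ name string }

// startBoth starts one goroutine per server and reports which server
// each goroutine actually observed.
func startBoth() []string {
	var wg sync.WaitGroup
	seen := make([]string, 2) // each goroutine writes only its own slot

	gs := &server{name: "insecure"}
	wg.Add(1)
	// gs is evaluated here and passed by value, so the goroutine keeps
	// the insecure server even though gs is reassigned below.
	go func(s *server) {
		defer wg.Done()
		seen[0] = s.name
	}(gs)

	gs = &server{name: "secure"}
	wg.Add(1)
	go func(s *server) {
		defer wg.Done()
		seen[1] = s.name
	}(gs)

	wg.Wait()
	return seen
}
```

If the goroutines captured `gs` from the enclosing scope instead, the first one could read the variable after the reassignment, which is the race described above.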
fix: #15495
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The cluster version is initialized after the member becomes leader.
The update is handled asynchronously. It can't be updated if the member
has been closed and the Go runtime picks the `s.stopping` channel first.
```go
// e2a5df534c/server/etcdserver/server.go (L2170)
func (s *EtcdServer) monitorClusterVersions() {
	...
	for {
		select {
		case <-s.firstCommitInTerm.Receive():
		case <-time.After(monitorVersionInterval):
		case <-s.stopping:
			return
		}
		...
	}
}
```
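
When several cases of a `select` are ready at once, Go picks one of them pseudo-randomly. The sketch below illustrates that with two plain channels; `countSelect`, `firstCommit` and `stopping` are hypothetical names and do not come from the etcd code.

```go
package main

// countSelect runs a select n times with both channels ready and counts
// how often each case was chosen.
func countSelect(n int) (commits, stops int) {
	firstCommit := make(chan struct{}, 1)
	stopping := make(chan struct{})
	close(stopping) // a closed channel is always ready to receive

	for i := 0; i < n; i++ {
		firstCommit <- struct{}{} // make the first case ready as well
		select {
		case <-firstCommit:
			commits++
		case <-stopping:
			stops++
			<-firstCommit // drain so the next send does not block
		}
	}
	return commits, stops
}
```

Calling `countSelect` with a large `n` should report a non-zero count for both cases, which is why the stopping case can win even when a first-commit notification is already pending.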
Or, after `s.stopping` has been closed, [UpdateClusterVersion][1] won't
be attached successfully by GoAttach. For #15409, we can see the warning log
`server has stopped; skipping GoAttach` from GoAttach:
```plain
https://github.com/etcd-io/etcd/actions/runs/4340931587/jobs/7580103902
logger.go:130: 2023-03-06T07:36:44.253Z WARN default stopping grpc server due to error {"error": "accept tcp 127.0.0.1:2379: use of closed network connection"}
logger.go:130: 2023-03-06T07:36:44.253Z WARN default stopped grpc server due to error {"error": "accept tcp 127.0.0.1:2379: use of closed network connection"}
logger.go:130: 2023-03-06T07:36:44.253Z ERROR default setting up serving from embedded etcd failed. {"error": "accept tcp 127.0.0.1:2379: use of closed network connection"}
logger.go:130: 2023-03-06T07:36:44.253Z ERROR default setting up serving from embedded etcd failed. {"error": "http: Server closed"}
logger.go:130: 2023-03-06T07:36:44.253Z INFO default skipped leadership transfer for single voting member cluster {"local-member-id": "8e9e05c52164694d", "current-leader-member-id": "8e9e05c52164694d"}
logger.go:130: 2023-03-06T07:36:44.253Z WARN default server has stopped; skipping GoAttach
...
```
If the cluster version isn't updated, the minimum storage version will
be v3.5 because [AuthStatus][2] was introduced in [v3.5][3], and
the comparison will fail.
To fix this issue, we should wait for the cluster version to become ready
after the server is ready to serve requests.
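
One way to express such a wait is a bounded polling loop. This is only a sketch under assumptions: `waitForClusterVersion` and the `ready` callback are hypothetical and not the helper used in the etcd tests.

```go
package main

import (
	"errors"
	"time"
)

// waitForClusterVersion polls ready until it returns true or the timeout
// elapses. ready is a hypothetical callback that would check whether the
// cluster version has been set.
func waitForClusterVersion(ready func() bool, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if ready() {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for cluster version")
		}
		time.Sleep(interval)
	}
}
```

A test would call this once the server reports that it is ready to serve, and only then close the member and compare storage versions.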
[1]: <e2a5df534c/server/etcdserver/adapters.go (L45)>
[2]: <071e70cdc4>
[3]: <1b4e54c238>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
TestLeasingDeleteRangeContendTxn tests RangeDelete while
the target resources are being updated. When `txnLeasing` needs a
server-side transaction, it must ensure that the mod revision of every key is
less than what it saw. If the compare fails, it retries the
server-side transaction until it succeeds. I believe the test case is
trying to verify how `txnLeasing` handles the race.
Before patch #15401, the resource-updating goroutine kept updating until
RangeDelete finished. The test case was flaky because the two goroutines
shared one `ctx`, and the grpc-go client won't wait for the response if `ctx`
has been canceled.
For example,
| DelLease Goroutine | PutLease Goroutine | ETCD Server | Key/0 Status |
| -- | --- | -- | -- |
| deleted | | | version = 0 |
| | send update(key/0=123) req | received update(key/0=123) req | version = 0 |
| cancel | | | version = 0 |
| | exit because of cancel | | version = 0 |
| get key/0 by putkv | | | version = 0 |
| | | applied update(key/0=123) | version = 1 |
| get key/0 by raw-cli | | | version = 1 |
So `raw-cli` gets `[key/0=123]` while `putkv` gets `[]`. If `putkv`
sends two update requests to the etcd server and the last one is canceled
before it is applied, the error looks like:
```
expected [key:"key/0" version:2 value:"123" ], got [key:"key/0" version:1 value:"123" ]
```
The resource-updating goroutine should not share the ctx with RangeDelete here.
I also revert the change on the current main branch, because there the
resource-updating goroutine only updates 8 times and might exit before
`RangeDelete`. In that case, `txnLeasing` is not handling the race at all.
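
A rough sketch of the idea, assuming hypothetical `put` and `del` callbacks in place of the real client calls (`runContended`, `fakePut` and `fakeDelete` are illustrative names only). The updater uses its own context and is stopped through a separate channel, so canceling the delete context cannot abandon an in-flight update.

```go
package main

import "context"

// fakePut and fakeDelete are hypothetical stand-ins for the client calls;
// they fail only if their context has already been canceled.
func fakePut(ctx context.Context) error    { return ctx.Err() }
func fakeDelete(ctx context.Context) error { return ctx.Err() }

// runContended keeps calling put in a goroutine while del runs, then
// stops the updater and waits for it to exit.
func runContended(put, del func(context.Context) error) (applied, failed int, err error) {
	delCtx, cancelDel := context.WithCancel(context.Background())
	defer cancelDel()

	stop := make(chan struct{})
	done := make(chan struct{})

	go func() {
		defer close(done)
		for {
			// The updater has its own context, independent of delCtx.
			if putErr := put(context.Background()); putErr != nil {
				failed++
			} else {
				applied++
			}
			select {
			case <-stop:
				return
			default:
			}
		}
	}()

	err = del(delCtx)
	cancelDel()
	close(stop) // ask the updater to exit after its current put
	<-done      // wait, so every put has completed before results are read
	return applied, failed, err
}
```

Because the caller waits on `done`, every update that was sent has also returned before the test inspects the keys.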
Fixes: #15352
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The huge (100k+) value was justified when storev2 was being dumped completely with every snapshot.
With storev2 being decommissioned, we can checkpoint more frequently for faster recovery.
Signed-off-by: James Blair <mail@jamesblair.net>
Fixes etcd-io#15352.
Depending on the goroutine scheduling, the expected count of 8 might not
have been reached yet. This ensures the routine won't stop earlier than
that.
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>