For the subtest TestKVDelete/ClientAutoTLS, the timeout might be caused
by slow IO or task scheduling. It's really hard to reproduce this case.
Since we don't set `VERBOSE=1` for the e2e tests, the pipeline doesn't
show the `EXPECT_DEBUG` log as the test runs, which makes it hard to
debug when the flaky case shows up. So this patch enables `VERBOSE=1`
to get timestamped output in real time.
The `del a c -w json` operation is only performed after a two-second
delay, so we also increase the timeout from 15s to 30s in case the put
or del operations take more time in an unstable environment.
```
Warning: Keys between "a" and "c" will be deleted. Please interrupt the
command within next 2 seconds to cancel.
```
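As a rough illustration only (the helper names below are made up, not the actual etcd e2e test code), the two knobs this patch touches look roughly like this:
```go
// Hypothetical sketch of the two knobs this patch touches; these names do not
// exist in the etcd e2e framework and are only for illustration.
package e2eutil

import (
	"os"
	"time"
)

// ctlTimeout is the per-operation budget for expect-based ctl commands:
// 30s leaves room for the 2s interactive warning of `del a c` plus slow
// put/del operations on an overloaded CI runner.
const ctlTimeout = 30 * time.Second

// expectDebugEnabled reports whether EXPECT_DEBUG output should be streamed
// while the test runs, which is what VERBOSE=1 turns on.
func expectDebugEnabled() bool {
	return os.Getenv("VERBOSE") == "1"
}
```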
Signed-off-by: Wei Fu <fuweid89@gmail.com>
If the grpc proxy is started with --resolver-prefix, memberlist is expected to return only the alive proxy nodes; when one grpc proxy node is down it should no longer be returned, but it still is.
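As background, here is a minimal sketch of the expected behaviour, assuming the standard clientv3 lease API (this is not the actual grpcproxy register code): each proxy advertises itself under the resolver prefix with a lease, so a proxy that goes down stops sending keepalives, its key expires, and it should no longer be returned.
```go
// Sketch of lease-backed proxy registration, assuming the standard clientv3
// API. Not the actual grpcproxy register code, only an illustration of why a
// dead proxy should disappear from the resolver prefix.
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func registerProxy(ctx context.Context, cli *clientv3.Client, prefix, addr string) error {
	// Grant a short-lived lease; the proxy must keep it alive to stay listed.
	lease, err := cli.Grant(ctx, 5 /* seconds */)
	if err != nil {
		return err
	}

	// Advertise this proxy under the resolver prefix, bound to the lease.
	if _, err := cli.Put(ctx, prefix+"/"+addr, addr, clientv3.WithLease(lease.ID)); err != nil {
		return err
	}

	// Keep the lease alive while the proxy is healthy. If the proxy goes
	// down, keepalives stop, the lease expires, the key is removed, and the
	// dead node is no longer returned.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		return err
	}
	go func() {
		for range ch { // drain keepalive responses
		}
		log.Printf("keepalive channel closed for %s", addr)
	}()
	return nil
}
```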
Signed-off-by: yellowzf <zzhf3311@163.com>
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Co-authored-by: Benjamin Wang <wachao@vmware.com>
Co-authored-by: chao <54131596+chaochn47@users.noreply.github.com>
* Use mod revision for optimistic concurrency (see the sketch after this list).
* Introduce range requests as a more general form of get.
* Add kubernetes-specific traffic generation, for now using pull, but
expected to evolve to use watch.
* Introduce a kubernetes-specific test scenario.
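A minimal sketch of the mod-revision based optimistic concurrency pattern from the first bullet, assuming the standard clientv3 Txn API (the actual traffic generator may be structured differently):
```go
// Sketch of optimistic concurrency on mod revision, assuming the standard
// clientv3 Txn API; the real robustness traffic code may differ.
package main

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// compareAndUpdate writes value only if key has not been modified since the
// revision we last observed, mimicking Kubernetes-style updates.
func compareAndUpdate(ctx context.Context, cli *clientv3.Client, key, value string, lastModRev int64) (bool, error) {
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision(key), "=", lastModRev)).
		Then(clientv3.OpPut(key, value)).
		Else(clientv3.OpGet(key)). // read back the conflicting state
		Commit()
	if err != nil {
		return false, err
	}
	if !resp.Succeeded {
		fmt.Printf("conflict on %q, retry with a fresh revision\n", key)
	}
	return resp.Succeeded, nil
}
```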
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Increase requests to 1000 to increase the sample size and reduce variability, and raise the tolerance threshold from 10% to 15%.
Signed-off-by: James Blair <mail@jamesblair.net>
In a [scheduled test][1], the error shows
```
2023-04-19T11:16:15.8166316Z traffic.go:96: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
```
According to [grpc-keepalive@v1.51.0][2], each frame from the server
refreshes `lastRead`, so the client won't send a `Ping` frame to the
server. But the client used by the [`tombstone` request][3] might hit
the race. Since we use 5ms as the timeout, the client might not receive
the result of the `Ping` from the server in time. The keepalive then
marks the connection as timed out and closes it.
I couldn't reproduce it locally. If we add a sleep before updating
`lastRead`, it can be reproduced sometimes. Still investigating this
part.
```diff
diff --git a/internal/transport/http2_client.go b/internal/transport/http2_client.go
index d518b07e..bee9c00a 100644
--- a/internal/transport/http2_client.go
+++ b/internal/transport/http2_client.go
@@ -1560,6 +1560,7 @@ func (t *http2Client) reader(errCh chan<- error) {
 		t.controlBuf.throttle()
 		frame, err := t.framer.fr.ReadFrame()
 		if t.keepaliveEnabled {
+			time.Sleep(2 * time.Millisecond)
 			atomic.StoreInt64(&t.lastRead, time.Now().UnixNano())
 		}
 		if err != nil {
```
`DialKeepAliveTime` is always >= [10s][4]. I think we should increase
the timeout to avoid flakes caused by an unstable environment.
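For illustration, a less aggressive keepalive configuration with the standard clientv3.Config fields might look like this (the values are examples, not necessarily the ones this patch picks):
```go
// Sketch of a less aggressive client keepalive configuration, assuming the
// standard clientv3.Config fields; the exact values chosen by the patch may
// differ.
package main

import (
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func newClientConfig() clientv3.Config {
	return clientv3.Config{
		Endpoints:            []string{"127.0.0.1:2379"},
		DialKeepAliveTime:    10 * time.Second, // grpc clamps anything lower to 10s anyway
		DialKeepAliveTimeout: 5 * time.Second,  // far more slack than 5ms for the ping ACK
	}
}
```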
And in a [scheduled test][5], the error shows
```
logger.go:130: 2023-04-22T10:45:52.646Z INFO Failed to trigger failpoint {"failpoint": "blackhole", "error": "context deadline exceeded"}
```
Before sending `Status` to the member, the client fails to [pick][6] a
ready connection in time (100ms) and returns the error.
`waitTillSnapshot` is used to ensure the cluster is in a state where a
snapshot transfer will be triggered. We already have a 1min timeout for
injectFailpoints, so I think we can remove the 100ms timeout to avoid
failing the injection unnecessarily.
```
injectFailpoints(1min timeout)
  failpoint.Inject
    triggerBlackhole.Trigger
      blackhole
        waitTillSnapshot
```
> NOTE: I didn't reproduce it either. :(
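A minimal sketch of the idea, assuming the clientv3 Maintenance API (the real waitTillSnapshot code may differ): let the `Status` call inherit the caller's 1min deadline instead of wrapping it in a separate 100ms context.
```go
// Sketch of relying on the parent injectFailpoints context (bounded by its
// 1min timeout) instead of a separate 100ms timeout around Status. Assumes the
// standard clientv3 Maintenance API; the real waitTillSnapshot differs.
package main

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func memberRevision(ctx context.Context, cli *clientv3.Client, endpoint string) (int64, error) {
	// Previously: ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
	// That left too little time for gRPC to pick a ready connection, so we
	// inherit the caller's deadline directly instead.
	resp, err := cli.Status(ctx, endpoint)
	if err != nil {
		return 0, err
	}
	return resp.Header.Revision, nil
}
```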
Reference:
[1]: <https://github.com/etcd-io/etcd/actions/runs/4741737098/jobs/8419176899>
[2]: <eeb9afa1f6/internal/transport/http2_client.go (L1647)>
[3]: <7450cd886d/tests/robustness/traffic.go (L94)>
[4]: <eeb9afa1f6/dialoptions.go (L445)>
[5]: <https://github.com/etcd-io/etcd/actions/runs/4772033408/jobs/8484334015>
[6]: <eeb9afa1f6/clientconn.go (L932)>
REF: #15763
Signed-off-by: Wei Fu <fuweid89@gmail.com>