If the cluster serves requests slowly, the database has few revision
number and then Compact won't trigger BatchCommit. Add a loop to check
the last revision is big enough to trigger panic.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
In a [scheduled test][1], the error shows
```
2023-04-19T11:16:15.8166316Z traffic.go:96: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
```
According to [grpc-keepalive@v1.51.0][2], each frame from server will
fresh the `lastRead` and it won't file `Ping` frame to server. But the
client used by [`tombstone` request][3] might hit the race. Since we use
5ms as timeout, the client might not receive the result of `Ping` from
server in time. The keepalive will mark it timeout and close the
connection.
I didn't reproduce it in my local. If we add the sleep before update
`lastRead`, it can reproduce it sometimes. Still investigating this
part.
```diff
diff --git a/internal/transport/http2_client.go b/internal/transport/http2_client.go
index d518b07e..bee9c00a 100644
--- a/internal/transport/http2_client.go
+++ b/internal/transport/http2_client.go
@@ -1560,6 +1560,7 @@ func (t *http2Client) reader(errCh chan<- error) {
t.controlBuf.throttle()
frame, err := t.framer.fr.ReadFrame()
if t.keepaliveEnabled {
+ time.Sleep(2 * time.Millisecond)
atomic.StoreInt64(&t.lastRead, time.Now().UnixNano())
}
if err != nil {
```
`DialKeepAliveTime` is always >= [10s][4]. I think we should increase
the timeout to avoid flaky caused by unstable env.
And in a [scheduled test][5], the error shows
```
logger.go:130: 2023-04-22T10:45:52.646Z INFO Failed to trigger failpoint {"failpoint": "blackhole", "error": "context deadline exceeded"}
```
Before sending `Status` to member, the client doesn't [pick][6] the
connection in time (100ms) and returns the error.
The `waitTillSnapshot` is used to ensure that it is good enough to
trigger snapshot transfer. And we have 1min timeout for
injectFailpoints, so I think we can remove the 100ms timeout to reduce
unnecessary stop.
```
injectFailpoints(1min timeout)
failpoint.Inject
triggerBlockhole.Trigger
blackhole
waitTillSnapshot
```
> NOTE: I didn't reproduce it either. :(
Reference:
[1]: <https://github.com/etcd-io/etcd/actions/runs/4741737098/jobs/8419176899>
[2]: <eeb9afa1f6/internal/transport/http2_client.go (L1647)>
[3]: <7450cd886d/tests/robustness/traffic.go (L94)>
[4]: <eeb9afa1f6/dialoptions.go (L445)>
[5]: <https://github.com/etcd-io/etcd/actions/runs/4772033408/jobs/8484334015>
[6]: <eeb9afa1f6/clientconn.go (L932)>
REF: #15763
Signed-off-by: Wei Fu <fuweid89@gmail.com>