Having too many delete requests is bad because they are not unique
requests, so linearization is more prone to time out on them.
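As a hedged illustration (the type, function, and weights below are assumptions, not the actual traffic-generation code), deletes can be kept rare by drawing request types from a weighted list:

```go
package main

import (
	"fmt"
	"math/rand"
)

type requestType string

// requestChance pairs an operation with its relative weight.
type requestChance struct {
	operation requestType
	chance    int
}

// pickRequest draws an operation proportionally to its weight,
// so a low weight on Delete keeps delete requests infrequent.
func pickRequest(choices []requestChance) requestType {
	sum := 0
	for _, c := range choices {
		sum += c.chance
	}
	roll := rand.Intn(sum)
	for _, c := range choices {
		if roll < c.chance {
			return c.operation
		}
		roll -= c.chance
	}
	return choices[len(choices)-1].operation
}

func main() {
	choices := []requestChance{
		{operation: "Get", chance: 50},
		{operation: "Put", chance: 45},
		{operation: "Delete", chance: 5},
	}
	fmt.Println(pickRequest(choices))
}
```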
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
For now we only validate the stale read revision, not the response
content. The reason is that the etcd model only stores the latest
version of each key, and keeps no history like real etcd does.
Validating stale read contents needs to be done outside of the model,
as storing the whole history is just too costly for linearization
validation.
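A minimal sketch of such a revision-only check, under the assumption that the model tracks the newest revision it has observed; the function name and signature are illustrative, not the actual model code:

```go
package main

import "fmt"

// validateStaleReadRevision checks only the revision of a stale read,
// not its contents: the requested read revision must not exceed the
// newest revision the model has observed so far.
func validateStaleReadRevision(readRevision, latestObservedRevision int64) error {
	if readRevision > latestObservedRevision {
		return fmt.Errorf("stale read at revision %d, but model has only observed revision %d",
			readRevision, latestObservedRevision)
	}
	return nil
}

func main() {
	fmt.Println(validateStaleReadRevision(5, 7)) // <nil>
	fmt.Println(validateStaleReadRevision(9, 7)) // error
}
```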
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
It is just too complicated to support starting from a non-empty etcd.
The existing implementation was very naive in assuming that we can
build the full state from just one request. We might consider
implementing validation of non-empty history in the future, but for now
setting this limit should clean up the code and speed up development.
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Co-authored-by: Benjamin Wang <wachao@vmware.com>
Co-authored-by: chao <54131596+chaochn47@users.noreply.github.com>
* Use mod revision for optimistic concurrency (see the sketch after
  this list).
* Introduce range requests as more general than get.
* Add Kubernetes-specific traffic generation, for now using pull, but
  expected to evolve to use watch.
* Introduce a Kubernetes-specific test scenario.
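For the mod-revision item above, a short sketch using the public clientv3 `Txn` API; the key, value, and endpoint are illustrative, and this is not the exact traffic-generation code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// optimisticUpdate writes newValue only if the key's ModRevision is still
// the one we read, mirroring Kubernetes-style optimistic concurrency.
func optimisticUpdate(ctx context.Context, c *clientv3.Client, key, newValue string) error {
	// Read the key and remember its ModRevision.
	get, err := c.Get(ctx, key)
	if err != nil {
		return err
	}
	if len(get.Kvs) == 0 {
		return fmt.Errorf("key %q not found", key)
	}
	rev := get.Kvs[0].ModRevision

	// Write only if nobody modified the key in the meantime.
	txn, err := c.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision(key), "=", rev)).
		Then(clientv3.OpPut(key, newValue)).
		Commit()
	if err != nil {
		return err
	}
	if !txn.Succeeded {
		return fmt.Errorf("conflict: %q was updated concurrently", key)
	}
	return nil
}

func main() {
	c, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()
	fmt.Println(optimisticUpdate(context.Background(), c, "/registry/pods/default/example", "v2"))
}
```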
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
In a [scheduled test][1], the error shows
```
2023-04-19T11:16:15.8166316Z traffic.go:96: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
```
According to [grpc-keepalive@v1.51.0][2], each frame received from the
server refreshes `lastRead`, so the client won't fire a `Ping` frame to
the server. But the client used by the [`tombstone` request][3] might
hit the race. Since we use 5ms as the timeout, the client might not
receive the result of the `Ping` from the server in time. The keepalive
will mark it as timed out and close the connection.
I didn't reproduce it locally. If we add a sleep before updating
`lastRead`, it can sometimes be reproduced. Still investigating this
part.
```diff
diff --git a/internal/transport/http2_client.go b/internal/transport/http2_client.go
index d518b07e..bee9c00a 100644
--- a/internal/transport/http2_client.go
+++ b/internal/transport/http2_client.go
@@ -1560,6 +1560,7 @@ func (t *http2Client) reader(errCh chan<- error) {
t.controlBuf.throttle()
frame, err := t.framer.fr.ReadFrame()
if t.keepaliveEnabled {
+ time.Sleep(2 * time.Millisecond)
atomic.StoreInt64(&t.lastRead, time.Now().UnixNano())
}
if err != nil {
```
`DialKeepAliveTime` is always >= [10s][4]. I think we should increase
the timeout to avoid flakiness caused by an unstable environment.
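As a minimal sketch (the endpoint and exact durations are illustrative, not the values chosen in the fix), the client keepalive settings live on `clientv3.Config`:

```go
package main

import (
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// DialKeepAliveTime is clamped to a 10s minimum by gRPC (see [4]),
	// so a keepalive timeout of a few milliseconds buys little and is
	// easy to miss on a slow CI machine.
	c, err := clientv3.New(clientv3.Config{
		Endpoints:            []string{"localhost:2379"},
		DialKeepAliveTime:    10 * time.Second,
		DialKeepAliveTimeout: 10 * time.Second, // raised from a few milliseconds
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()
}
```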
And in a [scheduled test][5], the error shows
```
logger.go:130: 2023-04-22T10:45:52.646Z INFO Failed to trigger failpoint {"failpoint": "blackhole", "error": "context deadline exceeded"}
```
Before sending `Status` to the member, the client doesn't [pick][6] the
connection in time (100ms) and returns the error.
The `waitTillSnapshot` check is used to ensure that conditions are good
enough to trigger a snapshot transfer. And since we already have a 1min
timeout for injectFailpoints, I think we can remove the 100ms timeout
to reduce unnecessary failures.
```
injectFailpoints(1min timeout)
failpoint.Inject
triggerBlackhole.Trigger
blackhole
waitTillSnapshot
```
> NOTE: I didn't reproduce it either. :(
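A hedged sketch of relying only on the outer deadline; the `Member` interface and this `waitTillSnapshot` signature are simplified assumptions, not the actual test helpers:

```go
package robustness

import (
	"context"
	"time"
)

// Member is a simplified stand-in for the robustness test's member handle.
type Member interface {
	// ReadyForSnapshot reports whether enough entries have accumulated
	// to trigger a snapshot transfer.
	ReadyForSnapshot(ctx context.Context) (bool, error)
}

// waitTillSnapshot polls the member until a snapshot transfer would be
// triggered. It is bounded only by the caller's context (the 1min
// injectFailpoints deadline) instead of an extra 100ms per-call timeout.
func waitTillSnapshot(ctx context.Context, m Member) error {
	for {
		// Previously a 100ms context.WithTimeout here could expire
		// before the client even picked a connection.
		ready, err := m.ReadyForSnapshot(ctx)
		if err != nil {
			return err
		}
		if ready {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
}
```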
Reference:
[1]: <https://github.com/etcd-io/etcd/actions/runs/4741737098/jobs/8419176899>
[2]: <eeb9afa1f6/internal/transport/http2_client.go (L1647)>
[3]: <7450cd886d/tests/robustness/traffic.go (L94)>
[4]: <eeb9afa1f6/dialoptions.go (L445)>
[5]: <https://github.com/etcd-io/etcd/actions/runs/4772033408/jobs/8484334015>
[6]: <eeb9afa1f6/clientconn.go (L932)>
REF: #15763
Signed-off-by: Wei Fu <fuweid89@gmail.com>