11 Commits

Author SHA1 Message Date
Marek Siarkowicz
dd248518d1 tests/robustness: Move request progress field from traffic to watch config and pass testScenario to reduce number of arguments
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-10 11:43:02 +02:00
Marek Siarkowicz
92366a5338 tests/robustness: Split model code into deterministic and non-deterministic
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Co-authored-by: Benjamin Wang <wachao@vmware.com>
Co-authored-by: chao <54131596+chaochn47@users.noreply.github.com>
2023-05-05 12:25:10 +02:00
Wei Fu
09d053e035 tests/robustness: tune timeout policy
In a [scheduled test][1], the error shows

```
2023-04-19T11:16:15.8166316Z     traffic.go:96: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
```

According to [grpc-keepalive@v1.51.0][2], each frame from server will
fresh the `lastRead` and it won't file `Ping` frame to server. But the
client used by [`tombstone` request][3] might hit the race. Since we use
5ms as timeout, the client might not receive the result of `Ping` from
server in time. The keepalive will mark it timeout and close the
connection.

I didn't reproduce it in my local. If we add the sleep before update
`lastRead`, it can reproduce it sometimes. Still investigating this
part.

```diff
diff --git a/internal/transport/http2_client.go b/internal/transport/http2_client.go
index d518b07e..bee9c00a 100644
--- a/internal/transport/http2_client.go
+++ b/internal/transport/http2_client.go
@@ -1560,6 +1560,7 @@ func (t *http2Client) reader(errCh chan<- error) {
                t.controlBuf.throttle()
                frame, err := t.framer.fr.ReadFrame()
                if t.keepaliveEnabled {
+                       time.Sleep(2 * time.Millisecond)
                        atomic.StoreInt64(&t.lastRead, time.Now().UnixNano())
                }
                if err != nil {
```

`DialKeepAliveTime` is always >= [10s][4]. I think we should increase
the timeout to avoid flaky caused by unstable env.

And in a [scheduled test][5], the error shows

```
logger.go:130: 2023-04-22T10:45:52.646Z	INFO	Failed to trigger failpoint	{"failpoint": "blackhole", "error": "context deadline exceeded"}
```

Before sending `Status` to member, the client doesn't [pick][6] the
connection in time (100ms) and returns the error.

The `waitTillSnapshot` is used to ensure that it is good enough to
trigger snapshot transfer. And we have 1min timeout for
injectFailpoints, so I think we can remove the 100ms timeout to reduce
unnecessary stop.

```
injectFailpoints(1min timeout)
  failpoint.Inject
    triggerBlockhole.Trigger
      blackhole
        waitTillSnapshot
```

> NOTE: I didn't reproduce it either. :(

Reference:

[1]: <https://github.com/etcd-io/etcd/actions/runs/4741737098/jobs/8419176899>
[2]: <eeb9afa1f6/internal/transport/http2_client.go (L1647)>
[3]: <7450cd886d/tests/robustness/traffic.go (L94)>
[4]: <eeb9afa1f6/dialoptions.go (L445)>
[5]: <https://github.com/etcd-io/etcd/actions/runs/4772033408/jobs/8484334015>
[6]: <eeb9afa1f6/clientconn.go (L932)>

REF: #15763

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-04-29 07:03:47 +08:00
Marek Siarkowicz
1e41d95ab2 tests/robustness: Document analysing watch issue
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-05 22:40:47 +02:00
Peter Wortmann
42a2643df9 tests/robustness: Reproduce issue #15220
This issue is somewhat easily reproduced simply by bombarding the
server with requests for progress notifications, which eventually
leads to one being delivered ahead of the payload message. This is
then caught by the watch response validation code previously added by
Marek Siarkowicz.

Signed-off-by: Peter Wortmann <peter.wortmann@skao.int>
2023-04-05 11:23:02 +01:00
Marek Siarkowicz
6582e349db tests: Enfoce timeout on failpoints
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-04 12:25:07 +02:00
Marek Siarkowicz
0cbd56e8b6 tests: Cleanup endpoints
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-03 12:18:54 +02:00
Marek Siarkowicz
4340cbb4aa
Merge pull request #15575 from serathius/ensure-watch
tests: Ensure watch catches all events generated in traffic
2023-03-30 10:28:22 +02:00
Marek Siarkowicz
ad688b2a85 tests: Ensure watch catches all events generated in traffic
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-03-29 11:41:10 +02:00
Marek Siarkowicz
c54521156e tests: Refactor watch validation
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-03-28 17:32:34 +02:00
Marek Siarkowicz
d475cf81a0 tests: Rename linearizability tests to robustness
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-02-26 14:36:18 +01:00