21 Commits

Author SHA1 Message Date
Wei Fu
07effc4d0a *: fix revive linter
Remove old revive_pass in the bash scripts and migrate the revive.toml
into golangci linter_settings.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-09-24 14:21:11 +08:00
Marek Siarkowicz
11da84a1d1 tests/robustness: Implement loading client reports
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-06-28 15:35:17 +02:00
Marek Siarkowicz
26cd2bc017 tests/robustness: Store whole watch operations
Want to keep watch requests to properly validate reliability of watch
stream.

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-06-24 18:15:50 +02:00
Marek Siarkowicz
7bbc738ec4 tests/robustness: Extract validation to separate package
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-06-14 09:14:27 +02:00
Marek Siarkowicz
16bf0f6641 tests/robustness: Use traffic.RecordingClient in watch
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-25 22:17:23 +02:00
Marek Siarkowicz
4872b679a5 tests/robustness: Expect revisions to be unique for Kubernetes Traffic
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-23 15:51:10 +02:00
Marek Siarkowicz
6429f47631 tests/robustness: Validate all etcd watches opened to etcd
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-16 10:28:01 +02:00
Marek Siarkowicz
911c40a347 tests/robustness: Implement kubernetes list watch protocol
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-15 10:11:05 +02:00
Bogdan Kanivets
c338882d7a tests/robustness: use monotonic clock for watch events
see: https://github.com/etcd-io/etcd/pull/15323
For consistency watch events should also use only time-measurement operations.
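
As a minimal sketch of what "time-measurement operations" means in Go: `time.Now` returns a value that carries a monotonic clock reading, and `time.Since` subtracts using that reading, so recorded offsets are not affected by wall-clock adjustments. The `event` and `recorder` types below are hypothetical and are not taken from the etcd test code.

```go
package main

import (
	"fmt"
	"time"
)

// event is a hypothetical stand-in for a recorded watch event.
type event struct {
	Revision int64
	// Time is the offset from the start of the run, not a wall-clock timestamp.
	Time time.Duration
}

// recorder stamps events relative to a fixed base time.
type recorder struct {
	base   time.Time
	events []event
}

func newRecorder() *recorder {
	// time.Now returns a Time that includes a monotonic clock reading.
	return &recorder{base: time.Now()}
}

// record stores the event with an offset computed by time.Since, which
// uses the monotonic reading of base rather than the wall clock.
func (r *recorder) record(revision int64) {
	r.events = append(r.events, event{Revision: revision, Time: time.Since(r.base)})
}

func main() {
	r := newRecorder()
	for rev := int64(1); rev <= 3; rev++ {
		r.record(rev)
	}
	for _, e := range r.events {
		fmt.Printf("revision=%d offset=%s\n", e.Revision, e.Time)
	}
}
```

Storing a `time.Duration` offset instead of a `time.Time` keeps every recorded value comparable with the others, assuming they all share the same base.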

fixes: https://github.com/etcd-io/etcd/issues/15328
Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
2023-05-14 12:58:13 -07:00
Marek Siarkowicz
831ce4c3cf tests/robustness: Improve naming of Txn fields
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-12 13:10:25 +02:00
Marek Siarkowicz
dd248518d1 tests/robustness: Move request progress field from traffic to watch config and pass testScenario to reduce number of arguments
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-10 11:43:02 +02:00
Marek Siarkowicz
92366a5338 tests/robustness: Split model code into deterministic and non-deterministic
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Co-authored-by: Benjamin Wang <wachao@vmware.com>
Co-authored-by: chao <54131596+chaochn47@users.noreply.github.com>
2023-05-05 12:25:10 +02:00
Wei Fu
09d053e035 tests/robustness: tune timeout policy
In a [scheduled test][1], the error shows

```
2023-04-19T11:16:15.8166316Z     traffic.go:96: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
```

According to [grpc-keepalive@v1.51.0][2], each frame received from the
server refreshes `lastRead`, so the client does not fire a `Ping` frame
at the server. But the client used by the [`tombstone` request][3] might
hit a race. Since we use 5ms as the timeout, the client might not receive
the `Ping` result from the server in time. The keepalive then marks the
connection as timed out and closes it.

I couldn't reproduce it locally. If we add a sleep before updating
`lastRead`, it reproduces sometimes. Still investigating this
part.

```diff
diff --git a/internal/transport/http2_client.go b/internal/transport/http2_client.go
index d518b07e..bee9c00a 100644
--- a/internal/transport/http2_client.go
+++ b/internal/transport/http2_client.go
@@ -1560,6 +1560,7 @@ func (t *http2Client) reader(errCh chan<- error) {
                t.controlBuf.throttle()
                frame, err := t.framer.fr.ReadFrame()
                if t.keepaliveEnabled {
+                       time.Sleep(2 * time.Millisecond)
                        atomic.StoreInt64(&t.lastRead, time.Now().UnixNano())
                }
                if err != nil {
```

`DialKeepAliveTime` is always >= [10s][4]. I think we should increase
the timeout to avoid flakiness caused by an unstable environment.

And in a [scheduled test][5], the error shows

```
logger.go:130: 2023-04-22T10:45:52.646Z	INFO	Failed to trigger failpoint	{"failpoint": "blackhole", "error": "context deadline exceeded"}
```

Before sending `Status` to the member, the client doesn't [pick][6] the
connection in time (100ms) and returns an error.

`waitTillSnapshot` is used to ensure that conditions are good enough to
trigger a snapshot transfer. Since injectFailpoints already has a 1min
timeout, I think we can remove the 100ms timeout to avoid stopping
unnecessarily.

```
injectFailpoints(1min timeout)
  failpoint.Inject
    triggerBlockhole.Trigger
      blackhole
        waitTillSnapshot
```

> NOTE: I didn't reproduce it either. :(

Reference:

[1]: <https://github.com/etcd-io/etcd/actions/runs/4741737098/jobs/8419176899>
[2]: <eeb9afa1f6/internal/transport/http2_client.go (L1647)>
[3]: <7450cd886d/tests/robustness/traffic.go (L94)>
[4]: <eeb9afa1f6/dialoptions.go (L445)>
[5]: <https://github.com/etcd-io/etcd/actions/runs/4772033408/jobs/8484334015>
[6]: <eeb9afa1f6/clientconn.go (L932)>

REF: #15763

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-04-29 07:03:47 +08:00
Marek Siarkowicz
1e41d95ab2 tests/robustness: Document analysing watch issue
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-05 22:40:47 +02:00
Peter Wortmann
42a2643df9 tests/robustness: Reproduce issue #15220
This issue is somewhat easily reproduced simply by bombarding the
server with requests for progress notifications, which eventually
leads to one being delivered ahead of the payload message. This is
then caught by the watch response validation code previously added by
Marek Siarkowicz.

Signed-off-by: Peter Wortmann <peter.wortmann@skao.int>
2023-04-05 11:23:02 +01:00
Marek Siarkowicz
6582e349db tests: Enforce timeout on failpoints
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-04 12:25:07 +02:00
Marek Siarkowicz
0cbd56e8b6 tests: Cleanup endpoints
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-03 12:18:54 +02:00
Marek Siarkowicz
4340cbb4aa
Merge pull request #15575 from serathius/ensure-watch
tests: Ensure watch catches all events generated in traffic
2023-03-30 10:28:22 +02:00
Marek Siarkowicz
ad688b2a85 tests: Ensure watch catches all events generated in traffic
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-03-29 11:41:10 +02:00
Marek Siarkowicz
c54521156e tests: Refactor watch validation
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-03-28 17:32:34 +02:00
Marek Siarkowicz
d475cf81a0 tests: Rename linearizability tests to robustness
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-02-26 14:36:18 +01:00