189 Commits

Author SHA1 Message Date
Marek Siarkowicz
7c68be4cf3 tests/robustness: Implement Range limit and count
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-07 09:32:07 +02:00
Marek Siarkowicz
40f71ef3c6 tests/robustness: Implement delete request for kubernetes scenario
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-05 13:40:46 +02:00
Marek Siarkowicz
92366a5338 tests/robustness: Split model code into deterministic and non-deterministic
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Co-authored-by: Benjamin Wang <wachao@vmware.com>
Co-authored-by: chao <54131596+chaochn47@users.noreply.github.com>
2023-05-05 12:25:10 +02:00
Marek Siarkowicz
cfe154209c tests/robustness: Separate describe model functions to dedicated file
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-04 14:03:18 +02:00
Marek Siarkowicz
9b5680c5f1 tests/robustness: Implement first step in validating the Kubernetes-etcd contract.
* Use mod revision for optimistic concurrency.
* Introduce range requests as more general then get
* Add kubernetes specific traffic generation, for now using pull, but
  expected to evolve to use watch.
* Introduce kubernetes specific test scenario

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-04 13:26:54 +02:00
Wei Fu
09d053e035 tests/robustness: tune timeout policy
In a [scheduled test][1], the error shows

```
2023-04-19T11:16:15.8166316Z     traffic.go:96: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
```

According to [grpc-keepalive@v1.51.0][2], each frame from server will
fresh the `lastRead` and it won't file `Ping` frame to server. But the
client used by [`tombstone` request][3] might hit the race. Since we use
5ms as timeout, the client might not receive the result of `Ping` from
server in time. The keepalive will mark it timeout and close the
connection.

I didn't reproduce it in my local. If we add the sleep before update
`lastRead`, it can reproduce it sometimes. Still investigating this
part.

```diff
diff --git a/internal/transport/http2_client.go b/internal/transport/http2_client.go
index d518b07e..bee9c00a 100644
--- a/internal/transport/http2_client.go
+++ b/internal/transport/http2_client.go
@@ -1560,6 +1560,7 @@ func (t *http2Client) reader(errCh chan<- error) {
                t.controlBuf.throttle()
                frame, err := t.framer.fr.ReadFrame()
                if t.keepaliveEnabled {
+                       time.Sleep(2 * time.Millisecond)
                        atomic.StoreInt64(&t.lastRead, time.Now().UnixNano())
                }
                if err != nil {
```

`DialKeepAliveTime` is always >= [10s][4]. I think we should increase
the timeout to avoid flaky caused by unstable env.

And in a [scheduled test][5], the error shows

```
logger.go:130: 2023-04-22T10:45:52.646Z	INFO	Failed to trigger failpoint	{"failpoint": "blackhole", "error": "context deadline exceeded"}
```

Before sending `Status` to member, the client doesn't [pick][6] the
connection in time (100ms) and returns the error.

The `waitTillSnapshot` is used to ensure that it is good enough to
trigger snapshot transfer. And we have 1min timeout for
injectFailpoints, so I think we can remove the 100ms timeout to reduce
unnecessary stop.

```
injectFailpoints(1min timeout)
  failpoint.Inject
    triggerBlockhole.Trigger
      blackhole
        waitTillSnapshot
```

> NOTE: I didn't reproduce it either. :(

Reference:

[1]: <https://github.com/etcd-io/etcd/actions/runs/4741737098/jobs/8419176899>
[2]: <eeb9afa1f6/internal/transport/http2_client.go (L1647)>
[3]: <7450cd886d/tests/robustness/traffic.go (L94)>
[4]: <eeb9afa1f6/dialoptions.go (L445)>
[5]: <https://github.com/etcd-io/etcd/actions/runs/4772033408/jobs/8484334015>
[6]: <eeb9afa1f6/clientconn.go (L932)>

REF: #15763

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-04-29 07:03:47 +08:00
Benjamin Wang
c7d81acaf0 test: forcibly save data on pinicking
Signed-off-by: Benjamin Wang <wachao@vmware.com>
2023-04-27 14:54:35 +08:00
Marek Siarkowicz
e48ef9ea6a tests/robustness: Disable blackholing traffic till snapshot for v3.4.X
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-12 14:39:57 +02:00
Marek Siarkowicz
d19752f16a tests/robustness: Unify failpoint lists by depending on availability checking
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-12 14:38:30 +02:00
Marek Siarkowicz
625d427eb5 tests/robustness: Separate triggering failpoint from injection
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-12 14:36:50 +02:00
Marek Siarkowicz
932415a8d5 tests/robustness: Verify cluster configuration in failpoint availability
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-12 14:36:50 +02:00
Chao Chen
941c4afb0c tests/framwork/e2e/cluster.go: revert back to sequential cluster stop to reduce e2e test run time
Signed-off-by: Chao Chen <chaochn@amazon.com>
2023-04-11 22:08:18 -07:00
Marek Siarkowicz
7153a8f2f4
Merge pull request #15646 from serathius/robustness-readme-watch-issue
tests/robustness: Document analysing watch issue
2023-04-07 23:45:42 +02:00
Marek Siarkowicz
540d012e5e tests/robustness: Ensure that etcdctl binary is provided
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-05 23:04:20 +02:00
Marek Siarkowicz
1e41d95ab2 tests/robustness: Document analysing watch issue
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-05 22:40:47 +02:00
Peter Wortmann
42a2643df9 tests/robustness: Reproduce issue #15220
This issue is somewhat easily reproduced simply by bombarding the
server with requests for progress notifications, which eventually
leads to one being delivered ahead of the payload message. This is
then caught by the watch response validation code previously added by
Marek Siarkowicz.

Signed-off-by: Peter Wortmann <peter.wortmann@skao.int>
2023-04-05 11:23:02 +01:00
Marek Siarkowicz
5bae6b1e44 tests/robustness: Detect trigger timeout and exit
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-04 15:23:58 +02:00
James Blair
1227754284 Cancel watch if cluster not healthy before or after injecting failpoints.
Signed-off-by: James Blair <mail@jamesblair.net>
2023-04-04 13:58:17 +02:00
Marek Siarkowicz
6582e349db tests: Enfoce timeout on failpoints
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-04 12:25:07 +02:00
Marek Siarkowicz
523f235c82
Merge pull request #15603 from serathius/robustness-finish-with-success
tests: Ensure that operation history finishes with successful request
2023-04-04 12:03:36 +02:00
Marek Siarkowicz
6a5d326519 tests: Ensure that operation history finishes with successful request
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-04 09:40:17 +02:00
Marek Siarkowicz
138fae6246
Merge pull request #15632 from serathius/fix-comparing-etcd-version
tests: Fix comparing etcd version
2023-04-04 09:34:55 +02:00
Marek Siarkowicz
4fab20aa75
Merge pull request #15618 from serathius/robustness-fix-periodic-etcd-version
tests: Fix building incorrect etcd version and make switch strict
2023-04-04 09:30:20 +02:00
Marek Siarkowicz
69afcd1960 tests: Fix comparing etcd version
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-03 21:13:36 +02:00
Marek Siarkowicz
6f4e5f316e
Merge pull request #15592 from serathius/cleanup-endpoints
tests: Cleanup endpoints
2023-04-03 16:00:44 +02:00
Marek Siarkowicz
9c72ecb1f9 tests: Fix building incorrect etcd version and make switch strict
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-03 15:06:10 +02:00
Benjamin Wang
e57dcd5ceb test: fix typo in robustness test
Signed-off-by: Benjamin Wang <wachao@vmware.com>
2023-04-03 18:46:32 +08:00
Marek Siarkowicz
0cbd56e8b6 tests: Cleanup endpoints
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-03 12:18:54 +02:00
Marek Siarkowicz
029315f57e tests/robustness: Support running snapshot tests on older versions
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-03 10:43:06 +02:00
Marek Siarkowicz
03214c0239 Revert "tests/robustness: Disable testing network blackhole until #15595 is fixed"
This reverts commit 013e25fab9f76f0c1a00459555fe42b33f379eb9.

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-01 16:32:20 +02:00
Marek Siarkowicz
71ba0873e3 tests/robustness: Encrypt peer traffic to prevent proxy manipulating packets
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-01 16:31:53 +02:00
Marek Siarkowicz
013e25fab9 tests/robustness: Disable testing network blackhole until #15595 is fixed
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-03-31 13:55:58 +02:00
Marek Siarkowicz
4340cbb4aa
Merge pull request #15575 from serathius/ensure-watch
tests: Ensure watch catches all events generated in traffic
2023-03-30 10:28:22 +02:00
Marek Siarkowicz
ad688b2a85 tests: Ensure watch catches all events generated in traffic
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-03-29 11:41:10 +02:00
Marek Siarkowicz
c54521156e tests: Refactor watch validation
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-03-28 17:32:34 +02:00
Marek Siarkowicz
5223d09d41
Merge pull request #14838 from serathius/linearizability-docs
tests: Document robustness tests
2023-03-28 16:22:09 +02:00
Marek Siarkowicz
d03ac88b36 tests: Document robustness tests
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-03-28 15:08:43 +02:00
Marek Siarkowicz
3e5fc2e4fc tests: Enable BlackholeUntilSnapshot robustness scenario
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-03-20 11:15:42 +01:00
Marek Siarkowicz
d475cf81a0 tests: Rename linearizability tests to robustness
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-02-26 14:36:18 +01:00