29 Commits

Author SHA1 Message Date
Marek Siarkowicz
f5e82260da Fix parsing failpoint names when failpoint has value set
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-10-07 18:20:18 +02:00
Marek Siarkowicz
05a77032fc Inject sleep during etcd bootstrap to reproduce https://github.com/etcd-io/etcd/issues/16666
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-10-07 12:31:56 +02:00
Marek Siarkowicz
11b441b605 Reuse embed.Config in e2e cluster config
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-10-01 09:56:31 +02:00
Wei Fu
4704a5af3a *: fix unused issue
Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-09-25 19:37:18 +08:00
Wei Fu
07effc4d0a *: fix revive linter
Remove old revive_pass in the bash scripts and migirate the revive.toml
into golangci linter_settings.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-09-24 14:21:11 +08:00
Wei Fu
5e3910d96c *: fix govet-shadow lint
Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-09-19 20:24:01 +08:00
Marek Siarkowicz
eb32d9cccc tests: Add support for lazyfs
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-07-27 13:42:38 +02:00
Wei Fu
516e096a97 tests/robustness: enhance compact failpoint
If the cluster serves requests slowly, the database has few revision
number and then Compact won't trigger BatchCommit. Add a loop to check
the last revision is big enough to trigger panic.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-07-26 21:32:16 +08:00
Marek Siarkowicz
5e7349b44c
Merge pull request #16094 from serathius/robustness-retry-failpoint
Robustness retry failpoint
2023-06-19 09:10:46 +02:00
Marek Siarkowicz
43b2477c28 tests/robustness: Retry injecting failpoint
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-06-17 17:30:20 +02:00
Marek Siarkowicz
fb16bca44a tests/robustness: Disable blackhole until snapshot for v3.5 and v3.4
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-06-16 13:26:24 +02:00
Marek Siarkowicz
dd248518d1 tests/robustness: Move request progress field from traffic to watch config and pass testScenario to reduce number of arguments
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-10 11:43:02 +02:00
Marek Siarkowicz
ad20230e07 test/robustness: Create dedicated traffic package
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-05-09 10:50:13 +02:00
Wei Fu
09d053e035 tests/robustness: tune timeout policy
In a [scheduled test][1], the error shows

```
2023-04-19T11:16:15.8166316Z     traffic.go:96: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
```

According to [grpc-keepalive@v1.51.0][2], each frame from server will
fresh the `lastRead` and it won't file `Ping` frame to server. But the
client used by [`tombstone` request][3] might hit the race. Since we use
5ms as timeout, the client might not receive the result of `Ping` from
server in time. The keepalive will mark it timeout and close the
connection.

I didn't reproduce it in my local. If we add the sleep before update
`lastRead`, it can reproduce it sometimes. Still investigating this
part.

```diff
diff --git a/internal/transport/http2_client.go b/internal/transport/http2_client.go
index d518b07e..bee9c00a 100644
--- a/internal/transport/http2_client.go
+++ b/internal/transport/http2_client.go
@@ -1560,6 +1560,7 @@ func (t *http2Client) reader(errCh chan<- error) {
                t.controlBuf.throttle()
                frame, err := t.framer.fr.ReadFrame()
                if t.keepaliveEnabled {
+                       time.Sleep(2 * time.Millisecond)
                        atomic.StoreInt64(&t.lastRead, time.Now().UnixNano())
                }
                if err != nil {
```

`DialKeepAliveTime` is always >= [10s][4]. I think we should increase
the timeout to avoid flaky caused by unstable env.

And in a [scheduled test][5], the error shows

```
logger.go:130: 2023-04-22T10:45:52.646Z	INFO	Failed to trigger failpoint	{"failpoint": "blackhole", "error": "context deadline exceeded"}
```

Before sending `Status` to member, the client doesn't [pick][6] the
connection in time (100ms) and returns the error.

The `waitTillSnapshot` is used to ensure that it is good enough to
trigger snapshot transfer. And we have 1min timeout for
injectFailpoints, so I think we can remove the 100ms timeout to reduce
unnecessary stop.

```
injectFailpoints(1min timeout)
  failpoint.Inject
    triggerBlockhole.Trigger
      blackhole
        waitTillSnapshot
```

> NOTE: I didn't reproduce it either. :(

Reference:

[1]: <https://github.com/etcd-io/etcd/actions/runs/4741737098/jobs/8419176899>
[2]: <eeb9afa1f6/internal/transport/http2_client.go (L1647)>
[3]: <7450cd886d/tests/robustness/traffic.go (L94)>
[4]: <eeb9afa1f6/dialoptions.go (L445)>
[5]: <https://github.com/etcd-io/etcd/actions/runs/4772033408/jobs/8484334015>
[6]: <eeb9afa1f6/clientconn.go (L932)>

REF: #15763

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-04-29 07:03:47 +08:00
Marek Siarkowicz
e48ef9ea6a tests/robustness: Disable blackholing traffic till snapshot for v3.4.X
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-12 14:39:57 +02:00
Marek Siarkowicz
d19752f16a tests/robustness: Unify failpoint lists by depending on availability checking
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-12 14:38:30 +02:00
Marek Siarkowicz
625d427eb5 tests/robustness: Separate triggering failpoint from injection
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-12 14:36:50 +02:00
Marek Siarkowicz
932415a8d5 tests/robustness: Verify cluster configuration in failpoint availability
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-12 14:36:50 +02:00
Marek Siarkowicz
5bae6b1e44 tests/robustness: Detect trigger timeout and exit
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-04 15:23:58 +02:00
James Blair
1227754284 Cancel watch if cluster not healthy before or after injecting failpoints.
Signed-off-by: James Blair <mail@jamesblair.net>
2023-04-04 13:58:17 +02:00
Marek Siarkowicz
6582e349db tests: Enfoce timeout on failpoints
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-04 12:25:07 +02:00
Marek Siarkowicz
6f4e5f316e
Merge pull request #15592 from serathius/cleanup-endpoints
tests: Cleanup endpoints
2023-04-03 16:00:44 +02:00
Benjamin Wang
e57dcd5ceb test: fix typo in robustness test
Signed-off-by: Benjamin Wang <wachao@vmware.com>
2023-04-03 18:46:32 +08:00
Marek Siarkowicz
0cbd56e8b6 tests: Cleanup endpoints
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-03 12:18:54 +02:00
Marek Siarkowicz
029315f57e tests/robustness: Support running snapshot tests on older versions
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-03 10:43:06 +02:00
Marek Siarkowicz
03214c0239 Revert "tests/robustness: Disable testing network blackhole until #15595 is fixed"
This reverts commit 013e25fab9f76f0c1a00459555fe42b33f379eb9.

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-04-01 16:32:20 +02:00
Marek Siarkowicz
013e25fab9 tests/robustness: Disable testing network blackhole until #15595 is fixed
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-03-31 13:55:58 +02:00
Marek Siarkowicz
3e5fc2e4fc tests: Enable BlackholeUntilSnapshot robustness scenario
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-03-20 11:15:42 +01:00
Marek Siarkowicz
d475cf81a0 tests: Rename linearizability tests to robustness
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-02-26 14:36:18 +01:00