If the grpc proxy is started with --resolver-prefix, memberlist returns all alive proxy nodes. When one grpc proxy node goes down, it is expected to be dropped from the result, but it is still returned.
Signed-off-by: yellowzf <zzhf3311@163.com>
Increase the number of requests to 1000 to enlarge the sample size and reduce variability, and raise the tolerance threshold from 10% to 15%.
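A tolerance check of this kind can be sketched as a relative-deviation test. The helper name `withinTolerance` and the even-split expectation below are assumptions for illustration only; the commit message does not show the test's actual quantities.

```go
package main

import (
	"fmt"
	"math"
)

// withinTolerance reports whether got deviates from want by at most
// tolerance, expressed as a fraction of want (0.15 means 15%).
// Hypothetical helper, not taken from the etcd test.
func withinTolerance(got, want, tolerance float64) bool {
	if want == 0 {
		return got == 0
	}
	return math.Abs(got-want)/math.Abs(want) <= tolerance
}

func main() {
	const requests = 1000             // sample size from the commit message
	expected := float64(requests) / 2 // assumed even split, for illustration only

	// 560 is 12% away from 500: accepted at 15%, rejected at 10%.
	fmt.Println(withinTolerance(560, expected, 0.15))
	fmt.Println(withinTolerance(560, expected, 0.10))
}
```

A larger sample narrows the spread of the measured value, and the wider threshold leaves room for the variability that remains.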
Signed-off-by: James Blair <mail@jamesblair.net>
This is a follow-up to #15667.
This patch is to use zaptest/observer as base to provide a similar
function to pkg/expect.Expect.
Test environment:
```bash
11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
mkdir /sys/fs/cgroup/etcd-followup-15667
echo 0-2 | tee /sys/fs/cgroup/etcd-followup-15667/cpuset.cpus # three cores
```
Before change:
* memory.peak: ~ 681 MiB
* Elapsed (wall clock) time (h:mm:ss or m:ss): 6:14.04
After change:
* memory.peak: ~ 671 MiB
* Elapsed (wall clock) time (h:mm:ss or m:ss): 6:13.07
Based on the test results, I think it is safe to enable it by default.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
TestV3WatchRestoreSnapshotUnsync sets up a three-member cluster.
Before serving any update requests from the client, and after the leader
is elected, each member's log is at index 8: 3 x ConfChange +
3 x ClusterMemberAttrSet + 1 x ClusterVersionSet.
Based on the config (SnapshotCount: 10, CatchUpCount: 5), we need to
send enough update requests to trigger a snapshot at least twice.
T1: L(snapshot-index: 11, compacted-index: 6) F_m0(index: 8)
T2: L(snapshot-index: 22, compacted-index: 17) F_m0(index: 8, out of date)
After member0 recovers from the network partition, it will reject the
leader's request and return a hint (index:8, term:x). If this happens
after the second snapshot, the leader will find that index:8 is out of
date and be forced to transfer a snapshot.
However, the client only sends 15 update requests and the leader doesn't
finish the snapshot process in time. Since the latest
compacted-index is 6, the leader can still replicate index:9 to member0
instead of sending a snapshot.
```bash
cd tests/integration
CLUSTER_DEBUG=true go test -v -count=1 -run TestV3WatchRestoreSnapshotUnsync ./
...
INFO m2.raft 3da8ba707f1a21a4 became leader at term 2 {"member": "m2"}
...
INFO m2 triggering snapshot {"member": "m2", "local-member-id": "3da8ba707f1a21a4", "local-member-applied-index": 22, "local-member-snapshot-index": 11, "local-member-snapshot-count": 10, "snapshot-forced": false}
...
cluster.go:1359: network partition between: 99626fe5001fde8b <-> 1c964119da6db036
cluster.go:1359: network partition between: 99626fe5001fde8b <-> 3da8ba707f1a21a4
cluster.go:416: WaitMembersForLeader
INFO m0.raft 99626fe5001fde8b became follower at term 2 {"member": "m0"}
INFO m0.raft raft.node: 99626fe5001fde8b elected leader 3da8ba707f1a21a4 at term 2 {"member": "m0"}
DEBUG m2.raft 3da8ba707f1a21a4 received MsgAppResp(rejected, hint: (index 8, term 2)) from 99626fe5001fde8b for index 23 {"member": "m2"}
DEBUG m2.raft 3da8ba707f1a21a4 decreased progress of 99626fe5001fde8b to [StateReplicate match=8 next=9 inflight=15] {"member": "m2"}
DEBUG m0 Applying entries {"member": "m0", "num-entries": 15}
DEBUG m0 Applying entry {"member": "m0", "index": 9, "term": 2, "type": "EntryNormal"}
....
INFO m2 saved snapshot {"member": "m2", "snapshot-index": 22}
INFO m2 compacted Raft logs {"member": "m2", "compact-index": 17}
```
To fix this issue, the patch uses a log monitor to watch for "compacted Raft
logs" and expects two members to compact their logs twice.
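The idea of waiting on a log message can be sketched with the standard library alone. The `logMonitor` type and its methods below are hypothetical stand-ins, not the helper added by the patch, and the sketch tracks a single member rather than two.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"time"
)

// logMonitor records log lines so a test can wait for a message to show up.
// Hypothetical stand-in for an observer-backed logger hook.
type logMonitor struct {
	mu    sync.Mutex
	lines []string
}

// Observe stores one log line; it is safe for concurrent use.
func (m *logMonitor) Observe(line string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lines = append(m.lines, line)
}

// count returns how many recorded lines contain substr.
func (m *logMonitor) count(substr string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	n := 0
	for _, l := range m.lines {
		if strings.Contains(l, substr) {
			n++
		}
	}
	return n
}

// WaitFor polls until at least want lines contain substr or timeout expires.
func (m *logMonitor) WaitFor(substr string, want int, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if m.count(substr) >= want {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(5 * time.Millisecond)
	}
}

func main() {
	m := &logMonitor{}
	go func() {
		m.Observe(`compacted Raft logs {"compact-index": 6}`)
		m.Observe(`saved snapshot {"snapshot-index": 22}`)
		m.Observe(`compacted Raft logs {"compact-index": 17}`)
	}()
	// Proceed only once the second compaction has been logged.
	fmt.Println(m.WaitFor("compacted Raft logs", 2, time.Second))
}
```

Waiting on the log line ties the test to the event it depends on (the second compaction) instead of to a fixed number of client requests.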
Fixes: #15545
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This will fail basically every time, as the progress notification
request catches the watcher in an unsynchronised state.
Signed-off-by: Peter Wortmann <peter.wortmann@skao.int>
The cluster version is initialized after the member becomes leader.
The update is handled asynchronously. It can't be updated if the member
has been closed and the Go runtime picks the `s.stopping` channel first.
```go
// e2a5df534c/server/etcdserver/server.go (L2170)
func (s *EtcdServer) monitorClusterVersions() {
	...
	for {
		select {
		case <-s.firstCommitInTerm.Receive():
		case <-time.After(monitorVersionInterval):
		case <-s.stopping:
			return
		}
		...
	}
}
```
Alternatively, once `s.stopping` has been closed, [UpdateClusterVersion][1]
can't schedule its work through GoAttach. For #15409, we can see the warning log
`server has stopped; skipping GoAttach` from GoAttach:
```plain
https://github.com/etcd-io/etcd/actions/runs/4340931587/jobs/7580103902
logger.go:130: 2023-03-06T07:36:44.253Z WARN default stopping grpc server due to error {"error": "accept tcp 127.0.0.1:2379: use of closed network connection"}
logger.go:130: 2023-03-06T07:36:44.253Z WARN default stopped grpc server due to error {"error": "accept tcp 127.0.0.1:2379: use of closed network connection"}
logger.go:130: 2023-03-06T07:36:44.253Z ERROR default setting up serving from embedded etcd failed. {"error": "accept tcp 127.0.0.1:2379: use of closed network connection"}
logger.go:130: 2023-03-06T07:36:44.253Z ERROR default setting up serving from embedded etcd failed. {"error": "http: Server closed"}
logger.go:130: 2023-03-06T07:36:44.253Z INFO default skipped leadership transfer for single voting member cluster {"local-member-id": "8e9e05c52164694d", "current-leader-member-id": "8e9e05c52164694d"}
logger.go:130: 2023-03-06T07:36:44.253Z WARN default server has stopped; skipping GoAttach
...
```
If the cluster version isn't updated, the minimum storage version will
be v3.5 because [AuthStatus][2] was introduced in [v3.5][3].
The comparison will fail.
To fix this issue, we should wait for the cluster version to become ready
after the server is ready to serve requests.
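Waiting for the cluster version can be sketched as a bounded polling loop. `waitUntil`, the `clusterVersion` closure, and the version string below are hypothetical; the actual patch may use a different mechanism.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitUntil polls ready every interval until it returns true or timeout
// elapses. Hypothetical helper for illustration.
func waitUntil(ready func() bool, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for !ready() {
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for condition")
		}
		time.Sleep(interval)
	}
	return nil
}

func main() {
	// clusterVersion stands in for whatever the server reports. The empty
	// string means "not decided yet"; the value "3.6.0" is made up.
	versions := []string{"", "", "3.6.0"}
	i := 0
	clusterVersion := func() string {
		v := versions[i]
		if i < len(versions)-1 {
			i++
		}
		return v
	}

	err := waitUntil(func() bool { return clusterVersion() != "" },
		time.Millisecond, time.Second)
	fmt.Println("cluster version ready, err =", err)
}
```

The test then stops the member only after the asynchronous update has landed, so neither the `s.stopping` branch nor a skipped GoAttach can leave the version unset.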
[1]: <e2a5df534c/server/etcdserver/adapters.go (L45)>
[2]: <071e70cdc4>
[3]: <1b4e54c238>
Signed-off-by: Wei Fu <fuweid89@gmail.com>
TestLeasingDeleteRangeContendTxn tests RangeDelete while
the target resources are being updated. When `txnLeasing` wants a
server-side transaction, it needs to ensure that the mod revision of every key
is less than what it saw. If the compare fails, it retries the
server-side transaction until it is successful. I believe the test case is
trying to verify how `txnLeasing` handles the race.
Before the patch #15401, the resource-updating goroutine keeps updating until
the RangeDelete finishes. The testcase is flaky because two goroutines are
sharing one `ctx` and grpc-go client won't wait for the response if `ctx`
has been canceled.
For example,
| DelLease Goroutine | PutLease Goroutine | ETCD Server | Key/0 Status |
| -- | --- | -- | -- |
| deleted | | | version = 0 |
| | send update(key/0=123) req | received update(key/0=123) req | version = 0 |
| cancel | | | version = 0 |
| | exit because of cancel | | version = 0 |
| get key/0 by putkv | | | version = 0 |
| | | applied update(key/0=123) | version = 1 |
| get key/0 by raw-cli | | | version = 1 |
So `raw-cli` gets `[key/0=123]` while the `putkv` gets `[]`. If `putkv`
applies two update reqs to ETCD server and the last one is canceled
before apply, the error will be like:
```
expected [key:"key/0" version:2 value:"123" ], got [key:"key/0" version:1 value:"123" ]
```
The resource-updating goroutine should not share the ctx with RangeDelete here.
I also revert the change on the current main branch because the
resource-updating goroutine only updates 8 times and might exit before
`RangeDelete`. In that case, `txnLeasing` never has to handle the race.
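The effect of sharing a context can be reproduced with a small stand-alone sketch. `fakePut` and `demo` are hypothetical: the fake server always applies the write after a delay, while the caller stops waiting as soon as its context is canceled, which is the behaviour described above for the grpc-go client.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// fakePut mimics a client call: the fake server applies the write after
// delay, but the caller returns early if ctx is canceled first.
func fakePut(ctx context.Context, applied *atomic.Int64, delay time.Duration) error {
	done := make(chan struct{})
	go func() {
		time.Sleep(delay)
		applied.Add(1) // the server applies the write regardless of ctx
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo issues one put while the deleter's context is canceled mid-flight.
// It reports whether the put was acknowledged and how many writes the fake
// server applied in the end.
func demo(shareCtx bool) (acked bool, applied int64) {
	var counter atomic.Int64
	deleteCtx, cancelDelete := context.WithCancel(context.Background())
	defer cancelDelete()

	putCtx := context.Background() // independent context for the updater
	if shareCtx {
		putCtx = deleteCtx
	}

	// The delete side finishes and cancels while the put is in flight.
	time.AfterFunc(2*time.Millisecond, cancelDelete)
	err := fakePut(putCtx, &counter, 30*time.Millisecond)

	time.Sleep(100 * time.Millisecond) // let the fake server finish applying
	return err == nil, counter.Load()
}

func main() {
	fmt.Println(demo(true))  // write applied but never acknowledged
	fmt.Println(demo(false)) // write applied and acknowledged
}
```

With a shared context the updater loses track of a write that the server still applies, which is how the two reads in the table end up disagreeing.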
Fixes: #15352
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Fixes etcd-io#15352.
Depending on the goroutine scheduling, the expected count of 8 might not
have been reached yet. This ensures the routine won't stop earlier than
that.
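One way to keep a goroutine from stopping before a minimum number of iterations is to consult the stop signal only after the minimum is reached. `runUpdater` and its parameters are hypothetical names for this sketch, not the code from the fix.

```go
package main

import "fmt"

// runUpdater calls put repeatedly and returns how many calls it made.
// It honours stop only once at least minUpdates calls have completed,
// so an early stop signal cannot cut the run short.
func runUpdater(stop <-chan struct{}, minUpdates int, put func(i int)) int {
	n := 0
	for {
		if n >= minUpdates {
			select {
			case <-stop:
				return n
			default:
			}
		}
		put(n)
		n++
	}
}

func main() {
	stop := make(chan struct{})
	close(stop) // stop is requested before the updater even starts

	n := runUpdater(stop, 8, func(int) {})
	fmt.Println(n) // still performs the 8 required updates
}
```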
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
Added optional TLS min/max protocol version and command line switches to set
versions for the etcd server.
If max version is not explicitly set by the user, let Go select the max
version which is currently TLSv1.3. Previously max version was set to TLSv1.2.
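In Go's `crypto/tls`, leaving `MaxVersion` at zero lets the library choose the highest version it supports. The sketch below shows that behaviour with a hypothetical `newTLSConfig` helper; it is not the option parsing added to etcd, and the command line switches themselves are not shown.

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// newTLSConfig builds a tls.Config from optional version bounds. A zero
// value means "not set by the user"; crypto/tls then applies its own
// default for that bound. Hypothetical helper for illustration.
func newTLSConfig(minVersion, maxVersion uint16) (*tls.Config, error) {
	if maxVersion != 0 && minVersion > maxVersion {
		return nil, fmt.Errorf("min version %#x is greater than max version %#x",
			minVersion, maxVersion)
	}
	return &tls.Config{MinVersion: minVersion, MaxVersion: maxVersion}, nil
}

func main() {
	// Only the minimum is pinned; the maximum is left to Go.
	cfg, err := newTLSConfig(tls.VersionTLS12, 0)
	if err != nil {
		panic(err)
	}
	fmt.Printf("min=%#x max=%#x\n", cfg.MinVersion, cfg.MaxVersion)

	// An inverted range is rejected up front.
	_, err = newTLSConfig(tls.VersionTLS13, tls.VersionTLS12)
	fmt.Println(err)
}
```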
Signed-off-by: Tero Saarni <tero.saarni@est.tech>
The change made in https://github.com/etcd-io/etcd/pull/14824 fixed
the test instead of the product code, which isn't correct. Now that
the product code is fixed in this PR, we can revert the change made in
that PR.
Signed-off-by: Benjamin Wang <wachao@vmware.com>
Fix comments flagged by goword in the Go _test package files that the
shell function go_srcs_in_module lists, following the changes in #14827.
Helps in #14827
Signed-off-by: Bhargav Ravuri <bhargav.ravuri@infracloud.io>
Fix comments flagged by goword in the Go test files that the shell
function go_srcs_in_module lists, following the changes in #14827.
Helps in #14827
Signed-off-by: Bhargav Ravuri <bhargav.ravuri@infracloud.io>
If the corrupted member has been elected as leader, the memberID in the alert
response won't be the corrupted one. It will be a smaller follower ID, since
raftCluster.Members is always sorted by ID. We should check the leader
ID and then decide which memberID to use.
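The selection can be sketched as follows. The function `expectedAlarmMemberID` is hypothetical, and it assumes that when the leader is the corrupted member, the first follower in ID order is the one reported; the commit message only says it is "a smaller follower ID".

```go
package main

import (
	"fmt"
	"sort"
)

// expectedAlarmMemberID returns the member ID a test might expect in the
// alarm response. Assumption: if the corrupted member is the leader, the
// first follower in ID order is reported instead.
func expectedAlarmMemberID(memberIDs []uint64, leaderID, corruptedID uint64) uint64 {
	if leaderID != corruptedID {
		return corruptedID
	}
	sorted := append([]uint64(nil), memberIDs...) // keep the caller's slice intact
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	for _, id := range sorted {
		if id != leaderID {
			return id
		}
	}
	return corruptedID // single-member cluster: nothing else to report
}

func main() {
	members := []uint64{30, 10, 20}
	fmt.Println(expectedAlarmMemberID(members, 10, 10)) // corrupted leader
	fmt.Println(expectedAlarmMemberID(members, 20, 10)) // corrupted follower
}
```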
Fixes: #14823
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Check the values of myKey and myRev first in the Unlock method to prevent calling Unlock without Lock, because doing so may cause the value of pfx to be deleted by mistake.
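The guard can be sketched against an in-memory map. The `mutex` type, its `store` field, the `Lock` signature, and the error value below are simplified assumptions, not the `concurrency.Mutex` implementation; only the field names `pfx`, `myKey` and `myRev` come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

var errLockReleased = errors.New("lock has already been released or was never acquired")

// mutex is a simplified stand-in that keeps its keys in a map.
type mutex struct {
	pfx   string
	myKey string
	myRev int64
	store map[string]string // fake key-value store
}

// Lock records ownership under pfx+id at the given revision.
func (m *mutex) Lock(id string, rev int64) {
	m.myKey = m.pfx + id
	m.myRev = rev
	m.store[m.myKey] = ""
}

// Unlock deletes the owned key. It refuses to delete anything when the
// mutex holds no key, which is the check described above.
func (m *mutex) Unlock() error {
	if m.myKey == "" || m.myRev <= 0 {
		return errLockReleased
	}
	delete(m.store, m.myKey)
	m.myKey, m.myRev = "", -1
	return nil
}

func main() {
	m := &mutex{pfx: "lock/", store: map[string]string{"lock/": "keep"}}
	fmt.Println(m.Unlock()) // rejected: Lock was never called
	m.Lock("a", 5)
	fmt.Println(m.Unlock()) // succeeds
	fmt.Println(m.store)    // the unrelated "lock/" entry is untouched
}
```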
Signed-off-by: chenyahui <cyhone@qq.com>
Check the client count before creating the ephemeral key, and do not
create the key if there are already too many clients. After creating
the key, check the count again: if the total number of kvs is bigger
than the expected count, check the rev of the current key and act
on it. If its rev is among the first ${count}, it is a valid
client; otherwise, it should fail.
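The post-creation check can be sketched as a pure function over the revisions of the existing keys. `isValidClient` and its parameters are hypothetical names; fetching the keys from the store is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// isValidClient reports whether the key created at myRev may stay, given
// the create revisions of all keys under the prefix and the allowed count.
// Hypothetical helper for illustration.
func isValidClient(revs []int64, myRev int64, count int) bool {
	if len(revs) <= count {
		return true // not over the limit, nothing to arbitrate
	}
	sorted := append([]int64(nil), revs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	for _, r := range sorted[:count] {
		if r == myRev {
			return true // among the first count keys
		}
	}
	return false
}

func main() {
	revs := []int64{7, 3, 9, 5} // four keys were created, two are allowed
	fmt.Println(isValidClient(revs, 5, 2))
	fmt.Println(isValidClient(revs, 7, 2))
}
```

Ordering by revision gives every client the same view of who arrived first, so clients that raced past the initial count check can still agree on which of them must back off.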
Signed-off-by: Benjamin Wang <wachao@vmware.com>