160 Commits

Author SHA1 Message Date
Fu Wei
d63ca43092
Merge 4db8df677c618b462145fce7cb926c072a0ce932 into c86c93ca2951338115159dcdd20711603044e1f1 2024-09-25 21:36:55 -07:00
redwrasse
d4df7a902e Replaces a number of error equality checks with errors.Is
Signed-off-by: redwrasse <mail@redwrasse.io>
2024-09-03 16:02:24 -07:00
Benjamin Wang
b8b0cf83d1 Skip leadership check if the etcd instance is actively processing heartbeats
Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
2024-08-09 17:02:02 +01:00
Clement
d820cd2b56 etcdserver: change the snapshot + compact into sync operation
Signed-off-by: Clement <gh.2lgqz@aleeas.com>
2024-07-05 01:27:30 +08:00
Baek
60e3f45469 Adds all feature_gate from component-base.
We'll likely use most of the feature_gate package from component-base.
Also this commit moves the pkg from server/internal/pkg to pkg/.

Signed-off-by: Baek <seungtackbaek@google.com>
2024-06-15 05:34:58 +00:00
Baek
69ebaaebca featuregate: adds EtcdServer.FeatureEnabled interface.
The interface can be used throughout the etcd server binary to check if
the feature is enabled or not.

Note that this commit also copies necessary FeatureGate interface from
k8s component-base.

Signed-off-by: Baek <seungtackbaek@google.com>
2024-06-15 05:34:58 +00:00
Max Neverov
c64c996c03 Revert quorum calculation: (active - 1) < 1+((len(m)-1)/2) calculates quorum after a member is deleted.
Signed-off-by: Max Neverov <neverov.max@gmail.com>
2024-04-17 07:55:24 +02:00
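
The expression in the subject of c64c996c03 can be read as the following sketch. The function name and the plain integer parameters are assumptions made for illustration; only the inequality itself comes from the commit subject.

```go
package main

import "fmt"

// quorumLostAfterRemoval evaluates (active - 1) < 1+((total-1)/2), where
// total stands for len(m) in the commit subject. The right-hand side is the
// majority of the cluster once one member has been removed.
func quorumLostAfterRemoval(total, active int) bool {
	quorumAfter := 1 + (total-1)/2
	return active-1 < quorumAfter
}

func main() {
	// 3 members, all active: 2 remain active and 2 are needed.
	fmt.Println(quorumLostAfterRemoval(3, 3)) // false
	// 3 members, only 2 active: 1 would remain active but 2 are needed.
	fmt.Println(quorumLostAfterRemoval(3, 2)) // true
}
```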
Max Neverov
3b16aae947 Fix remove member failed.
Signed-off-by: Max Neverov <neverov.max@gmail.com>
2024-04-17 07:55:24 +02:00
Ivan Valdes
14523bdc21
etcdserver: rename MemberId() to MemberID() to address var-naming
Signed-off-by: Ivan Valdes <ivan@vald.es>
2024-03-18 17:18:29 -07:00
Ivan Valdes
c613b78e6c
etcdserver: address golangci var-naming issues
Signed-off-by: Ivan Valdes <ivan@vald.es>
2024-03-18 17:17:07 -07:00
Siyuan Zhang
3565a822de Add VerifyTxConsistency to backend.
Signed-off-by: Siyuan Zhang <sizhang@google.com>

Update server/storage/backend/verify.go

Co-authored-by: Benjamin Wang <benjamin.wang@broadcom.com>

Update server/storage/backend/verify.go

Co-authored-by: Benjamin Wang <benjamin.wang@broadcom.com>
2024-02-22 11:31:16 -08:00
Ishan Tyagi
16a5e1da71 Added an error log when the learner is not in sync with the etcd leader.
Signed-off-by: ishan16696 <ishan.tyagi@sap.com>
2024-01-30 15:42:11 +05:30
YaoC
f7ab7adf29 server: fix learner metric incorrect issue
Signed-off-by: YaoC <chengyao09@hotmail.com>
2024-01-12 09:36:33 +00:00
Marek Siarkowicz
a2eb17c809
Merge pull request #17199 from serathius/dont-flock
Don't flock snapshot files
2024-01-08 15:03:29 +01:00
Marek Siarkowicz
3471ef133d Add an e2e test and robustness failpoint around recovering from snapshot backend
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2024-01-04 15:25:24 +01:00
Marek Siarkowicz
7f8346b3f2 Don't flock snapshot files
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2024-01-04 14:53:44 +01:00
Marek Siarkowicz
1e8d66ef95 Add beforeOpenSnapshotBackend failpoint
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-12-20 15:36:54 +01:00
Benjamin Wang
67f17166bf Safeguard lease operations by double checking the leadership
1. ignore the old leader's lease revoking requests
2. double check current member's leadership before performing a lease renew request
3. etcdserver: ensure current member's leadership before performing a lease checkpoint request

Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
2023-12-15 17:53:36 +00:00
Benjamin Wang
36b2523669 added some log messages for better diagnosis
Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
2023-12-13 18:43:22 +00:00
Neil Shen
fb769c4306 server: ignore raft messages if member id mismatch
Ignore Raft messages when the `To` field mismatches the local member ID.
In cases where incorrect Raft messages are dispatched, potentially due
to a malfunctioning switch, this proactive check prevents panics,
such as "tocommit is out of range".

Signed-off-by: Neil Shen <overvenus@gmail.com>
2023-12-07 11:57:45 +08:00
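
A minimal sketch of the check described in fb769c4306. The `message` type and `shouldProcess` helper are hypothetical stand-ins and do not reproduce the actual etcd or raft types.

```go
package main

import "fmt"

// message is a hypothetical stand-in for a Raft message; only the fields
// needed for the check are modelled.
type message struct {
	From, To uint64
}

// shouldProcess reports whether a message is addressed to the local member.
func shouldProcess(localID uint64, m message) bool {
	return m.To == localID
}

func main() {
	const localID = 1
	msgs := []message{{From: 2, To: 1}, {From: 3, To: 9}}
	for _, m := range msgs {
		if !shouldProcess(localID, m) {
			// Drop misrouted messages instead of handing them to raft.
			fmt.Printf("ignoring message from %d addressed to %d\n", m.From, m.To)
			continue
		}
		fmt.Printf("processing message from %d\n", m.From)
	}
}
```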
Marek Siarkowicz
bc697bc26e Revert "Switch to validating v3 when v2 and v3 are synchronized"
This reverts commit 4fe46f92030e4381e6f9bf95adbb22a08282d297.

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-12-03 18:12:09 +01:00
Marek Siarkowicz
03d551243b
Merge pull request #17015 from serathius/extract-membership-applier
Extract membership applier
2023-11-27 19:59:21 +01:00
Marek Siarkowicz
4fe46f9203 Switch to validating v3 when v2 and v3 are synchronized
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-11-24 17:46:33 +01:00
Marek Siarkowicz
2ad21558ac Remove shouldApplyV3 from the v3 applier
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-11-24 16:13:25 +01:00
Marek Siarkowicz
d22c00ccee Extract membership applier
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-11-24 15:57:15 +01:00
Marek Siarkowicz
7fdb33065d Move duplicated shouldApplyV3 logic up into apply method
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-11-24 10:21:14 +01:00
Marek Siarkowicz
093666f450 Cleanup v2 applier
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-11-23 15:41:13 +01:00
Marek Siarkowicz
c72ff1e69c Remove syncing the v2 store TTLs
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-11-23 14:55:01 +01:00
Marek Siarkowicz
dd7a4d28a8 Remove code used to make v2 proposals
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-11-19 22:39:33 +01:00
Marek Siarkowicz
b4fd31f254 Remove code for setting cluster version via V2 API
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2023-11-19 15:28:52 +01:00
Chao Chen
1324f03254 add existing http health check handler e2e test
Signed-off-by: Chao Chen <chaochn@amazon.com>
2023-10-18 12:42:23 -07:00
Benjamin Wang
628b45c099 test: add a test case to verify consistent memberlist on bootstrap
Signed-off-by: Benjamin Wang <wachao@vmware.com>
2023-09-28 20:04:47 +01:00
Wei Fu
aa97484166 *: enable goimports in verify-lint
Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-09-21 21:14:09 +08:00
chenyahui
c0aa3b613b Use any instead of interface{}
Signed-off-by: chenyahui <cyhone@qq.com>
2023-09-17 17:41:58 +08:00
Geeta Gharpure
8729417cee Preserve the order of steps done for snapshot
Signed-off-by: Geeta Gharpure <geetagh@amazon.com>
2023-08-22 19:12:37 +00:00
Geeta Gharpure
59332dc194 Update to generate v2 snapshot from v3 state
Signed-off-by: Geeta Gharpure <geetagh@amazon.com>
2023-08-21 19:18:11 +00:00
Jes Cok
52748f60f3 all: stop using math/rand.Seed
Fixes #16428.

Signed-off-by: Jes Cok <xigua67damn@gmail.com>
2023-08-20 16:34:44 +08:00
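
One common way to avoid the global `rand.Seed` (deprecated since Go 1.20) is a locally seeded generator. This is a sketch of that general pattern, not necessarily what 52748f60f3 does; `newRand` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// newRand returns a locally seeded generator instead of seeding the
// package-level one with rand.Seed.
func newRand(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

func main() {
	r := newRand(time.Now().UnixNano())
	fmt.Println(r.Intn(100)) // some value in [0, 100)
}
```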
Wei Fu
4db8df677c feature: add new compactor based revision count
What would you like to be added?

Add a new compactor based on revision count, instead of a fixed time interval.

In order to make it happen, the mvcc store needs to export a
`CompactNotify` function to notify the compactor that the configured number of
write transactions have occurred since the previous compaction. The
new compactor can get the revision change and delete out-of-date data in time,
instead of waiting for a fixed time interval. The underlying bbolt db can
reuse the free pages as soon as possible.
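
The notify-then-compact idea above can be sketched roughly as follows. `revCountCompactor`, `onNotify` and the `compact` hook are hypothetical names; the real `CompactNotify` signature is not shown in this message, so the sketch simply passes the current revision in directly.

```go
package main

import "fmt"

// revCountCompactor is a hypothetical compactor that keeps the newest
// `retention` revisions and compacts everything older when notified.
type revCountCompactor struct {
	retention   int64
	lastCompact int64
	compact     func(rev int64) // hypothetical hook that performs the compaction
}

// onNotify is called with the store's current revision. It returns the
// revision that was compacted, or 0 if there was nothing new to compact.
func (c *revCountCompactor) onNotify(currentRev int64) int64 {
	target := currentRev - c.retention
	if target <= c.lastCompact {
		return 0
	}
	c.compact(target)
	c.lastCompact = target
	return target
}

func main() {
	c := &revCountCompactor{
		retention: 10,
		compact:   func(rev int64) { fmt.Println("compacting up to revision", rev) },
	}
	// Simulate the store notifying the compactor after every 10 writes.
	for rev := int64(10); rev <= 40; rev += 10 {
		c.onNotify(rev)
	}
}
```

The trigger here depends only on how many revisions have been written, not on how much wall-clock time has passed.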

Why is this needed?

In a Kubernetes cluster running, for instance, Argo Workflows, there will be batch
requests to create pods, followed by a lot of pod status PATCH
requests, especially when the pod has more than 3 containers. If the burst
requests increase the db size in a short time, it will be easy to exceed the max
quota size. The cluster admin then has to get involved to defrag, which may cause
long downtime. So, we hope etcd can delete the out-of-date data as
soon as possible and slow down the growth of the total db size.

Currently, both the revision and periodic compactors are driven by time. A
fixed time interval does not cope well with unexpected bursts of update requests.
The new compactor based on revision count can make the admin's life easier.
For instance, let's say that the average object size is 50 KiB and the new
compactor compacts every 10,000 revisions. Then etcd compacts after roughly
500 MiB of new data has been written, no matter how long it takes to
accumulate 10,000 new revisions. It can handle burst update requests well.
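
The estimate above can be checked with a short calculation. The sketch assumes that every revision writes exactly one object of the average size; `bytesBetweenCompactions` is a hypothetical helper.

```go
package main

import "fmt"

const kib = 1024

// bytesBetweenCompactions estimates how much new data accumulates between
// two compactions, assuming each revision writes one object of avgObjectBytes.
func bytesBetweenCompactions(revisions, avgObjectBytes int64) int64 {
	return revisions * avgObjectBytes
}

func main() {
	total := bytesBetweenCompactions(10_000, 50*kib)
	// 10,000 * 50 KiB = 500,000 KiB, which is a little under 500 MiB.
	fmt.Printf("%.1f MiB\n", float64(total)/(1024*1024)) // prints "488.3 MiB"
}
```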

There are some test results:

* Fixed value size: 10 KiB, Update Rate: 100/s, Total key space: 3,000

```
benchmark put --rate=100 --total=300000 --compact-interval=0 \
  --key-space-size=3000 --key-size=256 --val-size=10240
```

|                      Compactor | DB Total Size | DB InUse Size |
|                             -- | --            |            -- |
| Revision(5min,retention:10000) | 570 MiB       |       208 MiB |
|                   Periodic(1m) | 232 MiB       |       165 MiB |
|                  Periodic(30s) | 151 MiB       |       127 MiB |
|   NewRevision(retention:10000) | 195 MiB       |       187 MiB |

* Random value size: [9 KiB, 11 KiB], Update Rate: 150/s, Total key space: 3,000

```
benchmark put --rate=150 --total=300000 --compact-interval=0 \
  --key-space-size=3000 --key-size=256 --val-size=10240 \
  --delta-val-size=1024
```

|                      Compactor | DB Total Size | DB InUse Size |
|                             -- | --            |            -- |
| Revision(5min,retention:10000) | 718 MiB       |       554 MiB |
|                   Periodic(1m) | 297 MiB       |       246 MiB |
|                  Periodic(30s) | 185 MiB       |       146 MiB |
|   NewRevision(retention:10000) | 186 MiB       |       178 MiB |

* Random value size: [6 KiB, 14 KiB], Update Rate: 200/s, Total key space: 3,000

```
benchmark put --rate=200 --total=300000 --compact-interval=0 \
  --key-space-size=3000 --key-size=256 --val-size=10240 \
  --delta-val-size=4096
```

|                      Compactor | DB Total Size | DB InUse Size |
|                             -- | --            |            -- |
| Revision(5min,retention:10000) | 874 MiB       |       221 MiB |
|                   Periodic(1m) | 357 MiB       |       260 MiB |
|                  Periodic(30s) | 215 MiB       |       151 MiB |
|   NewRevision(retention:10000) | 182 MiB       |       176 MiB |

For burst requests, we need to use a short periodic interval.
Otherwise, the total size will be large. I think the new compactor can
handle it well.

Additional Change:

Currently, the quota system only checks the DB total size. However, there
could be a lot of free pages which can be reused for upcoming requests.
Based on this proposal, I also want to extend the current quota system with
the DB's InUse size.

If the InUse size is less than the max quota size, we should allow requests to
update. Since bbolt might be resized if there are no available
contiguous pages, we should set up a hard limit for the overflow, like 1
GiB.

```diff
 // Quota represents an arbitrary quota against arbitrary requests. Each request
@@ -130,7 +134,17 @@ func (b *BackendQuota) Available(v interface{}) bool {
                return true
        }
        // TODO: maybe optimize Backend.Size()
-       return b.be.Size()+int64(cost) < b.maxBackendBytes
+
+       // Since the compact comes with allocatable pages, we should check the
+       // SizeInUse first. If there is no continuous pages for key/value and
+       // the boltdb continues to resize, it should not increase more than 1
+       // GiB. It's hard limitation.
+       //
+       // TODO: It should be enabled by flag.
+       if b.be.Size()+int64(cost)-b.maxBackendBytes >= maxAllowedOverflowBytes(b.maxBackendBytes) {
+               return false
+       }
+       return b.be.SizeInUse()+int64(cost) < b.maxBackendBytes
 }
```
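
The check in the diff above can be exercised in isolation with plain integers. In this sketch the body of `maxAllowedOverflowBytes` is an assumption (a flat 1 GiB, following the text), since the diff only shows the call, and `available` is a hypothetical free function standing in for `BackendQuota.Available`.

```go
package main

import "fmt"

const gib = int64(1) << 30

// maxAllowedOverflowBytes is named in the diff, but its body is not shown
// there. Returning a flat 1 GiB is an assumption made for this sketch.
func maxAllowedOverflowBytes(maxBackendBytes int64) int64 {
	return gib
}

// available mirrors the two-step check from the diff: first the hard limit
// on how far the total size may overflow the quota, then the in-use size.
func available(size, sizeInUse, cost, maxBackendBytes int64) bool {
	if size+cost-maxBackendBytes >= maxAllowedOverflowBytes(maxBackendBytes) {
		return false
	}
	return sizeInUse+cost < maxBackendBytes
}

func main() {
	quota := 2 * gib
	cost := int64(1 << 20)
	// Total size is over quota, but most pages are free: still allowed.
	fmt.Println(available(quota+gib/2, gib, cost, quota)) // true
	// Total size overflows the quota by the hard limit: rejected.
	fmt.Println(available(3*gib, gib, cost, quota)) // false
	// In-use size has reached the quota: rejected.
	fmt.Println(available(quota+gib/2, quota, cost, quota)) // false
}
```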

It also becomes possible to disable the NOSPACE alarm if compaction frees up
enough pages, which can reduce downtime.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-08-16 23:35:08 +08:00
Chao Chen
6cdc9ae4fe server/etcdserver/raft.go:
1. rename confChangeCh to raftAdvancedC
2. rename waitApply to confChanged
3. add comments and test assertion

Signed-off-by: Chao Chen <chaochn@amazon.com>
2023-06-26 22:42:44 -07:00
Benjamin Wang
ad3b6ee4c6 etcdserver: wait until raft is notified of the confChange before responding to the client
Signed-off-by: Benjamin Wang <wachao@vmware.com>
2023-06-26 13:40:51 -07:00
Geeta Gharpure
550aa152a7 Verify consistent index is latest at the time of snapshot
Signed-off-by: Geeta Gharpure <geetagh@amazon.com>
2023-06-19 16:00:04 +00:00
Chao Chen
f31d0eafb9 tests/e2e: add graceful shutdown test
Signed-off-by: Chao Chen <chaochn@amazon.com>
2023-05-09 17:08:53 -07:00
Chao Chen
caed563e08 fix flaking auth member remove test
Signed-off-by: Chao Chen <chaochn@amazon.com>
2023-04-03 17:41:08 -07:00
Wei Fu
22bdc91302 server/etcdserver: add log for terminating monitors
Add a log for terminating monitors to make debugging easier.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-03-11 15:07:17 +08:00
James Blair
275e10bcf7
Return default snapshot count to 10,000.
The huge (100k+) value was justified when storev2 was being dumped completely with every snapshot.

With storev2 being decommissioned we can checkpoint more frequently for faster recovery.

Signed-off-by: James Blair <mail@jamesblair.net>
2023-03-06 20:21:03 +13:00
guozhao
de8d6b3792 etcdserver: use time.Ticker instead of time.After
Using time.After will create a new Timer in each cycle. In these cases,
it is better to use time.Ticker.

Signed-off-by: guozhao <guozhao@360.cn>
2023-01-17 16:58:13 +08:00
Benjamin Wang
8ed20e85d2 etcdserver: return membership.ErrIDNotFound when the memberID not found
When promoting a learner, we need to wait until the leader's applied ID
catches up to the commitId. Afterwards, check whether the learner ID
exists or not, and return `membership.ErrIDNotFound` directly in the API
if the member ID is not found, to avoid the request being unnecessarily
delivered to raft.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
2023-01-17 06:18:15 +08:00
Piotr Tabor
6f899a7b40
Merge pull request #15052 from ptabor/20221228-goimports-fix
./scripts/fix.sh: Takes care of goimports across the whole project.
2022-12-29 11:31:22 +01:00
Piotr Tabor
9e1abbab6e Fix goimports in all existing files. Execution of ./scripts/fix.sh
Signed-off-by: Piotr Tabor <ptab@google.com>
2022-12-29 09:41:31 +01:00
KiloG
101a2a61ea
etcdserver: fix typo in comment
2022-12-28 18:41:08 +08:00