* etcdserver: rename `minimumBatchInterval` to `defaultCompactionSleepInterval` and `defaultCompactBatchLimit` to `defaultCompactionBatchLimit`
Signed-off-by: Jalin Wang <JalinWang@outlook.com>
Before this patch, the tombstone could be deleted if its revision was equal
to the compacted revision, which meant the watch subscriber would not
receive this DELETE event. Based on the Compact API [1], we should keep the
tombstone revision if it is not less than the compaction revision.
> CompactionRequest compacts the key-value store up to a given revision.
> All superseded keys with a revision less than the compaction revision
> will be removed.
[1]: https://etcd.io/docs/latest/dev-guide/api_reference_v3/
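A minimal sketch of the keep rule described above, using plain integers
rather than etcd's keyIndex internals:

```go
package main

import "fmt"

// tombstoneKeptAfterCompaction mirrors the rule quoted from the Compact API:
// only revisions strictly less than the compaction revision are superseded,
// so a tombstone whose revision equals the compaction revision must be kept
// and its DELETE event can still reach watch subscribers.
func tombstoneKeptAfterCompaction(tombstoneRev, compactRev int64) bool {
	return tombstoneRev >= compactRev
}

func main() {
	fmt.Println(tombstoneKeptAfterCompaction(14, 14)) // true: kept at the compacted revision
	fmt.Println(tombstoneKeptAfterCompaction(13, 14)) // false: superseded, may be removed
}
```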
Signed-off-by: Wei Fu <fuweid89@gmail.com>
In commit [1], the newTestKeyIndex function creates one key with two
Revision{Main: 14} revisions. However, since [2], the etcd server does not
allow duplicate keys in a single transaction. This update to
newTestKeyIndex avoids confusion and keeps the test consistent with the
latest etcd server behavior.
REF:
[1]: be80d11948
[2]: https://github.com/etcd-io/etcd/pull/4376
Signed-off-by: Wei Fu <fuweid89@gmail.com>
golangci-lint reports the following issue:
storage/mvcc/kvstore.go:312:27: (*store).restore - result 0 (error) is always nil (unparam)
This is because both Attach() and compactLockfree() within restore() can
return an error, but the current implementation only logs them. Thus, the
return value of restore() is always nil, hence the linter warning.
We have agreed to suppress the linter warning for now [1].
Reference:
[1] https://github.com/etcd-io/etcd/pull/18228#issuecomment-2187309957
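For reference, silencing such a warning with golangci-lint looks roughly
like the snippet below; the store type and helper are stubs so the example
compiles on its own, not etcd's actual restore code:

```go
package main

import "log"

// store is a stub standing in for etcd's mvcc store; it only exists so the
// nolint example compiles on its own.
type store struct{}

func (s *store) attach() error { return nil }

// restore mirrors the situation the linter flags: the error return is kept
// in the signature, but failures from the helpers are only logged, so the
// result is always nil and the unparam warning is suppressed.
//
//nolint:unparam
func (s *store) restore() error {
	if err := s.attach(); err != nil {
		log.Printf("attach failed: %v", err)
	}
	return nil
}

func main() {
	_ = (&store{}).restore()
}
```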
Signed-off-by: Chun-Hung Tseng <henrybear327@gmail.com>
scheduleCompaction function
To improve traceability of backend database usage, this adds the
parameters below, related to backend database usage metrics, inside the
scheduleCompaction function (a sketch of the logging follows the list).
current-db-size-bytes
current-db-size
current-db-size-in-use-bytes
current-db-size-in-use
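A minimal sketch of what logging these fields might look like, assuming a
zap logger and a backend that exposes Size() and SizeInUse(); only the
field names come from the list above, the surrounding function and stubs
are illustrative:

```go
package main

import (
	"github.com/dustin/go-humanize"
	"go.uber.org/zap"
)

// backend is a stub exposing the two size readings behind the new fields.
type backend interface {
	Size() int64      // total db file size in bytes
	SizeInUse() int64 // bytes in use, excluding free pages
}

// logCompactionDBUsage emits the backend usage fields once a scheduled
// compaction finishes, in both raw-byte and human-readable form.
func logCompactionDBUsage(lg *zap.Logger, be backend) {
	lg.Info(
		"finished scheduled compaction",
		zap.Int64("current-db-size-bytes", be.Size()),
		zap.String("current-db-size", humanize.Bytes(uint64(be.Size()))),
		zap.Int64("current-db-size-in-use-bytes", be.SizeInUse()),
		zap.String("current-db-size-in-use", humanize.Bytes(uint64(be.SizeInUse()))),
	)
}

type fakeBackend struct{ size, inUse int64 }

func (f fakeBackend) Size() int64      { return f.size }
func (f fakeBackend) SizeInUse() int64 { return f.inUse }

func main() {
	logCompactionDBUsage(zap.NewExample(), fakeBackend{size: 64 << 20, inUse: 48 << 20})
}
```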
Signed-off-by: Rahul More <rahulbapumore@gmail.com>
The HashByRev goroutines exit once they receive the `donec` notification.
The check-computed-hashes goroutine then never gets a chance to receive
the hash result and is stuck forever. We should add a `donec` case when we
wait for the hash result.
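A self-contained sketch of the fix: the waiter selects on `donec` as well,
so it cannot block forever once the hash workers have exited. The channel
names mirror the description above; everything else is simplified:

```go
package main

import "fmt"

// waitForHash returns the computed hash, or false if the hash workers were
// told to stop (donec closed) before a result arrived.
func waitForHash(hashc <-chan uint32, donec <-chan struct{}) (uint32, bool) {
	select {
	case h := <-hashc:
		return h, true
	case <-donec:
		// Without this case the caller would block forever after the
		// HashByRev goroutines exit on donec.
		return 0, false
	}
}

func main() {
	donec := make(chan struct{})
	close(donec)
	h, ok := waitForHash(make(chan uint32), donec)
	fmt.Println(h, ok) // 0 false
}
```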
Signed-off-by: Wei Fu <fuweid89@gmail.com>
What would you like to be added?
Add a new compactor based on revision count, instead of a fixed time
interval.
To make this happen, the mvcc store needs to export a `CompactNotify`
function to notify the compactor that the configured number of write
transactions have occurred since the previous compaction. The new
compactor can then observe the revision change and delete out-of-date data
promptly, instead of waiting for a fixed time interval, so the underlying
bbolt db can reuse the free pages as soon as possible.
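A rough, self-contained sketch of that idea, not etcd's actual
implementation: the store counts write transactions and signals a buffered
channel that a `CompactNotify`-style accessor exposes to the compactor.
All names besides `CompactNotify` are made up for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

// notifyingStore counts committed write transactions and signals the
// compactor once the configured number of revisions has accumulated.
type notifyingStore struct {
	mu          sync.Mutex
	writes      int64
	threshold   int64
	compactNoti chan struct{}
}

func newNotifyingStore(threshold int64) *notifyingStore {
	return &notifyingStore{threshold: threshold, compactNoti: make(chan struct{}, 1)}
}

// CompactNotify exposes the channel the compactor waits on.
func (s *notifyingStore) CompactNotify() <-chan struct{} { return s.compactNoti }

// commitWrite is called at the end of each write transaction.
func (s *notifyingStore) commitWrite() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.writes++
	if s.writes >= s.threshold {
		s.writes = 0
		select {
		case s.compactNoti <- struct{}{}: // wake the compactor
		default: // a notification is already pending
		}
	}
}

func main() {
	s := newNotifyingStore(3)
	go func() {
		for i := 0; i < 3; i++ {
			s.commitWrite()
		}
	}()
	<-s.CompactNotify()
	fmt.Println("compactor notified after 3 write transactions")
}
```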
Why is this needed?
In a Kubernetes cluster, for instance one running Argo Workflows, there
will be batches of requests to create pods, followed by a lot of PATCH
requests for pod status, especially when a pod has more than 3 containers.
If the burst of requests increases the db size in a short time, it is easy
to exceed the max quota size. The cluster admin then has to step in to
defragment, which may cause long downtime. So we hope etcd can delete
out-of-date data as soon as possible and slow down the growth of the total
db size.
Currently, both the revision and periodic compactors are time-based. A
fixed time interval does not cope well with unexpected bursts of update
requests. The new compactor, based on revision count, makes the admin's
life easier.
For instance, suppose the average object size is 50 KiB and the new
compactor compacts every 10,000 revisions. Then etcd effectively compacts
after roughly 500 MiB of new data has come in, no matter how long it takes
to accumulate those 10,000 revisions. This handles burst update requests
well.
Here are some test results:
* Fixed value size: 10 KiB, Update Rate: 100/s, Total key space: 3,000
```
benchmark put --rate=100 --total=300000 --compact-interval=0 \
--key-space-size=3000 --key-size=256 --val-size=10240
```
| Compactor | DB Total Size | DB InUse Size |
| -- | -- | -- |
| Revision(5min,retention:10000) | 570 MiB | 208 MiB |
| Periodic(1m) | 232 MiB | 165 MiB |
| Periodic(30s) | 151 MiB | 127 MiB |
| NewRevision(retention:10000) | 195 MiB | 187 MiB |
* Random value size: [9 KiB, 11 KiB], Update Rate: 150/s, Total key space: 3,000
```
benchmark put --rate=150 --total=300000 --compact-interval=0 \
--key-space-size=3000 --key-size=256 --val-size=10240 \
--delta-val-size=1024
```
| Compactor | DB Total Size | DB InUse Size |
| -- | -- | -- |
| Revision(5min,retention:10000) | 718 MiB | 554 MiB |
| Periodic(1m) | 297 MiB | 246 MiB |
| Periodic(30s) | 185 MiB | 146 MiB |
| NewRevision(retention:10000) | 186 MiB | 178 MiB |
* Random value size: [6 KiB, 14 KiB], Update Rate: 200/s, Total key space: 3,000
```
benchmark put --rate=200 --total=300000 --compact-interval=0 \
--key-space-size=3000 --key-size=256 --val-size=10240 \
--delta-val-size=4096
```
| Compactor | DB Total Size | DB InUse Size |
| -- | -- | -- |
| Revision(5min,retention:10000) | 874 MiB | 221 MiB |
| Periodic(1m) | 357 MiB | 260 MiB |
| Periodic(30s) | 215 MiB | 151 MiB |
| NewRevision(retention:10000) | 182 MiB | 176 MiB |
For burst requests, we need to use a short periodic interval; otherwise,
the total size becomes large. I think the new compactor can handle this
well.
Additional Change:
Currently, the quota system only checks the DB total size. However, there
could be a lot of free pages which could be reused for upcoming requests.
Based on this proposal, I also want to extend the current quota system with
the DB's InUse size.
If the InUse size is less than the max quota size, we should allow update
requests. Since bbolt might still be resized if there are no contiguous
free pages available, we should set up a hard limit for the overflow, like
1 GiB.
```diff
// Quota represents an arbitrary quota against arbitrary requests. Each request
@@ -130,7 +134,17 @@ func (b *BackendQuota) Available(v interface{}) bool {
return true
}
// TODO: maybe optimize Backend.Size()
- return b.be.Size()+int64(cost) < b.maxBackendBytes
+
+ // Since compaction frees reusable pages, we should check the
+ // SizeInUse first. If there are no contiguous free pages for key/value
+ // data and boltdb keeps resizing, it must not grow by more than 1
+ // GiB. That is a hard limit.
+ //
+ // TODO: It should be enabled by flag.
+ if b.be.Size()+int64(cost)-b.maxBackendBytes >= maxAllowedOverflowBytes(b.maxBackendBytes) {
+ return false
+ }
+ return b.be.SizeInUse()+int64(cost) < b.maxBackendBytes
}
```
We could also likely avoid raising the NOSPACE alarm when compaction frees
enough pages, which would reduce downtime.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
check ScheduledCompactKeyName and FinishedCompactKeyName
before writing the hash to the hashstore. If they do not match, it means this compaction was interrupted at some point and its hash value is invalid. In such cases, we do not write the hash values to the hashstore, which avoids a false corruption alarm.
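A simplified sketch of the guard, with a hypothetical helper in place of
reading ScheduledCompactKeyName and FinishedCompactKeyName from the store:

```go
package main

import "fmt"

// storeHashIfValid only persists the compaction hash when the scheduled and
// finished compact revisions agree; a mismatch means the compaction was
// interrupted and the hash would be misleading. The revision arguments are
// hypothetical stand-ins for the values read from ScheduledCompactKeyName
// and FinishedCompactKeyName.
func storeHashIfValid(scheduledRev, finishedRev int64, hash uint32, put func(rev int64, hash uint32)) bool {
	if scheduledRev != finishedRev {
		// Interrupted compaction: skip the write to avoid a false
		// corruption alarm later.
		return false
	}
	put(finishedRev, hash)
	return true
}

func main() {
	stored := storeHashIfValid(100, 100, 0xdeadbeef, func(rev int64, h uint32) {
		fmt.Printf("stored hash %x at rev %d\n", h, rev)
	})
	fmt.Println("stored:", stored)
}
```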
Signed-off-by: caojiamingalan <alan.c.19971111@gmail.com>
Progress notifications requested using ProgressRequest were sent
directly using the ctrlStream, which means that they could race
against watch responses in the watchStream.
This would especially happen when the stream was not synced - e.g. if
you requested a progress notification on a freshly created unsynced
watcher, the notification would typically arrive indicating a revision
for which not all watch responses had been sent.
This changes the behaviour so that v3rpc always goes through the watch
stream, using a new RequestProgressAll function that closely matches
the behaviour of the v3rpc code - i.e.
1. Generate a message with WatchId -1, indicating the revision for
*all* watchers in the stream
2. Guarantee that a response is (eventually) sent
The latter might require us to defer the response until all watchers
are synced, which is likely as it should be. Note that we do *not*
guarantee that the number of progress notifications matches the number
of requests, only that eventually at least one gets sent.
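A condensed, self-contained sketch of that flow with simplified stand-in
types (not the actual v3rpc ones): the request is remembered, and a single
WatchId -1 response is emitted through the watch stream once it is synced.

```go
package main

import "fmt"

// progressResponse mimics a watch response used purely as a progress
// notification for all watchers on the stream (WatchId -1).
type progressResponse struct {
	WatchID  int64
	Revision int64
}

// watchStream is a simplified stand-in for a per-client watch stream.
type watchStream struct {
	synced          bool
	rev             int64
	progressPending bool
	out             []progressResponse
}

// RequestProgressAll records that a progress notification for all watchers
// was requested; it is only emitted once the stream is synced, so it can
// never indicate a revision ahead of the watch responses already delivered.
func (ws *watchStream) RequestProgressAll() {
	ws.progressPending = true
	ws.maybeSendProgress()
}

func (ws *watchStream) maybeSendProgress() {
	if ws.progressPending && ws.synced {
		ws.out = append(ws.out, progressResponse{WatchID: -1, Revision: ws.rev})
		ws.progressPending = false
	}
}

func main() {
	ws := &watchStream{rev: 7}
	ws.RequestProgressAll() // deferred: stream not yet synced
	ws.synced = true
	ws.maybeSendProgress() // now the single notification goes out
	fmt.Println(ws.out)
}
```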
Signed-off-by: Peter Wortmann <peter.wortmann@skao.int>
Problem: during restore in watchableStore.Restore, synced watchers are moved to unsynced.
minRev will be behind, since it is not updated while the watcher stays synced.
Solution: update minRev when moving the watcher to unsynced.
fixes: https://github.com/etcd-io/etcd/issues/15271
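A minimal sketch of the idea with simplified types (not etcd's
watchableStore): when formerly synced watchers are moved to unsynced
during restore, their minRev is advanced past the current revision.

```go
package main

import "fmt"

// watcher is a simplified stand-in; minRev is the next revision the watcher
// still needs to observe.
type watcher struct {
	id     int
	minRev int64
}

// moveToUnsynced is called for previously synced watchers during restore.
// Without bumping minRev they would appear to lag behind, because minRev is
// not advanced while a watcher stays in the synced group.
func moveToUnsynced(ws []*watcher, currentRev int64, unsynced map[int]*watcher) {
	for _, w := range ws {
		if w.minRev <= currentRev {
			w.minRev = currentRev + 1 // resume from the next revision
		}
		unsynced[w.id] = w
	}
}

func main() {
	unsynced := map[int]*watcher{}
	moveToUnsynced([]*watcher{{id: 1, minRev: 3}}, 10, unsynced)
	fmt.Println(unsynced[1].minRev) // 11
}
```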
Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
The original test case used a `return` statement which skipped the
`restore` case. This change enables the `restore` test case.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
After setting the CompactionSleepInterval, we can use time.Ticker
instead of time.After to optimize scheduleCompaction(); otherwise
it will fail in the `TestStoreCompact` test.
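A small standard-library illustration of the change: one reusable
time.Ticker instead of a fresh timer from time.After on every iteration;
the loop body is a placeholder, not the actual scheduleCompaction code.

```go
package main

import (
	"fmt"
	"time"
)

// compactLoop waits between batches using a single reusable Ticker rather
// than calling time.After in every iteration, which would allocate a new
// timer each time around the loop.
func compactLoop(interval time.Duration, batches int, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for i := 0; i < batches; i++ {
		// ... compact one batch here ...
		select {
		case <-ticker.C:
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	compactLoop(10*time.Millisecond, 3, stop)
	fmt.Println("done")
}
```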
Signed-off-by: guozhao <guozhao@360.cn>
Comments fixed as per goword in the Go test files that the shell
function go_srcs_in_module lists, as per the changes in #14827.
Helps with #14827.
Signed-off-by: Bhargav Ravuri <bhargav.ravuri@infracloud.io>