What would you like to be added?
Add a new compactor based on revision count, instead of a fixed time interval.
To make this happen, the mvcc store needs to export a
`CompactNotify` function to notify the compactor that the configured number of
write transactions have occurred since the previous compaction. The
new compactor can pick up the revision change and delete out-of-date data
promptly, instead of waiting out a fixed interval. The underlying bbolt db
can then reuse the freed pages as soon as possible.
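For illustration, a compaction loop driven by such a notification might look
like the sketch below; the channel shape, the `retention` field, and the
interface names are assumptions for this sketch, not the final API:
```go
// Sketch of a revision-count-driven compactor. The mvcc store is assumed
// to fire notifyc once the configured number of write transactions has
// occurred since the previous compaction.
type RevGetter interface {
	Rev() int64 // current revision
}

type Compactable interface {
	Compact(rev int64) error
}

type revisionCountCompactor struct {
	retention int64           // number of revisions to retain
	notifyc   <-chan struct{} // fired by the mvcc store (CompactNotify)
	rg        RevGetter
	c         Compactable
	stopc     chan struct{}
}

func (rc *revisionCountCompactor) Run() {
	for {
		select {
		case <-rc.stopc:
			return
		case <-rc.notifyc:
			if rev := rc.rg.Rev() - rc.retention; rev > 0 {
				// Best effort: a failed compaction is retried on
				// the next notification.
				rc.c.Compact(rev)
			}
		}
	}
}
```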
Why is this needed?
In a Kubernetes cluster running, for instance, Argo Workflows, there will be
batch requests to create pods, followed by a lot of PATCH requests updating
pod status, especially when a pod has more than 3 containers. If such burst
requests increase the db size in a short time, it is easy to exceed the max
quota size. The cluster admin then has to step in to defrag, which may cause
long downtime. So, we hope that etcd can delete out-of-date data as soon as
possible and slow down the growth of the total db size.
Currently, both the revision and periodic compactors are driven by a fixed
time interval, and a fixed interval cannot easily absorb unexpected bursts
of update requests. A new compactor based on revision count can make the
admin's life easier.
For instance, say the average object size is 50 KiB and the new compactor
is configured to compact every 10,000 revisions. etcd then compacts after
roughly 500 MiB of new data has come in, no matter how long it takes to
accumulate those 10,000 revisions. That handles burst update requests well.
Some test results:
* Fixed value size: 10 KiB, Update Rate: 100/s, Total key space: 3,000
```
benchmark put --rate=100 --total=300000 --compact-interval=0 \
--key-space-size=3000 --key-size=256 --val-size=10240
```
| Compactor | DB Total Size | DB InUse Size |
| -- | -- | -- |
| Revision(5min,retention:10000) | 570 MiB | 208 MiB |
| Periodic(1m) | 232 MiB | 165 MiB |
| Periodic(30s) | 151 MiB | 127 MiB |
| NewRevision(retention:10000) | 195 MiB | 187 MiB |
* Random value size: [9 KiB, 11 KiB], Update Rate: 150/s, Total key space: 3,000
```
benchmark put --rate=150 --total=300000 --compact-interval=0 \
--key-space-size=3000 --key-size=256 --val-size=10240 \
--delta-val-size=1024
```
| Compactor | DB Total Size | DB InUse Size |
| -- | -- | -- |
| Revision(5min,retention:10000) | 718 MiB | 554 MiB |
| Periodic(1m) | 297 MiB | 246 MiB |
| Periodic(30s) | 185 MiB | 146 MiB |
| NewRevision(retention:10000) | 186 MiB | 178 MiB |
* Random value size: [6 KiB, 14 KiB], Update Rate: 200/s, Total key space: 3,000
```
benchmark put --rate=200 --total=300000 --compact-interval=0 \
--key-space-size=3000 --key-size=256 --val-size=10240 \
--delta-val-size=4096
```
| Compactor | DB Total Size | DB InUse Size |
| -- | -- | -- |
| Revision(5min,retention:10000) | 874 MiB | 221 MiB |
| Periodic(1m) | 357 MiB | 260 MiB |
| Periodic(30s) | 215 MiB | 151 MiB |
| NewRevision(retention:10000) | 182 MiB | 176 MiB |
For burst requests, the periodic compactor needs a short interval;
otherwise, the total size grows large. I think the new compactor can
handle this well.
Additional Change:
Currently, the quota system only checks the DB total size. However, there
could be a lot of free pages which can be reused by upcoming requests.
Based on this proposal, I also want to extend the current quota system with
the DB's InUse size.
If the InUse size is less than the max quota size, we should allow requests
to update. Since bbolt might be resized when there are no contiguous free
pages available, we should set a hard limit on the overflow, like 1 GiB.
```diff
// Quota represents an arbitrary quota against arbitrary requests. Each request
@@ -130,7 +134,17 @@ func (b *BackendQuota) Available(v interface{}) bool {
return true
}
// TODO: maybe optimize Backend.Size()
- return b.be.Size()+int64(cost) < b.maxBackendBytes
+
+ // Since compaction frees pages that can be reallocated, check
+ // SizeInUse first. If there are no contiguous free pages for a
+ // key/value pair and boltdb keeps resizing, the physical size must
+ // not grow more than 1 GiB past the quota. That is a hard limit.
+ //
+ // TODO: This should be gated behind a flag.
+ if b.be.Size()+int64(cost)-b.maxBackendBytes >= maxAllowedOverflowBytes(b.maxBackendBytes) {
+ return false
+ }
+ return b.be.SizeInUse()+int64(cost) < b.maxBackendBytes
}
```
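The diff references a `maxAllowedOverflowBytes` helper that is not shown; a
plausible definition matching the 1 GiB hard limit described above (tying the
overflow to a fraction of the quota is an assumption of this sketch) could be:
```go
// maxAllowedOverflowBytes bounds how far the physical DB size may grow
// past the configured quota while waiting for compaction to free pages.
// Capping at 1 GiB matches the hard limit described above; the 10%
// fraction is illustrative, not a settled policy.
func maxAllowedOverflowBytes(maxBackendBytes int64) int64 {
	const hardLimit = int64(1 << 30) // 1 GiB
	overflow := maxBackendBytes / 10 // allow up to 10% of the quota
	if overflow > hardLimit {
		return hardLimit
	}
	return overflow
}
```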
It also becomes possible to avoid raising the NOSPACE alarm when compaction
can reclaim enough free pages, which reduces downtime.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
etcdctl/ctlv3: migrate cheggaaa/pb.v1 to cheggaaa/pb/v3
This commit also changes the format of the progress bar, from a custom
progress bar to the default provided by the library.
Old behaviour:
```
./benchmarkv1 put
0 / 10000 B ! 0.00%
3987 / 10000 Boooooooooooooom ! 39.87%
10000 / 10000 Boooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1s
```
New behaviour:
```
./benchmark put
6536 / 10000 [----------------------->________________] 65.36% 7053 p/s
10000 / 10000 [---------------------------------------] 100.00% 7581 p/s
```
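For reference, driving the v3 default bar follows roughly this pattern (a
minimal sketch of the cheggaaa/pb/v3 API, not the actual benchmark code):
```go
package main

import "github.com/cheggaaa/pb/v3"

func main() {
	total := 10000
	bar := pb.StartNew(total) // renders the library's default template
	for i := 0; i < total; i++ {
		// ... issue one request ...
		bar.Increment()
	}
	bar.Finish()
}
```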
Signed-off-by: Mikel Olasagasti Uranga <mikel@olasagasti.info>
This change makes the etcd package compatible with the existing Go
ecosystem for module versioning.
Used this tool to update package imports:
https://github.com/KSubedi/gomove
The current benchmark picks destinations of RPCs at random. However, this
results in divergent benchmark results, because RPCs other than serializable
ranges must be forwarded to the leader node when a follower receives them.
This commit adds a new flag, --target-leader, to avoid the problem. If the
flag is passed, the benchmark always picks an endpoint of the leader node.
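A leader endpoint can be discovered through the maintenance Status API; a
minimal sketch (the helper name and the post-migration import path are
assumptions, and the actual benchmark wiring may differ):
```go
package main

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// leaderEndpoint returns the endpoint whose member ID matches the leader
// ID reported by Status.
func leaderEndpoint(ctx context.Context, cli *clientv3.Client, eps []string) (string, error) {
	for _, ep := range eps {
		resp, err := cli.Status(ctx, ep)
		if err != nil {
			continue
		}
		if resp.Header.MemberId == resp.Leader {
			return ep, nil
		}
	}
	return "", fmt.Errorf("no leader endpoint found")
}
```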
The current benchmark doesn't have an option for configuring the dial
timeout of gRPC. This commit adds --dial-timeout for this purpose. It is
useful for stopping benchmarks that would otherwise hang for a long time.
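The flag maps directly onto clientv3.Config; a minimal sketch:
```go
package main

import (
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// With DialTimeout set, client construction fails promptly instead
	// of blocking indefinitely on an unreachable endpoint.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second, // value of the new --dial-timeout flag
	})
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer cli.Close()
}
```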
This commit adds --user for auth in benchmarks. Its purpose is to measure
the overhead of v3 API authentication. Of course, the given user must be
granted permission on the target keys before benchmarking.
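The credentials are again plain clientv3.Config fields; a minimal sketch,
with an illustrative helper for splitting the --user value:
```go
package main

import (
	"strings"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// newAuthedClient parses a --user value of the form "name:password" and
// hands the credentials to clientv3. The helper name is illustrative.
func newAuthedClient(user string, eps []string) (*clientv3.Client, error) {
	cfg := clientv3.Config{Endpoints: eps}
	parts := strings.SplitN(user, ":", 2)
	cfg.Username = parts[0]
	if len(parts) == 2 {
		cfg.Password = parts[1]
	}
	return clientv3.New(cfg)
}
```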
Example of a case with no authentication:
```
% ./benchmark range k1
bench with linearizable range
10000 / 10000 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%2m10s

Summary:
  Total:        130.1850 secs.
  Slowest:      0.4071 secs.
  Fastest:      0.0064 secs.
  Average:      0.0130 secs.
  Stddev:       0.0079 secs.
  Requests/sec: 76.8138

Response time histogram:
  0.006 [1]    |
  0.046 [9990] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.087 [3]    |
  0.127 [0]    |
  0.167 [3]    |
  0.207 [2]    |
  0.247 [0]    |
  0.287 [0]    |
  0.327 [0]    |
  0.367 [0]    |
  0.407 [1]    |

Latency distribution:
  10% in 0.0076 secs.
  25% in 0.0086 secs.
  50% in 0.0113 secs.
  75% in 0.0146 secs.
  90% in 0.0209 secs.
  95% in 0.0272 secs.
  99% in 0.0344 secs.
```
Example of a case with authentication:
```
% ./benchmark --user=u1:p range k1
bench with linearizable range
10000 / 10000 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%2m11s

Summary:
  Total:        131.4923 secs.
  Slowest:      0.1637 secs.
  Fastest:      0.0065 secs.
  Average:      0.0131 secs.
  Stddev:       0.0070 secs.
  Requests/sec: 76.0501

Response time histogram:
  0.006 [1]    |
  0.022 [9075] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.038 [875]  |∎∎∎
  0.054 [36]   |
  0.069 [5]    |
  0.085 [1]    |
  0.101 [1]    |
  0.117 [0]    |
  0.132 [0]    |
  0.148 [5]    |
  0.164 [1]    |

Latency distribution:
  10% in 0.0076 secs.
  25% in 0.0087 secs.
  50% in 0.0114 secs.
  75% in 0.0150 secs.
  90% in 0.0215 secs.
  95% in 0.0272 secs.
  99% in 0.0347 secs.
```
It seems that the current auth mechanism does not introduce visible overhead.
This commit adds flags for profiling with runtime/pprof to storage put:
- --cpuprofile: specify a path for the CPU profiling result; if it is not
empty, profiling is activated
- --memprofile: specify a path for the heap profiling result; if it is not
empty, profiling is activated
Ideally, the flags should be added to RootCmd. However, adding common flags
shared by child commands requires the ongoing PR:
https://github.com/spf13/cobra/pull/220 . Therefore this commit adds the
flags to storage put only.
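The standard runtime/pprof wiring behind the two flags looks roughly like
this (flag plumbing omitted; the variable names are illustrative):
```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	cpuprofile := "cpu.prof"  // value of --cpuprofile
	memprofile := "heap.prof" // value of --memprofile

	if cpuprofile != "" {
		f, err := os.Create(cpuprofile)
		if err != nil {
			log.Fatal(err)
		}
		if err := pprof.StartCPUProfile(f); err != nil {
			log.Fatal(err)
		}
		defer pprof.StopCPUProfile()
	}

	// ... run the storage put benchmark ...

	if memprofile != "" {
		f, err := os.Create(memprofile)
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		if err := pprof.WriteHeapProfile(f); err != nil {
			log.Fatal(err)
		}
	}
}
```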
Reports depended on writing all results to a large buffered channel and
reading from that channel synchronously. Similarly, requests were buffered
the same way, which can take significant memory with big request strings.
Instead, have reports stream in results as they are produced, then print
when the results channel closes.
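The streaming pattern looks roughly like this (names are illustrative, not
the actual report code):
```go
package main

import (
	"fmt"
	"time"
)

// Result is one measured request.
type Result struct {
	Latency time.Duration
	Err     error
}

// report aggregates results as they arrive instead of buffering them all;
// it finishes when the producers close the results channel.
func report(results <-chan Result) <-chan string {
	donec := make(chan string, 1)
	go func() {
		var n int
		var total time.Duration
		for r := range results {
			if r.Err != nil {
				continue
			}
			n++
			total += r.Latency
		}
		avg := time.Duration(0)
		if n > 0 {
			avg = total / time.Duration(n)
		}
		donec <- fmt.Sprintf("requests: %d, avg latency: %v", n, avg)
	}()
	return donec
}
```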