What would you like to be added?
Add a new compactor based on revision count, instead of a fixed interval time.
To make this happen, the mvcc store needs to export a
`CompactNotify` function to notify the compactor that the configured number of
write transactions have occurred since the previous compaction. The
new compactor can pick up the revision change and delete out-of-date data in time,
instead of waiting for a fixed interval to elapse. The underlying bbolt db can then
reuse the free pages as soon as possible.
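The sketch below shows one way the compactor could consume that notification; `CompactNotify` is the function proposed above, while the channel-based wiring and all other names (`NotifyCompactor`, `Compactable`, `RevGetter`, ...) are assumptions for illustration, not the actual design.
```go
// Sketch only: a compactor driven by mvcc write-count notifications instead of
// a fixed interval.
package compactor

import "context"

// Compactable is the minimal surface the compactor needs from the server.
type Compactable interface {
	Compact(ctx context.Context, rev int64) error
}

// RevGetter reports the store's current revision.
type RevGetter interface {
	Rev() int64
}

type NotifyCompactor struct {
	retention int64         // revisions to keep, e.g. 10,000
	notifyC   chan struct{} // buffered with capacity 1
	rg        RevGetter
	c         Compactable
}

// CompactNotify is what the mvcc store calls once the configured number of
// write transactions has happened since the previous compaction.
func (nc *NotifyCompactor) CompactNotify() {
	select {
	case nc.notifyC <- struct{}{}:
	default: // a compaction is already pending; drop the extra signal
	}
}

// Run waits for notifications and compacts everything older than the
// configured retention.
func (nc *NotifyCompactor) Run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-nc.notifyC:
			rev := nc.rg.Rev() - nc.retention
			if rev <= 0 {
				continue
			}
			// best effort: an error would be logged and retried on the next notify
			_ = nc.c.Compact(ctx, rev)
		}
	}
}
```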
Why is this needed?
In a Kubernetes cluster, for instance with Argo Workflows, there can be batch
requests to create pods, followed by a lot of PATCH requests for pod status,
especially when a pod has more than 3 containers. If such burst
requests increase the db size in a short time, it's easy to exceed the max
quota size. The cluster admin then has to get involved to defrag, which may cause
long downtime. So, we hope etcd can delete the out-of-date data as
soon as possible and slow down the growth of the total db size.
Currently, both the revision and periodic compactors are driven by time. It's not easy
to pick a fixed interval that copes with unexpected bursts of update requests.
The new compactor based on revision count can make the admin's life easier.
For instance, let's say the average object size is 50 KiB and the new
compactor compacts every 10,000 revisions. That means etcd can
compact after roughly 500 MiB of new data (10,000 × 50 KiB) has come in, no matter
how long it takes to receive those 10,000 revisions. It can handle burst update requests well.
Here are some test results:
* Fixed value size: 10 KiB, Update Rate: 100/s, Total key space: 3,000
```
benchmark put --rate=100 --total=300000 --compact-interval=0 \
--key-space-size=3000 --key-size=256 --val-size=10240
```
| Compactor | DB Total Size | DB InUse Size |
| -- | -- | -- |
| Revision(5min,retention:10000) | 570 MiB | 208 MiB |
| Periodic(1m) | 232 MiB | 165 MiB |
| Periodic(30s) | 151 MiB | 127 MiB |
| NewRevision(retention:10000) | 195 MiB | 187 MiB |
* Random value size: [9 KiB, 11 KiB], Update Rate: 150/s, Total key space: 3,000
```
benchmark put --rate=150 --total=300000 --compact-interval=0 \
--key-space-size=3000 --key-size=256 --val-size=10240 \
--delta-val-size=1024
```
| Compactor | DB Total Size | DB InUse Size |
| -- | -- | -- |
| Revision(5min,retention:10000) | 718 MiB | 554 MiB |
| Periodic(1m) | 297 MiB | 246 MiB |
| Periodic(30s) | 185 MiB | 146 MiB |
| NewRevision(retention:10000) | 186 MiB | 178 MiB |
* Random value size: [6 KiB, 14 KiB], Update Rate: 200/s, Total key space: 3,000
```
benchmark put --rate=200 --total=300000 --compact-interval=0 \
--key-space-size=3000 --key-size=256 --val-size=10240 \
--delta-val-size=4096
```
| Compactor | DB Total Size | DB InUse Size |
| -- | -- | -- |
| Revision(5min,retention:10000) | 874 MiB | 221 MiB |
| Periodic(1m) | 357 MiB | 260 MiB |
| Periodic(30s) | 215 MiB | 151 MiB |
| NewRevision(retention:10000) | 182 MiB | 176 MiB |
For burst requests, we would need to use a short periodic interval;
otherwise, the total size grows large. I think the new compactor can
handle this well.
Additional Change:
Currently, the quota system only checks the DB total size. However, there
could be a lot of free pages which can be reused for upcoming requests.
Based on this proposal, I also want to extend the current quota system with the DB's
InUse size.
If the InUse size is less than the max quota size, we should allow requests to
update. Since bbolt might be resized if there are no available
contiguous pages, we should set up a hard limit for the overflow, like 1
GiB.
```diff
@@ -130,7 +134,17 @@ func (b *BackendQuota) Available(v interface{}) bool {
 		return true
 	}
 	// TODO: maybe optimize Backend.Size()
-	return b.be.Size()+int64(cost) < b.maxBackendBytes
+
+	// Since compaction frees pages that can be reused, check SizeInUse
+	// first. If there are no contiguous free pages and boltdb keeps
+	// resizing, the physical file must not grow more than 1 GiB past the
+	// quota. This is a hard limit.
+	//
+	// TODO: It should be enabled by a flag.
+	if b.be.Size()+int64(cost)-b.maxBackendBytes >= maxAllowedOverflowBytes(b.maxBackendBytes) {
+		return false
+	}
+	return b.be.SizeInUse()+int64(cost) < b.maxBackendBytes
 }
```
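For reference, one possible shape of the `maxAllowedOverflowBytes` helper used in the diff; only the fixed 1 GiB value mirrors the hard limit mentioned above, the rest is an assumption rather than the actual code.
```go
// Sketch of the helper referenced in the diff above. Only the 1 GiB cap comes
// from the proposal text; everything else is illustrative.
const defaultMaxAllowedOverflowBytes = int64(1 << 30) // 1 GiB

func maxAllowedOverflowBytes(maxBackendBytes int64) int64 {
	// The physical db file may exceed the quota by at most this amount while
	// the logical (in-use) size stays under the quota. Per the TODO in the
	// diff, this could later be made configurable via a flag.
	return defaultMaxAllowedOverflowBytes
}
```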
It is also likely that the NOSPACE alarm can be avoided if compaction frees
enough pages, which can reduce downtime.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
1. rename confChangeCh to raftAdvancedC
2. rename waitApply to confChanged
3. add comments and test assertion
Signed-off-by: Chao Chen <chaochn@amazon.com>
The huge (100k+) value was justified when storev2 was being dumped completely with every snapshot.
With storev2 being decommissioned we can checkpoint more frequently for faster recovery.
Signed-off-by: James Blair <mail@jamesblair.net>
When promoting a learner, we need to wait until the leader's applied index
catches up to the committed index. Afterwards, check whether the learner ID
exists or not, and return `membership.ErrIDNotFound` directly from the API
if the member ID is not found, to avoid the request being unnecessarily
delivered to raft.
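A minimal, self-contained sketch of that ordering; the helper and type names are illustrative, not the actual etcd member-promote code:
```go
package promote

import "errors"

// ErrIDNotFound stands in for membership.ErrIDNotFound.
var ErrIDNotFound = errors.New("membership: ID not found")

type server struct {
	appliedIndex, committedIndex uint64
	learners                     map[uint64]struct{}
}

// waitApplyCatchUp blocks until appliedIndex >= committedIndex (details omitted).
func (s *server) waitApplyCatchUp() {}

// mayPromoteLearner runs in the API layer, before anything is proposed to raft.
func (s *server) mayPromoteLearner(id uint64) error {
	// 1. wait so the membership view checked below reflects everything the
	//    leader has already committed
	s.waitApplyCatchUp()

	// 2. fail fast if the learner is unknown; the request never reaches raft
	if _, ok := s.learners[id]; !ok {
		return ErrIDNotFound
	}
	return nil
}
```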
Signed-off-by: Benjamin Wang <wachao@vmware.com>
The alarm list is the only exception that doesn't move consistent_index
forward. The reproduction steps are as simple as:
```
etcd --snapshot-count=5 &
for i in {1..6}; do etcdctl alarm list; done
kill -9 <etcd_pid>
etcd
```
Signed-off-by: Benjamin Wang <wachao@vmware.com>
This PR:
- moves the wrapping of appliers (due to Alarms) out of server.go into uber_applier.go
- clearly divides the application logic into a chain of (sketched below):
  a) 'WrapApply' (generic logic across all the methods)
  b) dispatcher (translation of Apply into a specific method like 'Put')
  c) chain of 'wrappers' around the specific methods (like Put)
- creates a new instance of the appliers when we do recovery (restore from snapshot).
The purpose is to make sure we control all the dependencies of the apply process, i.e.
we can supply e.g. a special instance of 'backend' to the application logic.
The PR also removes calls to applierV3base logic from server.go that are NOT part of 'application'.
The original idea was that read-only transactions and Range calls shared logic with Apply,
so they could call the appliers directly (while bypassing all the 'corrupt', 'quota' and 'auth' wrappers).
This PR moves all that logic to a separate file (which can later become a package of its own).
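A small, self-contained sketch of the a/b/c layering; the names loosely echo the PR (uberApplier, dispatch, per-method wrappers), but the code is illustrative rather than the actual etcd implementation:
```go
package apply

import "fmt"

// Request and Result stand in for etcd's internal raft request and apply result.
type Request struct{ Method, Key, Val string }
type Result struct {
	Resp string
	Err  error
}

// applier is the per-method interface, like etcd's applierV3.
type applier interface {
	Put(r *Request) Result
}

// backendApplier is the innermost applier that actually touches the backend.
type backendApplier struct{}

func (backendApplier) Put(r *Request) Result { return Result{Resp: "OK " + r.Key} }

// quotaApplier is one of the per-method wrappers (layer c).
type quotaApplier struct{ next applier }

func (q quotaApplier) Put(r *Request) Result {
	// a real wrapper would reject the request here if the quota were exceeded
	return q.next.Put(r)
}

// uberApplier owns the whole chain and is rebuilt on restore from snapshot.
type uberApplier struct{ a applier }

// Apply is layer (a): logic shared by every method (alarms, metrics, ...),
// followed by dispatch.
func (u uberApplier) Apply(r *Request) Result {
	return u.dispatch(r)
}

// dispatch is layer (b): translate the generic request into a specific call.
func (u uberApplier) dispatch(r *Request) Result {
	switch r.Method {
	case "Put":
		return u.a.Put(r)
	default:
		return Result{Err: fmt.Errorf("unknown method %q", r.Method)}
	}
}
```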
Usually the consistent_index should be greater than the index of the
latest snapshot with the .snap suffix. But for a snapshot coming from the
leader, the consistent_index should be equal to the snapshot index.
When clients have no permission to perform an operation, the apply
may fail. We should also move consistent_index forward
in this case, otherwise the consistent_index may be smaller than the
snapshot index.
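In code, that rule amounts to advancing the consistent index unconditionally at apply time; this is only a sketch with illustrative names, not the real apply path:
```go
package cindex

import "errors"

var errPermissionDenied = errors.New("auth: permission denied")

type entry struct {
	Index      uint64
	Authorized bool
}

type server struct {
	consistentIndex uint64
}

func (s *server) applyEntry(e entry) error {
	// Move consistent_index forward no matter how the apply below ends, so it
	// can never fall behind the index of a snapshot received from the leader.
	defer func() { s.consistentIndex = e.Index }()

	if !e.Authorized {
		// the apply fails, but the deferred index update above still runs
		return errPermissionDenied
	}
	// ... apply the entry to the store ...
	return nil
}
```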
Removed the fields consistentIdx and consistentTerm from struct EtcdServer,
and added applyingIndex and applyingTerm to struct consistentIndex in
package cindex. We may remove the two fields completely if we decide to
remove OnPreCommitUnsafe, and that will depend on the performance test
results.
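The resulting struct roughly takes the shape below; only applyingIndex and applyingTerm come from the message above, the rest of the layout is an assumption:
```go
// Sketch of the consistentIndex struct after the change (real code would use
// atomic accessors); only the two new fields are taken from the message above.
type consistentIndex struct {
	// consistentIndex/term are the values persisted to the backend.
	consistentIndex uint64
	term            uint64

	// applyingIndex/applyingTerm track the raft entry currently being applied;
	// they exist to support OnPreCommitUnsafe and may be removed together with
	// it, depending on performance test results.
	applyingIndex uint64
	applyingTerm  uint64
}
```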
Previously SetConsistentIndex() was called during the apply workflow,
but outside the db transaction. If a commit happened between SetConsistentIndex
and the following apply workflow, and etcd crashed for whatever reason right
after the commit, then etcd committed an incomplete transaction to db.
Eventually etcd runs into the data inconsistency issue.
In this commit, we move SetConsistentIndex into a txPostLockHook, so
it is executed inside the transaction lock.
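A simplified sketch of that mechanism; txPostLockHook is the name used above, while the batchTx shape and method names are illustrative assumptions:
```go
package backend

import "sync"

// batchTx is a stripped-down stand-in for the backend's batch transaction.
type batchTx struct {
	mu sync.Mutex

	// txPostLockHook runs right after the transaction lock is acquired. The
	// apply workflow registers SetConsistentIndex here, so the index update
	// and the applied changes commit in the same transaction.
	txPostLockHook func()
}

// setTxPostLockHook installs the hook while holding the tx lock.
func (t *batchTx) setTxPostLockHook(hook func()) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.txPostLockHook = hook
}

// lockInsideApply replaces a plain Lock() on the apply path.
func (t *batchTx) lockInsideApply() {
	t.mu.Lock()
	if t.txPostLockHook != nil {
		t.txPostLockHook()
	}
}

func (t *batchTx) unlock() { t.mu.Unlock() }
```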
When performing the downgrade operation, users can confirm whether each member
is ready to be downgraded using the field 'storageVersion'. If it's equal to the
'target version' in the downgrade command, then the member is ready to be downgraded;
otherwise, the etcd member is still processing the db file.
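For illustration, a client-side readiness check could look like the sketch below; it assumes the maintenance Status response surfaces the 'storageVersion' field described above (the Go field name StorageVersion is an assumption):
```go
package downgrade

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// readyForDowngrade reports whether every member's storage version already
// matches the downgrade target.
func readyForDowngrade(ctx context.Context, cli *clientv3.Client, endpoints []string, target string) (bool, error) {
	for _, ep := range endpoints {
		resp, err := cli.Status(ctx, ep)
		if err != nil {
			return false, err
		}
		// StorageVersion is assumed to carry the member's 'storageVersion'.
		if resp.StorageVersion != target {
			// this member is still processing (migrating) its db file
			return false, nil
		}
	}
	return true, nil
}
```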