1939 Commits

Author SHA1 Message Date
Benjamin Wang
b2b7b9d535
Merge pull request #14423 from serathius/one_member_data_loss_raft_3_4
[release-3.4] fix the potential data loss for clusters with only one member
2022-09-06 03:29:45 +08:00
Benjamin Wang
119e4dda19 fix the potential data loss for clusters with only one member
For a cluster with only one member, the raft always send identical
unstable entries and committed entries to etcdserver, and etcd
responds to the client once it finishes (actually partially) the
applying workflow.

When the client receives the response, it doesn't mean etcd has already
successfully saved the data, including BoltDB and WAL, because:
   1. etcd commits the boltDB transaction periodically instead of on each request;
   2. etcd saves WAL entries in parallel with applying the committed entries.
Accordingly, it may run into a situation of data loss when the etcd crashes
immediately after responding to the client and before the boltDB and WAL
successfully save the data to disk.
Note that this issue can only happen for clusters with only one member.

For clusters with multiple members, it isn't an issue, because etcd will
not commit & apply the data before it being replicated to majority members.
When the client receives the response, it means the data must have been applied.
It further means the data must have been committed.
Note: for clusters with multiple members, the raft will never send identical
unstable entries and committed entries to etcdserver.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
2022-09-05 14:15:47 +02:00
Vladimir Sokolov
38342e88da etcdserver: nil-logger issue fix for version 3.4
In v3.5 it is assumed that the logger should not be nil, however it is
still a case in v3.4. The PR targeted to v3.5 was backported to 3.4 and
that's why it's possible to get panic on nil logger in 3.4. This commit
fixed this issue.

Fixes #14402

Signed-off-by: Vladimir Sokolov <vsvastey@gmail.com>
2022-09-03 04:34:03 +03:00
Benjamin Wang
cc1b0e6a44 do not get previous K/V for create event
Signed-off-by: Benjamin Wang <wachao@vmware.com>
2022-08-01 13:11:46 +08:00
Benjamin Wang
f53db9b246 etcdserver: resend ReadIndex request on empty apply request
Backport https://github.com/etcd-io/etcd/pull/12795 to 3.4

Signed-off-by: Benjamin Wang <wachao@vmware.com>
2022-07-25 09:21:31 +08:00
Benjamin Wang
e2b36f8879
Merge pull request #14253 from serathius/checkpoints-fix-3.4
[3.4] Checkpoints fix 3.4
2022-07-22 16:56:17 +08:00
Marek Siarkowicz
8f4735dfd4 server: Require either cluster version v3.6 or --experimental-enable-lease-checkpoint-persist to persist lease remainingTTL
To avoid inconsistant behavior during cluster upgrade we are feature
gating persistance behind cluster version. This should ensure that
all cluster members are upgraded to v3.6 before changing behavior.

To allow backporting this fix to v3.5 we are also introducing flag
--experimental-enable-lease-checkpoint-persist that will allow for
smooth upgrade in v3.5 clusters with this feature enabled.

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
2022-07-22 10:28:29 +02:00
Benjamin Wang
f18d074866
Merge pull request #14254 from ramses/backport-13435
[3.4] Backport: non mutating requests pass through quotaKVServer when NOSPACE
2022-07-22 09:27:00 +08:00
vivekpatani
e4deb09c9e etcdserver,pkg: remove temp files in snap dir when etcdserver starting
- Backporting: https://github.com/etcd-io/etcd/pull/12846
- Reference: https://github.com/etcd-io/etcd/issues/14232

Signed-off-by: vivekpatani <9080894+vivekpatani@users.noreply.github.com>
2022-07-21 15:50:27 -07:00
Chao Chen
96f69dee47 Backport: non mutating requests pass through quotaKVServer when NOSPACE
This is a backport of https://github.com/etcd-io/etcd/pull/13435 and is
part of the work for 3.4.20
https://github.com/etcd-io/etcd/issues/14232.

The original change had a second commit that modifies a changelog file.
The 3.4 branch does not include any changelog file, so that part was not
cherry-picked.

Local Testing:

- `make build`
- `make test`

Both succeed.

Signed-off-by: Ramsés Morales <ramses@gmail.com>
2022-07-21 15:06:09 -07:00
Benjamin Wang
6071b1c523 Support configuring MaxConcurrentStreams for http2
Backport https://github.com/etcd-io/etcd/pull/14219 to 3.4

Signed-off-by: Benjamin Wang <wachao@vmware.com>
2022-07-21 14:25:29 +08:00
Chao Chen
864006b72d print out applied index as uint64
Signed-off-by: Chao Chen <chaochn@amazon.com>
2022-07-20 12:07:51 -07:00
Pierre Zemb
3f9fba9112 etcdserver: add more detailed traces on linearized reading
To improve debuggability of `agreement among raft nodes before
linearized reading`, we added some tracing inside
`linearizableReadLoop`.

This will allow us to know the timing of `s.r.ReadIndex` vs
`s.applyWait.Wait(rs.Index)`.

Signed-off-by: Chao Chen <chaochn@amazon.com>
2022-07-20 12:07:51 -07:00
Benjamin Wang
07d2b1d626 support linearizable renew lease for 3.4
Cherry pick https://github.com/etcd-io/etcd/pull/13932 to 3.4.

When etcdserver receives a LeaseRenew request, it may be still in
progress of processing the LeaseGrantRequest on exact the same
leaseID. Accordingly it may return a TTL=0 to client due to the
leaseID not found error. So the leader should wait for the appliedID
to be available before processing client requests.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
2022-07-19 13:34:55 +08:00
Benjamin Wang
f036529b5d Backport two lease related bug fixes to 3.4
The first bug fix is to resolve the race condition between goroutine
and channel on the same leases to be revoked. It's a classic mistake
in using Golang channel + goroutine. Please refer to
https://go.dev/doc/effective_go#channels

The second bug fix is to resolve the issue that etcd lessor may
continue to schedule checkpoint after stepping down the leader role.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
2022-06-24 09:09:40 +08:00
Benjamin Wang
1abf085cfb fix all the pipeline failues for release 3.4
Items resolved:
1. fix the vet error: possible misuse of reflect.SliceHeader;
2. fix the vet error: call to (*T).Fatal from a non-test goroutine;
3. bump package golang.org/x/crypto, net and sys;
4. bump boltdb from 1.3.3 to 1.3.6;
5. remove the vendor directory;
6. remove go 1.12.17 and 1.15.15, add go 1.16.15 into pipeline;
7. bump go version to 1.16 in go.mod;
8. fix the issue: compile: version go1.16.15 does not match go tool version go1.17.11,
   refer to https://github.com/actions/setup-go/issues/107;
9. fix data race on compactMainRev and watcherGauge;
10. fix test failure for TestLeasingTxnOwnerGet in cluster_proxy mode.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
2022-06-22 05:28:45 +08:00
Hitoshi Mitake
757a8e8f5b *: implement a retry logic for auth old revision in the client 2022-04-29 23:46:24 +09:00
Chao Chen
04d47a93f9 backport from #13467 exclude the same alarm type activated by multiple peers 2021-11-12 14:17:14 -08:00
Chao Chen
dbde4f2d5e Backport-3.4 exclude alarms from health check conditionally 2021-05-04 10:37:12 -07:00
Chao Chen
c4eb81af99 Backport-3.4 etcdserver/util.go: reduce memory when logging range requests 2021-04-22 15:07:44 -07:00
Lili Cosic
0b7e4184e8 etcdserver,wal: Convert int to string using rune() 2021-04-19 11:18:13 +02:00
Chris Wedgwood
656dc63eab etcdserver: fix incorrect metrics generated when clients cancel watches
Manual cherry-pick of 9571325fe85173a60c89d6ac6ce3491c7b1ec7a4 for
release-3.4.
2021-03-31 22:59:29 -07:00
Piotr Tabor
30799c97be
Merge pull request #12815 from dbavatar/release-3.4-peervalidation
etcdserver: Fix PeerURL validation
2021-03-30 12:54:32 +02:00
Sam Batschelet
9aeabe447d server: Added config parameter experimental-warning-apply-duration
Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
2021-03-03 12:14:30 -05:00
Chao Chen
f27ef4d343 [Backport-3.4] etcdserver/api/etcdhttp: log successful etcd server side health check in debug level
ref. #12677
ref. 0b9cfa8677
2021-02-08 21:44:44 -08:00
galal-hussein
3019246742 etcdserver: add ConfChangeAddLearnerNode to the list of config changes
To fix a panic that happens when trying to get ids of etcd members in
force new cluster mode, the issue happen if the cluster previously had
etcd learner nodes added to the cluster

Fixes #12285
2020-09-14 17:50:57 +02:00
Jordan Liggitt
b8878eac45 etcdserver: Avoid panics logging slow v2 requests in integration tests 2020-08-19 11:30:39 -04:00
Gyuho Lee
e71e0c5c88
Merge pull request #12226 from jingyih/fix_backport_PR12216
*: add plog logging to the backport of PR12216
2020-08-18 08:48:09 -07:00
Gyuho Lee
299e0f17aa Revert "etcdserver/api/v3rpc: "MemberList" never return non-empty ClientURLs"
This reverts commit 0372cfc7ab1052ac616ca34551f83657a8fd2e3e.
2020-08-18 08:45:38 -07:00
jingyih
75d5e78d1f *: fix backport of PR12216
Fix bugs introduced in commit c60dabf
2020-08-16 15:01:18 +08:00
jingyih
c60dabf2f3 *: add experimental flag for watch notify interval
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2020-08-15 10:24:25 -07:00
Gyuho Lee
008074187c etcdserver: add OS level FD metrics
Similar counts are exposed via Prometheus.
This adds the one that are perceived by etcd server.

e.g.

os_fd_limit 120000
os_fd_used 14
process_cpu_seconds_total 0.31
process_max_fds 120000
process_open_fds 17

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2020-08-12 18:38:35 -07:00
Gyuho Lee
0372cfc7ab etcdserver/api/v3rpc: "MemberList" never return non-empty ClientURLs
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>

cr https://code.amazon.com/reviews/CR-29712724
2020-07-16 16:29:51 -07:00
Yuchen Zhou
ed28c768a3 etcdserver: change protobuf field type from int to int64 (#12000) 2020-07-08 10:21:10 +03:00
cfc4n
4488595e05 auth: Customize simpleTokenTTL settings.
see https://github.com/etcd-io/etcd/issues/11978 for more detail.
2020-06-25 19:58:26 +08:00
cfc4n
ee963470f4 etcdserver:FDUsage set ticker to 10 minute from 5 seconds. This ticker will check File Descriptor Requirements ,and count all fds in used. And recorded some logs when in used >= limit/5*4. Just recorded message. If fds was more than 10K,It's low performance due to FDUsage() works. So need to increase it.
see https://github.com/etcd-io/etcd/issues/11969 for more detail.
2020-06-24 13:28:40 +08:00
Gyuho Lee
7adbfa1144
Merge pull request #12038 from spzala/automated-cherry-pick-of-#11608-upstream-release-3.4
Automated cherry pick of #11608
2020-06-21 19:19:50 -07:00
Hitoshi Mitake
963b242846 etcdserver: don't let InternalAuthenticateRequest have password 2020-06-21 19:18:18 -04:00
Sahdev P. Zala
9a24f73f7b Discovery: do not allow passing negative cluster size
When an etcd instance attempts to perform service discovery, if a
cluster size with negative value  is provided, the etcd instance
will panic without recovery because of
2020-06-21 18:00:35 -04:00
David Crawshaw
78f67988aa
etcdserver, et al: add --unsafe-no-fsync flag
This makes it possible to run an etcd node for testing and development
without placing lots of load on the file system.

Fixes #11930.

Signed-off-by: David Crawshaw <crawshaw@tailscale.com>
2020-06-04 20:19:28 -07:00
Gyuho Lee
cfe37de6c0 rafthttp: log snapshot download duration
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2020-05-20 11:37:01 -07:00
Gyuho Lee
a668adba78 rafthttp: improve snapshot send logging
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2020-05-18 11:39:24 -07:00
Gyuho Lee
9bad82fee5 *: make sure snapshot save downloads SHA256 checksum
ref. https://github.com/etcd-io/etcd/pull/11896

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2020-05-17 17:38:42 -07:00
Gyuho Lee
f1ea03a7c8 etcdserver/api/snap: exclude orphaned defragmentation files in snapNames
ref. https://github.com/etcd-io/etcd/pull/11900

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2020-05-17 14:21:02 -07:00
Ted Yu
4079deadb4 etcdserver: continue releasing snap db in case of error
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2020-05-17 14:15:26 -07:00
Viacheslav Biriukov
87fc3c9e57 etcdserver,wal: fix inconsistencies in WAL and snapshot
etcdserver/*, wal/*: changes to snapshots and wal logic
etcdserver/*: changes to snapshots and wal logic to fix #10219
etcdserver/*, wal/*: add Sync method
etcdserver/*, wal/*: find valid snapshots by cross checking snap files and wal snap entries
etcdserver/*, wal/*:Add comments, clean up error messages and tests
etcdserver/*, wal/*: Remove orphaned .snap.db files during Release

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2020-05-15 08:40:09 -07:00
Jingyi Hu
f1eca4e1fa
Merge pull request #11752 from tangcong/automated-cherry-pick-of-#11652-#11670-#11710-origin-release-3.4
Automated cherry pick of #11652 #11670 #11710
2020-04-10 23:21:45 +08:00
tangcong
eb80716532 etcdserver: print warn log when failed to apply request 2020-04-09 09:33:40 +08:00
tangcong
347c8dac3b *: fix auth revision corruption bug 2020-04-09 09:33:36 +08:00
Changxin Miao
9c8554573f etcdserver: watch stream got closed once one request is not permitted (#11708) 2020-04-06 07:06:57 -07:00