424 Commits

Author SHA1 Message Date
Pavel Kalinnikov
68af01ca6e raft: add MaxInflightBytes to Config
This commit introduces the max inflight bytes setting at the Config level, and
tests that raft flow control honours it.

Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
2022-11-13 23:05:16 +01:00
Pavel Kalinnikov
8c9c557d85 raft: factor out payloadsSize helper
Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
2022-11-13 23:05:16 +01:00
Pavel Kalinnikov
7bda0d7773 raft/tracker: add MaxInflightBytes to ProgressTracker
This commit plumbs the max total byte size of the Inflights type higher up the
stack to the ProgressTracker.

Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
2022-11-13 23:05:16 +01:00
Pavel Kalinnikov
bfb7b16f4f raft/tracker: add byte size limit to Inflights type
The Inflights type has limits on the message size and the number of inflight
messages. However, a single large entry that exceeds the size limit can still
be sent. In combination with the max messages count limit, many large messages
can be sent in a row and overflow the receiver. In effect, the "max" values act
as "target" rather than hard limits.

This commit adds an additional soft limit on the total size of inflight
messages, which catches such situations and prevents the receiver overflow.

Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
2022-11-13 23:05:16 +01:00
Nathan VanBenschoten
0f9d7a4f95 raft: make Message.Snapshot nullable, halve struct size
This commit makes the rarely used `raftpb.Message.Snapshot` field nullable.
In doing so, it reduces the memory size of a `raftpb.Message` message from
264 bytes to 128 bytes — a 52% reduction in size.

While this commit does not change the protobuf encoding, it does change
how that encoding is used. `(gogoproto.nullable) = false` instruct the
generated proto marshaling logic to always encode a value for the field,
even if that value is empty. `(gogoproto.nullable) = true` instructs the
generated proto marshaling logic to omit an encoded value for the field
if the field is nil.

This raises compatibility concerns in both directions. Messages encoded
by new binary versions without a `Snapshot` field will be decoded as an
empty field by old binary versions. In other words, old binary versions
can't tell the difference. However, messages encoded by old binary versions
with an empty Snapshot field will be decoded as a non-nil, empty field by
new binary versions. As a result, new binary versions need to be prepared
to handle such messages.

While Message.Snapshot is not intentionally part of the external interface
of this library, it was possible for users of the library to access it and
manipulate it. As such, this change may be considered a breaking change.

Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
2022-11-09 17:35:52 +00:00
Pavel Kalinnikov
1ea13494eb raft/tracker: rename and comment MsgApp paused field
Make the field name and comment clearer on the fact that it's used both in
StateProbe and StateReplicate. The old name ProbeSent was slightly confusing,
and also triggered thinking that it's used only in StateProbe.

Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
2022-11-08 22:21:39 +00:00
Pavel Kalinnikov
4969aa81ae raft: send empty appends when replication is paused
When Inflights to a particular node is full, i.e. MaxInflightMsgs for the
append messages flow is saturated, it is still necessary to continue sending
MsgApp to ensure progress. Currently this is achieved by "forgetting" the first
in-flight message in the window, which frees up quota for one new MsgApp.

This new message is constructed in such a way that it potentially has multiple
entries, or a large entry. The effect of this is that the in-flight limitations
can be exceeded arbitrarily, for as long as the flow to this node continues
being saturated. In particular, if a follower is stuck, the leader will keep
sending entries to it.

This commit makes the MsgApp empty when Inflights is saturated, and prevents
the described leakage of Entries to slow followers.

Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
2022-11-08 22:21:39 +00:00
Pavel Kalinnikov
3bc3d2071e raft: extract Progress update on MsgApp to a method
Previously, Progress update on MsgApp send was scattered across raft.go and
tracker/progress.go. This commit better encapsulates this logic in the Progress
type.

Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
2022-11-08 22:21:38 +00:00
Pavel Kalinnikov
d5ac7b833f raft: cleanup maybeSendAppend method
- avoid large indented blocks, leave the main block unindented
- declare pb.Message inlined in the sending call

Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
2022-11-08 22:21:38 +00:00
Benjamin Wang
f64bed6033
Merge pull request #14698 from ahrtr/raft_warn_20221107
raft: change the log from debug to warning when uncommitted size exceeds threshold
2022-11-07 19:57:33 +08:00
Benjamin Wang
3e07097d77
Merge pull request #14545 from nvanbenschoten/nvanbenschoten/simplifyAutoLeave
raft: simplify auto-leave joint config on entry application logic
2022-11-07 17:20:26 +08:00
Benjamin Wang
a671e3ebd1 raft: change the log from debug to warning when uncommitted size exceeds max threshold
Signed-off-by: Benjamin Wang <wachao@vmware.com>
2022-11-07 17:17:48 +08:00
王霄霄
aac5feec94 raft: remove duplicate letter in comment.
Signed-off-by: Wang Xiaoxiao 1141195807@qq.com
Signed-off-by: 王霄霄 <1141195807@qq.com>
2022-10-22 19:13:40 +08:00
Nathan VanBenschoten
419ee8a9c6 raft: panic on self-addressed messages
These are nonsensical and a network implementation is not required
to handle them correctly, so panic instead of sending them out.

Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
2022-10-06 20:25:07 -04:00
Nathan VanBenschoten
c50e728518 raft: simplify auto-leave joint config on entry application logic
This commit simplifies the logic added in 37c7e4d to auto-leave joint
configurations. It does so by making the following adjustments to the
code:

- remove the `oldApplied <= r.pendingConfIndex` condition. This does
  not seem necessary. When a node first attempts to auto-leave a joint
  config, it will bump `r.pendingConfIndex` when proposing. In cases
  where `oldApplied >= r.pendingConfIndex`, the proposal must have
  already been applied. Reviewers should double check this.
- use raft.Step instead of custom proposal code. This code was already
  present in stepLeader, so there was no reason to duplicate it. This
  would have avoided bugs like the one we fixed in #14538.
- use `confChangeToMsg` to generate message, to centralize the creation
  of all `MsgProp{EntryConfChange}` messages.

Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
2022-10-03 02:11:56 -04:00
Nathan VanBenschoten
bd34388721 raft: broadcast MsgApp on auto-leave joint config proposal
This commit ensures that the raft leader eagerly broadcasts a MsgApp to
each follower when initiating an automatic transition out of a joint
configuration. This had been missed previously, which could lead to
delayed completion of an auto-transition.

Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
2022-09-29 12:33:20 -04:00
Benjamin Wang
31d9664cb5
Merge pull request #14413 from tbg/raft-single-voter
raft: don't emit unstable CommittedEntries
2022-09-22 08:43:37 +08:00
Tobias Grieger
9ad36eecab fixup! address comments
Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
2022-09-20 09:01:42 +02:00
Tobias Grieger
3c3e30a30e Revert "raft: directly update leader in advance"
This reverts commit d73a986e4edb15ef9dbfc994f1cbf5e96694d877, which
was added only for benchmarking purposes.

Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
2022-09-20 09:01:42 +02:00
Tobias Grieger
67c3522893 raft: directly update leader in advance
This makes the alternative option of implementing the leader's self-ack
of entry append the default.

Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
2022-09-20 09:01:42 +02:00
Tobias Grieger
169f4c3cc7 raft: don't emit unstable CommittedEntries
See https://github.com/etcd-io/etcd/issues/14370.

When run in a single-voter configuration, prior to this PR
raft would emit `HardState`s that would emit a proposed `Entry`
simultaneously in `CommittedEntries` and `Entries`.

To be correct, this requires users of the raft library to enforce an
ordering between appending to the log and notifying the client about
`CommittedEntries` also present in `Entries`. This was easy to miss.

Walk back this behavior to arrive at a simpler contract: what's
emitted in `CommittedEntries` is truly committed, i.e. present
in stable storage on a quorum of voters.

This in turn pessimizes the single-voter case: rather than fully
handling an `Entry` in just one `Ready`, now two are required,
and in particular one has to do extra work to save on allocations.

We accept this as a good tradeoff, since raft primarily serves
multi-voter configurations.

Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
2022-09-20 08:59:37 +02:00
Tobias Grieger
3ad363d070 raft: always mark leader as RecentActive
RecentActive is now initialized to true in `becomeLeader`. Both
configuration changes and CheckQuorum make sure not to break this,
so we now now that the leader is always RecentActive.

Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
2022-09-20 08:59:37 +02:00
demoManito
a9c3d56508 etcd: remove redundant type conversion
Signed-off-by: demoManito <1430482733@qq.com>
2022-09-20 11:26:02 +08:00
Abirdcfly
08a9d1da07
chore: remove duplicate word in comments
Signed-off-by: Abirdcfly <fp544037857@gmail.com>
2022-08-27 13:39:48 +08:00
howz97
f9c9bfa44c fix comment in raft.go 2022-04-02 14:27:33 +08:00
dengziming
a286f5bb99 MINOR: Fix typos(hearbeat -> heartbeat) 2021-08-07 11:41:13 +08:00
Lili Cosic
ddd390af01 raft/raft.go: Log unhandled errors 2021-06-02 11:41:26 +02:00
wpedrak
758ff0163c raft: postpone MsgReadIndex until first commit in the term
Fixes #12680
2021-03-23 12:28:42 +01:00
Piotr Tabor
87258efd90 Integration tests: Use zaptest.Logger based testing.TB
Thanks to this the logs:
  - are automatically printed if the test fails.
  - are in pretty consistent format.
  - are annotated by 'member' information of the cluster emitting them.

Side changes:
  - Set propert default got DefaultWarningApplyDuration (used to be '0')
  - Name the members based on their 'place' on the list (as opposed to
'random')
2021-03-09 18:19:51 +01:00
Tobias Grieger
73c50b869a
Merge pull request #12637 from BusyJay/check-outgoingvoters-when-restoring
raft: check `VotersOutgoing` for snapshot
2021-02-16 09:43:08 +01:00
Tobias Grieger
c1e8d3a63f Clarify documentation of probing
- Add a large detailed comment about the use and necessity of
  both the follower and leader probing optimization
- fix the log message in stepLeader that previously mixed up the
  log term for the rejection and the index of the append
- improve the test via subtests
- add some verbiage in findConflictByTerm around first index
2021-02-15 09:47:18 +01:00
qupeng
6828517965 raft: implement fast log rejection
Signed-off-by: qupeng <qupeng@pingcap.com>
2021-02-10 15:48:32 -05:00
Jay Lee
f947c815d0
raft: check VotersOutgoing for snapshot
Close #12631.

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
2021-01-21 16:09:37 +08:00
Piotr Tabor
5472b3336b
Merge pull request #12525 from sakateka/remove_raft.peers
raft tests: Remove Config.peers and Config.learners
2021-01-19 16:00:54 +01:00
Sergey Kacheev
ccfd00f687
raft: specify voters and learners via snapshot 2021-01-16 13:03:47 +07:00
Piotr Tabor
bf6f173d5e Document Raft.send method.
The change makes it explicit that sending messages does not happen
immidietely and is subject to proper persist & then send protocol
on the application side. See:

https://github.com/etcd-io/etcd/issues/12589#issuecomment-752867024

for more context.
2021-01-15 12:35:58 +01:00
Piotr Tabor
e62417297d *: Rename of imports of raft (as its now a module)
% find -name '*.go' -o -name '*.md' -o -name '*.sh' | xargs sed -i --follow-symlinks 's|etcd/v3/raft|etcd/raft/v3|g'
2020-10-16 13:58:18 +02:00
Jay
26b89fd418
raft: don't campaign with pending snapshot (#12163)
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
2020-07-26 00:04:46 -07:00
Jay
d0e4fe56a5
raft: check pending conf change before campaign (#12134)
* raft: check conf change before campaign

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

* raft: extract hup function

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>

* raft: check pending conf change for transferleader

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
2020-07-22 17:04:48 -07:00
Jay
cc656718fa
raft: correct pendingConfIndex check for AutoLeave (#12137)
Close #12136

Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
2020-07-20 16:49:22 -07:00
Zhihong Yu
7cc2f8a411
raft: break out of nested loop when id is found (#11870)
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-05-12 16:59:22 -07:00
Brandon Philips
96cce208c2 go.mod: use go.etcd.io/etcd/v3 versioning
This change makes the etcd package compatible with the existing Go
ecosystem for module versioning.

Used this tool to update package imports:
  https://github.com/KSubedi/gomove
2020-04-28 00:57:35 +00:00
Fullstop000
7eae024ead
raft: only redirect msg produced by own node (#11466)
Signed-off-by: Fullstop000 <fullstop1005@gmail.com>
2020-04-06 20:27:46 -07:00
qupeng
6f850a65a1
raft: cleanup read index code (#11528)
Signed-off-by: qupeng <qupeng@pingcap.com>
2020-03-03 09:20:25 -08:00
Tobias Schottdorf
0544f33248 raft: clarify ApplyConfChange contract for rejected conf changes
Apps typically maintain the raft configuration as part of the state
machine. As a result, they want to be able to reject configuration change
entries at apply time based on the state on which the entry is supposed
to be applied. When this happens, the app should not call
ApplyConfChange, but the comments did not make this clear.

As a result, it was tempting to pass an empty pb.ConfChange or it's V2
version instead of not calling ApplyConfChange.

However, an empty V1 or V2 proto aren't noops when the configuration is
joint: an empty V1 change is treated internally as a single
configuration change for NodeID zero and will cause a panic when applied
in a joint state. An empty V2 proto is treated as a signal to leave a
joint state, which means that the app's config and raft's would diverge.

The comments updated in this commit now ask users to not call
ApplyConfState when they reject a conf change. Apps that never use joint
consensus can keep their old behavior since the distinction only matters
when in a joint state, but we don't want to encourage that.
2020-02-25 12:45:45 +01:00
Tobias Schottdorf
37c7e4d1d8 raft: fix auto-transitioning out of joint config
The code doing so was undertested and buggy: it would launch multiple
attempts to transition out when the conf change was not the last element
in the log.

This commit fixes the problem and adds a regression test. It also
reworks the code to handle a former untested edge case, in which the
auto-transition append is refused. This can't happen any more with the
current version of the code because this proposal has size zero and is
special cased in increaseUncommittedSize. Last but not least, the
auto-leave proposal now also bumps pendingConfIndex, which was not done
previously due to an oversight.
2020-02-25 12:35:51 +01:00
qupeng
eaa0612e02 raft: abort leader transferring if the target is demoted (#11417)
Signed-off-by: qupeng <qupeng@pingcap.com>
2019-12-20 12:07:52 +08:00
Wine93
5f42161750 raft: fixed some typos and simplify minor logic 2019-08-25 04:46:29 +00:00
Tobias Schottdorf
306e75a96f raft: add a batch of interaction-driven conf change tests
Verifiy the behavior in various v1 and v2 conf change operations.
This also includes various fixups, notably it adds protection
against transitioning in and out of new configs when this is not
permissible.

There are more threads to pull, but those are left for future commits.
2019-08-16 09:38:44 +02:00
Tobias Schottdorf
4e19150676 raft: proactively probe newly added followers
When the leader applied a new configuration that added voters, it would
not immediately probe these voters, delaying when they would be caught
up.

I noticed this while writing an interaction-driven test, which has now
been cleaned up and completed.
2019-08-14 20:53:34 +02:00