This commit introduces the max inflight bytes setting at the Config level, and
tests that raft flow control honours it.
Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
This commit plumbs the max total byte size of the Inflights type higher up the
stack to the ProgressTracker.
Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
The Inflights type has limits on the message size and the number of inflight
messages. However, a single large entry that exceeds the size limit can still
be sent. In combination with the max messages count limit, many large messages
can be sent in a row and overflow the receiver. In effect, the "max" values act
as "target" rather than hard limits.
This commit adds an additional soft limit on the total size of inflight
messages, which catches such situations and prevents the receiver overflow.
Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
This commit makes the rarely used `raftpb.Message.Snapshot` field nullable.
In doing so, it reduces the memory size of a `raftpb.Message` message from
264 bytes to 128 bytes — a 52% reduction in size.
While this commit does not change the protobuf encoding, it does change
how that encoding is used. `(gogoproto.nullable) = false` instruct the
generated proto marshaling logic to always encode a value for the field,
even if that value is empty. `(gogoproto.nullable) = true` instructs the
generated proto marshaling logic to omit an encoded value for the field
if the field is nil.
This raises compatibility concerns in both directions. Messages encoded
by new binary versions without a `Snapshot` field will be decoded as an
empty field by old binary versions. In other words, old binary versions
can't tell the difference. However, messages encoded by old binary versions
with an empty Snapshot field will be decoded as a non-nil, empty field by
new binary versions. As a result, new binary versions need to be prepared
to handle such messages.
While Message.Snapshot is not intentionally part of the external interface
of this library, it was possible for users of the library to access it and
manipulate it. As such, this change may be considered a breaking change.
Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Make the field name and comment clearer on the fact that it's used both in
StateProbe and StateReplicate. The old name ProbeSent was slightly confusing,
and also triggered thinking that it's used only in StateProbe.
Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
When Inflights to a particular node is full, i.e. MaxInflightMsgs for the
append messages flow is saturated, it is still necessary to continue sending
MsgApp to ensure progress. Currently this is achieved by "forgetting" the first
in-flight message in the window, which frees up quota for one new MsgApp.
This new message is constructed in such a way that it potentially has multiple
entries, or a large entry. The effect of this is that the in-flight limitations
can be exceeded arbitrarily, for as long as the flow to this node continues
being saturated. In particular, if a follower is stuck, the leader will keep
sending entries to it.
This commit makes the MsgApp empty when Inflights is saturated, and prevents
the described leakage of Entries to slow followers.
Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
Previously, Progress update on MsgApp send was scattered across raft.go and
tracker/progress.go. This commit better encapsulates this logic in the Progress
type.
Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
- avoid large indented blocks, leave the main block unindented
- declare pb.Message inlined in the sending call
Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
These are nonsensical and a network implementation is not required
to handle them correctly, so panic instead of sending them out.
Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
This commit simplifies the logic added in 37c7e4d to auto-leave joint
configurations. It does so by making the following adjustments to the
code:
- remove the `oldApplied <= r.pendingConfIndex` condition. This does
not seem necessary. When a node first attempts to auto-leave a joint
config, it will bump `r.pendingConfIndex` when proposing. In cases
where `oldApplied >= r.pendingConfIndex`, the proposal must have
already been applied. Reviewers should double check this.
- use raft.Step instead of custom proposal code. This code was already
present in stepLeader, so there was no reason to duplicate it. This
would have avoided bugs like the one we fixed in #14538.
- use `confChangeToMsg` to generate message, to centralize the creation
of all `MsgProp{EntryConfChange}` messages.
Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
This commit ensures that the raft leader eagerly broadcasts a MsgApp to
each follower when initiating an automatic transition out of a joint
configuration. This had been missed previously, which could lead to
delayed completion of an auto-transition.
Signed-off-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
This reverts commit d73a986e4edb15ef9dbfc994f1cbf5e96694d877, which
was added only for benchmarking purposes.
Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
This makes the alternative option of implementing the leader's self-ack
of entry append the default.
Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
See https://github.com/etcd-io/etcd/issues/14370.
When run in a single-voter configuration, prior to this PR
raft would emit `HardState`s that would emit a proposed `Entry`
simultaneously in `CommittedEntries` and `Entries`.
To be correct, this requires users of the raft library to enforce an
ordering between appending to the log and notifying the client about
`CommittedEntries` also present in `Entries`. This was easy to miss.
Walk back this behavior to arrive at a simpler contract: what's
emitted in `CommittedEntries` is truly committed, i.e. present
in stable storage on a quorum of voters.
This in turn pessimizes the single-voter case: rather than fully
handling an `Entry` in just one `Ready`, now two are required,
and in particular one has to do extra work to save on allocations.
We accept this as a good tradeoff, since raft primarily serves
multi-voter configurations.
Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
RecentActive is now initialized to true in `becomeLeader`. Both
configuration changes and CheckQuorum make sure not to break this,
so we now now that the leader is always RecentActive.
Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Thanks to this the logs:
- are automatically printed if the test fails.
- are in pretty consistent format.
- are annotated by 'member' information of the cluster emitting them.
Side changes:
- Set propert default got DefaultWarningApplyDuration (used to be '0')
- Name the members based on their 'place' on the list (as opposed to
'random')
- Add a large detailed comment about the use and necessity of
both the follower and leader probing optimization
- fix the log message in stepLeader that previously mixed up the
log term for the rejection and the index of the append
- improve the test via subtests
- add some verbiage in findConflictByTerm around first index
* raft: check conf change before campaign
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
* raft: extract hup function
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
* raft: check pending conf change for transferleader
Signed-off-by: Jay Lee <BusyJayLee@gmail.com>
This change makes the etcd package compatible with the existing Go
ecosystem for module versioning.
Used this tool to update package imports:
https://github.com/KSubedi/gomove
Apps typically maintain the raft configuration as part of the state
machine. As a result, they want to be able to reject configuration change
entries at apply time based on the state on which the entry is supposed
to be applied. When this happens, the app should not call
ApplyConfChange, but the comments did not make this clear.
As a result, it was tempting to pass an empty pb.ConfChange or it's V2
version instead of not calling ApplyConfChange.
However, an empty V1 or V2 proto aren't noops when the configuration is
joint: an empty V1 change is treated internally as a single
configuration change for NodeID zero and will cause a panic when applied
in a joint state. An empty V2 proto is treated as a signal to leave a
joint state, which means that the app's config and raft's would diverge.
The comments updated in this commit now ask users to not call
ApplyConfState when they reject a conf change. Apps that never use joint
consensus can keep their old behavior since the distinction only matters
when in a joint state, but we don't want to encourage that.
The code doing so was undertested and buggy: it would launch multiple
attempts to transition out when the conf change was not the last element
in the log.
This commit fixes the problem and adds a regression test. It also
reworks the code to handle a former untested edge case, in which the
auto-transition append is refused. This can't happen any more with the
current version of the code because this proposal has size zero and is
special cased in increaseUncommittedSize. Last but not least, the
auto-leave proposal now also bumps pendingConfIndex, which was not done
previously due to an oversight.
Verifiy the behavior in various v1 and v2 conf change operations.
This also includes various fixups, notably it adds protection
against transitioning in and out of new configs when this is not
permissible.
There are more threads to pull, but those are left for future commits.
When the leader applied a new configuration that added voters, it would
not immediately probe these voters, delaying when they would be caught
up.
I noticed this while writing an interaction-driven test, which has now
been cleaned up and completed.