The suggested pattern for Raft proposals is that they be retried
periodically until they succeed. This turns out to be an issue
when a leader cannot commit entries because the leader will continue
to append re-proposed entries to its log without committing anything.
This can result in the uncommitted tail of a leader's log growing
without bound until it is able to commit entries.
This change adds a safeguard to protect against this case, in which a
leader's log can grow without bound during loss-of-quorum scenarios.
It does so by introducing a new, optional `MaxUncommittedEntriesSize`
configuration. This config limits the maximum aggregate size of uncommitted
entries that may be appended to a leader's log. Once this limit
is exceeded, proposals will begin to return ErrProposalDropped
errors.
See cockroachdb/cockroach#27772
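A minimal sketch of how an application might opt into the new limit and handle dropped proposals (the import path shown is the newer go.etcd.io one, and the sizes and retry policy are illustrative, not part of this change):

```go
package raftexample

import (
	"context"
	"time"

	"go.etcd.io/etcd/raft"
)

// newConfig opts into the new limit; the sizes here are illustrative.
func newConfig(storage raft.Storage) *raft.Config {
	return &raft.Config{
		ID:              0x01,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
		// Optional: cap the aggregate size of uncommitted entries in the
		// leader's log. Zero keeps the previous, unlimited behavior.
		MaxUncommittedEntriesSize: 1 << 30,
	}
}

// proposeWithRetry keeps the usual retry-until-success pattern, but backs off
// when the leader reports that it dropped the proposal because its
// uncommitted log is over the configured limit.
func proposeWithRetry(ctx context.Context, n raft.Node, data []byte) error {
	for {
		err := n.Propose(ctx, data)
		if err != raft.ErrProposalDropped {
			return err // nil on success, or a non-retryable error
		}
		select {
		case <-time.After(100 * time.Millisecond):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```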
The previous logic was erroneously setting Ready.MustSync to true when
the hard state had not changed because we were comparing an empty hard
state to the previous hard state. In combination with another misfeature
in CockroachDB (unnecessary writing of empty batches), this was causing
a steady stream of synchronous writes to disk.
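For context, a sketch of the caller-side contract that the flag gates (the `syncableLog` interface is hypothetical, standing in for the application's write-ahead log): `MustSync` tells the application whether the write of `HardState` and `Entries` must be fsynced before proceeding, so a spurious `true` translates directly into extra synchronous disk writes.

```go
package raftexample

import (
	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// syncableLog is a stand-in for whatever write-ahead log the application uses.
type syncableLog interface {
	Save(st raftpb.HardState, ents []raftpb.Entry) error
	Sync() error
}

// saveReady persists the Ready's durable parts and only pays for an fsync
// when raft says one is required.
func saveReady(wal syncableLog, rd raft.Ready) error {
	if err := wal.Save(rd.HardState, rd.Entries); err != nil {
		return err
	}
	if rd.MustSync {
		return wal.Sync()
	}
	return nil
}
```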
In #9982, a mechanism to limit the size of `CommittedEntries` was
introduced. The way this mechanism worked was that it would load
applicable entries (passing the max size hint) and would emit a
`HardState` whose commit index was truncated to match the limitation
applied to the entries. Unfortunately, this was subtly incorrect
when the user-provided `Entries` implementation didn't exactly
match what Raft uses internally. Depending on whether a `Node` or
a `RawNode` was used, this would either lead to regressing the
HardState's commit index or outright forgetting to apply entries,
respectively.
Asking implementers to precisely match the Raft size limitation
semantics was considered but looks like a bad idea as it puts
correctness squarely in the hands of downstream users. Instead, this
PR removes the truncation of `HardState` when limiting is active
and tracks the applied index separately. This removes the old
paradigm (that the previous code tried to work around) that the
client will always apply all the way to the commit index, which
isn't true when committed entries are paginated.
See [1] for more on the discovery of this bug (CockroachDB's
implementation of `Entries` returns one more entry than Raft's when the
size limit hits).
[1]: https://github.com/cockroachdb/cockroach/issues/28918#issuecomment-418174448
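To illustrate the size-hint semantics that raft applies internally, and that downstream `Storage.Entries` implementations no longer need to reproduce exactly, here is a rough equivalent of the internal truncation (an approximation, not the library's code): always return at least one entry, then stop before the running total exceeds the limit.

```go
package raftexample

import "go.etcd.io/etcd/raft/raftpb"

// limitEntrySize approximates how raft itself applies a max-size hint:
// the first entry is always included (even if it alone exceeds maxSize),
// and later entries are dropped once the running total would pass the limit.
func limitEntrySize(ents []raftpb.Entry, maxSize uint64) []raftpb.Entry {
	if len(ents) == 0 {
		return ents
	}
	size := uint64(ents[0].Size())
	for i := 1; i < len(ents); i++ {
		size += uint64(ents[i].Size())
		if size > maxSize {
			return ents[:i]
		}
	}
	return ents
}
```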
The MaxSizePerMsg setting is now used to limit the size of
Ready.CommittedEntries. This prevents out-of-memory errors if the raft
log has become very large and commits all at once.
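The effect on the application is that committed entries may now arrive spread over several Ready cycles. A simplified apply loop (the persist and apply callbacks are left abstract, and Tick handling, snapshots, and message transport are omitted) keeps working unchanged; it just advances in smaller increments:

```go
package raftexample

import (
	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// runReadyLoop is a simplified Ready loop. With MaxSizePerMsg set,
// rd.CommittedEntries is bounded in size, so a large backlog is applied
// over multiple iterations instead of all at once.
func runReadyLoop(n raft.Node, persist func(raft.Ready) error, apply func(raftpb.Entry)) error {
	for rd := range n.Ready() {
		if err := persist(rd); err != nil { // save HardState and Entries first
			return err
		}
		for _, ent := range rd.CommittedEntries {
			apply(ent)
		}
		n.Advance()
	}
	return nil
}
```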
Scanning the uncommitted portion of the raft log to determine whether
there are any pending config changes can be expensive. In
cockroachdb/cockroach#18601, we've seen that a new leader can spend so
much time scanning its log post-election that it fails to send
its first heartbeats in time to prevent a second election from
starting immediately.
Instead of tracking whether a pending config change exists with a
boolean, this commit tracks the latest log index at which a pending
config change *could* exist. This is a less expensive solution to
the problem, and the impact of false positives should be minimal since
a newly-elected leader should be able to quickly commit the tail of
its log.
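A sketch of the bookkeeping this describes (the names are illustrative, not the exact fields in raft): remember the highest log index that might still hold an unapplied config change, and only allow a new config change proposal once the applied index has caught up.

```go
package raftexample

// confChangeGate tracks the latest log index at which a pending config
// change *could* exist, replacing a scan of the uncommitted log.
type confChangeGate struct {
	pendingConfIndex uint64 // highest index that might hold an unapplied config change
	appliedIndex     uint64
}

// noteConfChangeAt records that a config change entry was appended at idx.
func (g *confChangeGate) noteConfChangeAt(idx uint64) {
	if idx > g.pendingConfIndex {
		g.pendingConfIndex = idx
	}
}

// mayProposeConfChange reports whether another config change can be proposed.
// A false positive only delays the proposal until appliedIndex catches up,
// which a newly elected leader does quickly after committing its log tail.
func (g *confChangeGate) mayProposeConfChange() bool {
	return g.appliedIndex >= g.pendingConfIndex
}
```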
If the node is stopped, Status can hang forever because there is no event
loop left to answer the request, so return an empty status instead to avoid the deadlock.
Fixes #6855
Signed-off-by: Alexander Morozov <lk4d4math@gmail.com>
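The shape of the fix, sketched with illustrative type and channel names: instead of blocking on the (now absent) event loop, select on the node's done channel and fall back to an empty Status.

```go
package raftexample

import "go.etcd.io/etcd/raft"

// statusNode mirrors the relevant pieces of raft's node: a channel served by
// the event loop and a done channel closed on Stop.
type statusNode struct {
	statusc chan chan raft.Status
	done    chan struct{}
}

// Status returns the current status, or an empty one if the node has been
// stopped and no event loop is left to answer the request.
func (n *statusNode) Status() raft.Status {
	c := make(chan raft.Status)
	select {
	case n.statusc <- c:
		return <-c
	case <-n.done:
		return raft.Status{}
	}
}
```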
The relevant structures are properly aligned; however, unlike elsewhere in the
codebase, there is no comment highlighting the need to keep them aligned.
This adds such a note, in line with the similar comments found elsewhere.
Three log statements in node.go were not using the logger passed in via `raft.Config`, but the default raft logger instead. This changes them to use the configured logger.
The raft node should set its initial previous hard state to empty.
Otherwise it will not send the first hard state to the application
until the state changes again.
This commit fixes the issue. It introduces a small overhead: the
same state might be sent to the application twice when restarting,
but this is fine.
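For reference, a sketch of the guard this fix interacts with (the helper names are illustrative; `IsEmptyHardState` is the library's): a HardState is only handed to the application when it is non-empty and differs from the previously emitted one, so seeding the previous state with the empty state guarantees the restored state is delivered at least once.

```go
package raftexample

import (
	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// hardStateToEmit decides whether a Ready should carry the current HardState.
// Starting prev as the empty HardState means the first call after a restart
// emits the restored state, at worst repeating a state the application has
// already persisted.
func hardStateToEmit(cur, prev raftpb.HardState) (raftpb.HardState, bool) {
	if raft.IsEmptyHardState(cur) || hardStateEqual(cur, prev) {
		return raftpb.HardState{}, false
	}
	return cur, true
}

func hardStateEqual(a, b raftpb.HardState) bool {
	return a.Term == b.Term && a.Vote == b.Vote && a.Commit == b.Commit
}
```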
raft relies on the link layer to report the status of a sent snapshot.
While the snapshot is still being sent, replication to that remote peer
is paused. If the snapshot finishes sending, replication resumes
optimistically after electionTimeout. If the snapshot fails, raft will
try to resend it.
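A sketch of what that means for the transport layer (the send function is a stand-in for the application's transport):

```go
package raftexample

import (
	"go.etcd.io/etcd/raft"
	"go.etcd.io/etcd/raft/raftpb"
)

// sendSnapshot delivers a snapshot message and reports the outcome back to
// raft. On failure raft will retry the snapshot; on success, replication to
// the follower resumes optimistically after an election timeout.
func sendSnapshot(n raft.Node, m raftpb.Message, send func(raftpb.Message) error) {
	if err := send(m); err != nil {
		n.ReportSnapshot(m.To, raft.SnapshotFailure)
		return
	}
	n.ReportSnapshot(m.To, raft.SnapshotFinish)
}
```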