This is tested directly at the level of `RawNode` in
`TestRawNodeReadIndex`. `*node` is a thin wrapper around `RawNode` so
this is sufficient.
The reason to remove the test is that it now incurs data races
since it's not possible to adjust the `readStates` and `step`
fields while the node is running, and there is no primitive
to synchronize with its goroutine. This could all be fixed
but it's not worth it.
Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Switched this to baking the conf changes into the initial state
to have fewer cycles to walk through in the test.
Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
This needed to apply entries from CommittedEntries, not Entries.
Previously the test got away with it because the two slices were
equal. Now the test hung: when it proposed the second conf change,
the first one hadn't been applied yet, so the second proposal was
dropped and the test never made progress.
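
A minimal sketch of the distinction, assuming simplified stand-in types (`entry`, `ready`, and `appliedConfChanges` are hypothetical and are not the raft package's own definitions). It only illustrates that conf changes should be picked up from the committed slice rather than from the slice that is merely waiting to be persisted:

```go
package main

import "fmt"

// entry and ready are simplified, hypothetical stand-ins for the raft
// package's entry and Ready types.
type entry struct {
	Index      uint64
	ConfChange bool
}

type ready struct {
	Entries          []entry // appended to the log, still to be persisted
	CommittedEntries []entry // committed and therefore safe to apply
}

// appliedConfChanges returns the indexes of the conf changes that may be
// applied for this ready. It deliberately ignores rd.Entries.
func appliedConfChanges(rd ready) []uint64 {
	var out []uint64
	for _, e := range rd.CommittedEntries {
		if e.ConfChange {
			out = append(out, e.Index)
		}
	}
	return out
}

func main() {
	rd := ready{
		Entries:          []entry{{Index: 2, ConfChange: true}, {Index: 3, ConfChange: true}},
		CommittedEntries: []entry{{Index: 2, ConfChange: true}},
	}
	fmt.Println(appliedConfChanges(rd)) // [2]
}
```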
Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
`StartNode` runs a naked goroutine, so it's impossible to test against
it in a way that will reliably produce contained test failures when
assertions are hit on the `(*node).run` goroutine.
This commit introduces a harness that we can use in tests to wrap
this goroutine and allow it to defer errors to `*testing.T`.
Note that tests of `Node` still need to be architected carefully
since it's easy to produce a deadlock in them should things not
go exactly as planned.
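
For illustration only, here is one way such a harness could look. This is a sketch under assumptions, not the harness added by this commit: `reporter`, `harness`, and `recorder` are hypothetical names, and `reporter` stands in for the `Errorf` method of `*testing.T` so the example runs outside of `go test`.

```go
package main

import (
	"fmt"
	"sync"
)

// reporter is the subset of *testing.T this sketch needs (hypothetical).
type reporter interface {
	Errorf(format string, args ...interface{})
}

// harness runs a function on its own goroutine and turns a panic on that
// goroutine into a reported error instead of crashing the test binary.
type harness struct {
	wg sync.WaitGroup
}

func (h *harness) run(t reporter, fn func()) {
	h.wg.Add(1)
	go func() {
		defer h.wg.Done() // runs last, after the panic has been reported
		defer func() {
			if r := recover(); r != nil {
				t.Errorf("goroutine panicked: %v", r)
			}
		}()
		fn()
	}()
}

// wait blocks until every wrapped goroutine has returned.
func (h *harness) wait() { h.wg.Wait() }

// recorder is a fake reporter that collects error messages.
type recorder struct {
	mu   sync.Mutex
	errs []string
}

func (r *recorder) Errorf(format string, args ...interface{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.errs = append(r.errs, fmt.Sprintf(format, args...))
}

func main() {
	rec := &recorder{}
	var h harness
	h.run(rec, func() { panic("assertion failed") })
	h.wait()
	fmt.Println(rec.errs) // [goroutine panicked: assertion failed]
}
```

A wrapper like this contains failures but does nothing about deadlocks: if the wrapped function blocks forever, `wait` blocks with it.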
Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
raft: fix goroutine leaks in TestCommitPagination
The goroutine created with n.run() will leak if we forget to call n.Stop().
We can reproduce the goroutine leak using [goleak](https://github.com/uber-go/goleak):
```
$ cd raft && env go test -short -v -timeout=3m --race -run=TestCommitPagination.
... ...
raft2021/12/27 20:47:15 INFO: raft.node: 1 elected leader 1 at term 1
leaks.go:78: found unexpected goroutines:
[Goroutine 20 in state select, with go.etcd.io/etcd/raft/v3.(*node).run on top of the stack:
goroutine 20 [select]:
go.etcd.io/etcd/raft/v3.(*node).run(0xc00036f260)
/home/yuanting/work/dev/goprojects/etcd/raft/node.go:344 +0xc1d
created by go.etcd.io/etcd/raft/v3.TestCommitPagination
/home/yuanting/work/dev/goprojects/etcd/raft/node_test.go:920 +0x554
]
--- FAIL: TestCommitPagination (0.45s)
FAIL
FAIL go.etcd.io/etcd/raft/v3 0.508s
FAIL
```
This change makes the etcd package compatible with the existing Go
ecosystem for module versioning.
Used this tool to update package imports:
https://github.com/KSubedi/gomove
It has a data race between the test's call to `reduceUncommittedSize`
and a corresponding call during Ready handling in `(*node).run()`.
The corresponding RawNode test still verifies the functionality, so
instead of fixing the test we can remove it.
This is the first (maybe not last) step in cleaning up the bootstrap
code around StartNode.
Initializing a Raft group for the first time is awkward, since a
configuration has to be pulled from thin air. The way this is solved
today is unclean: The app is supposed to pass peers to StartNode(),
we add configuration changes for them to the log, immediately pretend
that they are applied, but actually leave them unapplied (to give the
app a chance to observe them, though if the app did decide to not apply
them things would really go off the rails), and then return control to
the app. The app will then process the initial Readys and as a result
the configuration will be persisted to disk; restarts of the node then
use RestartNode which doesn't take any peers.
The code that did this lived awkwardly in two places fairly deep down
the callstack, though it was really only necessary in StartNode(). This
commit refactors things to make this more obvious: only StartNode does
this dance now. In particular, RawNode does not support this at all any
more; it expects the app to set up its Storage correctly.
Future work may provide helpers to make this "preseeding" of the Storage
more user-friendly. It isn't entirely straightforward to do so since
the Storage interface doesn't provide the right accessors for this
purpose. Briefly speaking, we want to make sure that a non-bootstrapped
node can never catch up via the log so that we can implicitly use one
of the "skipped" log entries to represent the configuration change into
the bootstrap configuration. This is an invasive change that affects
all consumers of raft, and it is of lower urgency since the code (post
this commit) already encapsulates the complexity sufficiently.
It has always bugged me that any new feature essentially needed to be
tested twice due to the two ways in which apps can use raft (`*node` and
`*RawNode`). Due to upcoming testing work for joint consensus, now is a
good time to rectify this somewhat.
This commit removes most logic from `(*node).run` and uses `*RawNode`
internally. This simplifies the logic and also led (via debugging) to
some insight on how the semantics of the approaches differ, which is now
documented in the comments.
Prior to this change, MaxSizePerMsg was used both to cap the total byte size of
entries in messages as well as the total byte size of entries passed through
CommittedEntries in the Ready struct. This change adds a new Config parameter
MaxCommittedSizePerReady which defaults to MaxSizePerMsg and controls the second
of the settings described above.
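
The defaulting rule can be sketched as follows. The `config` struct and `withDefaults` helper are hypothetical, and treating a zero value as "unset" is an assumption of this sketch rather than something stated above:

```go
package main

import "fmt"

// config is a hypothetical stand-in holding only the two fields discussed.
type config struct {
	MaxSizePerMsg            uint64
	MaxCommittedSizePerReady uint64
}

// withDefaults fills in MaxCommittedSizePerReady from MaxSizePerMsg when it
// was left at zero (assumed here to mean "not configured").
func (c config) withDefaults() config {
	if c.MaxCommittedSizePerReady == 0 {
		c.MaxCommittedSizePerReady = c.MaxSizePerMsg
	}
	return c
}

func main() {
	unset := config{MaxSizePerMsg: 1 << 20}.withDefaults()
	explicit := config{MaxSizePerMsg: 1 << 20, MaxCommittedSizePerReady: 4 << 20}.withDefaults()
	fmt.Println(unset.MaxCommittedSizePerReady)    // 1048576
	fmt.Println(explicit.MaxCommittedSizePerReady) // 4194304
}
```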
The previous code was using the proto-generated `Size()` method to
track the size of an incoming proposal at the leader. This includes
the Index and Term, which were mutated after the call to `Size()`
when appending to the log. Additionally, it was not taking into
account that an ignored configuration change would ignore the
original proposal and append an empty entry instead.
As a result, a fully committed Raft group could end up with a non-
zero tracked uncommitted Raft log counter that would eventually hit
the ceiling and drop all future proposals indiscriminately. It would
also immediately imply that proposals exceeding the threshold alone
would get refused (as the "first uncommitted proposal" gets special
treatment and is always allowed in).
Track only the size of the payload actually appended to the Raft log
instead.
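
A small sketch of why payload-only accounting stays balanced, using a hypothetical `entry` type and `payloadSize` helper (not the raft package's own): the bytes added at proposal time equal the bytes subtracted at commit time even though Term and Index are assigned in between.

```go
package main

import "fmt"

// entry is a simplified, hypothetical log entry.
type entry struct {
	Term, Index uint64
	Data        []byte
}

// payloadSize counts only the Data bytes, ignoring Term and Index, so the
// result does not change when those fields are filled in later.
func payloadSize(ents []entry) uint64 {
	var n uint64
	for _, e := range ents {
		n += uint64(len(e.Data))
	}
	return n
}

func main() {
	var uncommitted uint64

	// Proposal arrives without Term or Index.
	proposed := []entry{{Data: []byte("put k v")}}
	uncommitted += payloadSize(proposed)

	// Appending to the log assigns Term and Index; the payload is unchanged.
	appended := []entry{{Term: 5, Index: 100, Data: proposed[0].Data}}

	// Once committed, the same number of bytes is released again.
	uncommitted -= payloadSize(appended)
	fmt.Println(uncommitted) // 0
}
```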
For context, see:
https://github.com/cockroachdb/cockroach/issues/31618#issuecomment-431374938
The suggested pattern for Raft proposals is that they be retried
periodically until they succeed. This turns out to be an issue
when a leader cannot commit entries because the leader will continue
to append re-proposed entries to its log without committing anything.
This can result in the uncommitted tail of a leader's log growing
without bound until it is able to commit entries.
This change adds a safeguard to protect against this case where a
leader's log can grow without bound during loss of quorum scenarios.
It does so by introducing a new, optional `MaxUncommittedEntriesSize`
configuration. This config limits the max aggregate size of uncommitted
entries that may be appended to a leader's log. Once this limit
is exceeded, proposals will begin to return ErrProposalDropped
errors.
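
The limit can be sketched as below. The `leader` type, its fields, and `errProposalDropped` are hypothetical stand-ins; the rule that a proposal is always admitted when nothing is uncommitted follows the "first uncommitted proposal" behaviour described elsewhere in this history and is treated as an assumption here.

```go
package main

import (
	"errors"
	"fmt"
)

// errProposalDropped is a hypothetical stand-in for ErrProposalDropped.
var errProposalDropped = errors.New("proposal dropped")

// leader tracks the aggregate payload size of its uncommitted log tail.
type leader struct {
	uncommittedSize    uint64
	maxUncommittedSize uint64
}

// propose admits the payload unless it would push the uncommitted tail over
// the limit. A proposal is always admitted when the tail is empty.
func (l *leader) propose(data []byte) error {
	s := uint64(len(data))
	if l.uncommittedSize > 0 && s > 0 && l.uncommittedSize+s > l.maxUncommittedSize {
		return errProposalDropped
	}
	l.uncommittedSize += s
	return nil
}

// committed releases the given number of payload bytes, clamping at zero.
func (l *leader) committed(size uint64) {
	if size > l.uncommittedSize {
		l.uncommittedSize = 0
		return
	}
	l.uncommittedSize -= size
}

func main() {
	l := &leader{maxUncommittedSize: 10}
	fmt.Println(l.propose(make([]byte, 8))) // <nil>
	fmt.Println(l.propose(make([]byte, 8))) // proposal dropped
	l.committed(8)
	fmt.Println(l.propose(make([]byte, 8))) // <nil>
}
```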
See cockroachdb/cockroach#27772
In #9982, a mechanism to limit the size of `CommittedEntries` was
introduced. The way this mechanism worked was that it would load
applicable entries (passing the max size hint) and would emit a
`HardState` whose commit index was truncated to match the limitation
applied to the entries. Unfortunately, this was subtly incorrect
when the user-provided `Entries` implementation didn't exactly
match what Raft uses internally. Depending on whether a `Node` or
a `RawNode` was used, this would either lead to regressing the
HardState's commit index or outright forgetting to apply entries,
respectively.
Asking implementers to precisely match the Raft size limitation
semantics was considered but looks like a bad idea as it puts
correctness squarely in the hands of downstream users. Instead, this
PR removes the truncation of `HardState` when limiting is active
and tracks the applied index separately. This removes the old
paradigm (that the previous code tried to work around) that the
client will always apply all the way to the commit index, which
isn't true when commit entries are paginated.
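
As an illustration of paginated application with a separately tracked applied index, here is a sketch with hypothetical names (`entry`, `limitSize`). Returning at least one entry per page is an assumption of the sketch that keeps the loop making progress; the commit index is never rewritten.

```go
package main

import "fmt"

// entry is a simplified, hypothetical log entry.
type entry struct {
	Index uint64
	Data  []byte
}

// limitSize returns the longest prefix of ents whose total payload does not
// exceed maxSize, but always at least one entry when ents is non-empty.
func limitSize(ents []entry, maxSize uint64) []entry {
	if len(ents) == 0 {
		return ents
	}
	size := uint64(len(ents[0].Data))
	n := 1
	for ; n < len(ents); n++ {
		size += uint64(len(ents[n].Data))
		if size > maxSize {
			break
		}
	}
	return ents[:n]
}

func main() {
	// Entries with Index 1..3 are stored at slice positions 0..2.
	raftLog := []entry{
		{Index: 1, Data: make([]byte, 4)},
		{Index: 2, Data: make([]byte, 4)},
		{Index: 3, Data: make([]byte, 4)},
	}
	commit, applied := uint64(3), uint64(0)

	// The commit index stays at 3; only applied advances page by page.
	for applied < commit {
		page := limitSize(raftLog[applied:commit], 8)
		applied = page[len(page)-1].Index
		fmt.Println(len(page), applied, commit)
	}
	// prints:
	// 2 2 3
	// 1 3 3
}
```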
See [1] for more on the discovery of this bug (CockroachDB's
implementation of `Entries` returns one more entry than Raft's when the
size limit hits).
[1]: https://github.com/cockroachdb/cockroach/issues/28918#issuecomment-418174448
The MaxSizePerMsg setting is now used to limit the size of
Ready.CommittedEntries. This prevents out-of-memory errors if the raft
log has become very large and commits all at once.
Scanning the uncommitted portion of the raft log to determine whether
there are any pending config changes can be expensive. In
cockroachdb/cockroach#18601, we've seen that a new leader can spend so
much time scanning its log post-election that it fails to send
its first heartbeats in time to prevent a second election from
starting immediately.
Instead of tracking whether a pending config change exists with a
boolean, this commit tracks the latest log index at which a pending
config change *could* exist. This is a less expensive solution to
the problem, and the impact of false positives should be minimal since
a newly-elected leader should be able to quickly commit the tail of
its log.
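
A sketch of the index-based check, with hypothetical type and method names. It simplifies the refusal to a boolean result, and the assumption that a new leader conservatively points the index at the end of its log is this sketch's reading of "could exist":

```go
package main

import "fmt"

// leader holds only the indexes relevant to the check (hypothetical type).
type leader struct {
	lastIndex        uint64 // index of the last entry in the log
	applied          uint64 // highest index applied to the state machine
	pendingConfIndex uint64 // a conf change may be pending at or below this index
}

// becomeLeader avoids scanning the log: it assumes any entry up to lastIndex
// could be an unapplied conf change.
func (l *leader) becomeLeader() { l.pendingConfIndex = l.lastIndex }

// proposeConfChange reports whether a new conf change was accepted. It is
// refused while an earlier one may still be unapplied.
func (l *leader) proposeConfChange() bool {
	if l.pendingConfIndex > l.applied {
		return false
	}
	l.lastIndex++
	l.pendingConfIndex = l.lastIndex
	return true
}

func main() {
	l := &leader{lastIndex: 10, applied: 7}
	l.becomeLeader()
	fmt.Println(l.proposeConfChange()) // false: entries 8..10 are not applied yet
	l.applied = 10
	fmt.Println(l.proposeConfChange()) // true: tail applied, new change at index 11
	fmt.Println(l.proposeConfChange()) // false: index 11 is still pending
}
```

The first refusal is a possible false positive, since entries 8 through 10 may contain no conf change at all; it clears as soon as the tail is applied.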
TestNodeTick relies on an unreliable func `waitForSchedule` when running
with GOMAXPROCS > 1. This commit changes the test to make sure we stop
the node after it drains the tick chan. The test should be reliable now.
Running the test script produced the following gosimple suggestions, so this PR fixes the gosimple S1019 check:
raft/node_test.go:456:40: should use make([]raftpb.Entry, 1) instead (S1019)
raft/node_test.go:457:49: should use make([]raftpb.Entry, 1) instead (S1019)
raft/node_test.go:458:43: should use make([]raftpb.Message, 1) instead (S1019)
Refer to https://github.com/dominikh/go-tools/blob/master/cmd/gosimple/README.md#checks for more information.
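
For reference, a tiny example of the pattern behind this check, using `[]int` rather than the raftpb types so it stays self-contained. When the capacity argument equals the length, it is redundant:

```go
package main

import "fmt"

func main() {
	verbose := make([]int, 1, 1) // redundant capacity argument
	simple := make([]int, 1)     // equivalent: capacity defaults to the length
	fmt.Println(len(verbose), cap(verbose), len(simple), cap(simple)) // 1 1 1 1
}
```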
n.Tick() is async, so it can race with n.Stop().
n.Status() is sync and has a feedback mechanism internally, so there won't be
any race between the n.Status() and n.Stop() calls.