TestNodeWithSmallerTermCanCompleteElection tests the scenario where a
node that has been partitioned away (and fallen behind) rejoins the
cluster at about the same time the leader node gets partitioned away.
Previously the cluster would come to a standstill when run with PreVote
enabled.
When responding to Msg{Pre,}Vote messages we now include the term from
the message, not the local term. To see why consider the case where a
single node was previously partitioned away and it's local term is now
of date. If we include the local term (recall that for pre-votes we
don't update the local term), the (pre-)campaigning node on the other
end will proceed to ignore the message (it ignores all out of date
messages).
The term in the original message and current local term are the same in
the case of regular votes, but different for pre-votes.
NB: Had to change TestRecvMsgVote to include pb.Message.Term when
sending MsgVote messages. The new sanity checks on MsgVoteResp
(m.Term != 0) would panic with the old test as raft.Term would be equal
to 0 when responding with MsgVoteResp messages.
I found that enabling the CheckQuorum flag led to spurious leader
elections when new nodes joined. It looks like in the time between a new
node joining the cluster, and that node first communicating with the
leader, the quorum check could fail because the new node looks inactive.
To solve this, set the RecentActive flag when nodes are first added.
This gives a grace period for the node to communicate before it causes
the quorum check to fail.
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
After add node conf proposed twice with the same node id, the pending state is not reset because
the addNode returned without setting the pending state at the second
time and the pending state will always be true unless other conf changed. During this we
can not add any new node because the propose will be ignored since the
pending state is true.
If MsgTimeoutNow arrived after a node was removed, the node could start
and win an election, then panic in becomeLeader (see
cockroachdb/cockroach#8535)
Move all vote handling from the per-state step functions to the
top-level Step(). This wasn't necessary before because MsgVote would
cause us to become a follower, but MsgPreVote needs to be handled
without changing the node's current state.
rand.NewSource creates a 4872 byte object. With a small number of raft
groups in a process this isn't a problem. With 10k raft groups we'd use
46MB for these random sources. The only usage is in
raft.resetRandomizedElectionTimeout which isn't performance critical.
Fixes#6347.
Previously, the checkQuorum flag required an election timeout to
expire before a node could cast its first vote. This change permits
the node to cast a vote at any time when the leader is not known,
including immediately after startup.
Follower has already set its leader ID from
previous append messages from the leader, but
to be consistent, this adds a line to set its
leader id from leader snapshot message.
And 'maybeSnapshotAbort' does not 'unset'
the pendingSnapshot. 'resetState', which is called after this
metho, is the one that unsets pendingSnapshot. So this changes
the method name.