removeNode reduces the required quorum size, so some pending entries may
be able to commit after it is applied.
Discovered in cockroachdb/cockroach#3642
We need to be able to force an election (on one node) after creating a
new group (cockroachdb/cockroach#1384), but it is difficult to ensure
that our call to Campaign does not race with an election that may be
started by raft itself. A redundant call to Campaign should be a no-op
instead of a panic. (But the panic in becomeCandidate remains, because
we don't want to update the term or change the committed index in this
case)
sendApp accesses the storage several times. Perviously, we
assume that the storage will not be modified during the read
opeartions. The assumption is not true since the storage can
be compacted between the read operations. If a compaction
causes a read entries error, we should not painc. Instead, we
can simply retry the sendApp logic until succeed.
Each progress has a inflighs sliding window. When the progress
is in replicate state, inflights will control the sending speed
of the leader.
The leader can have at most maxInflight number of inflight
messages for each replicate progress. Receving a appResp moves
forward the sliding window. Heartbeat response free one
slot if the window is full.
limit the max size of entries sent per message.
Lower the cost at probing state as we limit the size per message;
lower the penalty when aggressively decrease to a too low next.
This addresses a problem that comes up in the cockroach tests,
in which the order of messages may lead to deadlocks (due to
the fact that we don't have regular heartbeat timers in most
of our tests).
Now that heartbeats are distinct from MsgApp{,Resp}, the retries
currently performed in stepLeader's MsgAppResp section are only
performed on an actual MsgAppResp (or a new MsgProp). This means
that it may take a long time to recover from a dropped MsgAppResp
in a quiet cluster.
This commit adds a dedicated heartbeat response message. This message
does not convey the follower's current log position because the
MsgHeartbeat does not include the leaders term and index. Upon receipt
of a heartbeat response, the leader may retry the latest MsgApp if it
believes the follower to be behind.
It is reasonable for the leader to wait for the reply before sending out the next
msgApp or msgSnap for the follower in bad path. Or the leader will send out useless
messages if the previous message is rejected or the previous message is a snapshot.
Especially for the snapshot case, the leader will be 100% to send out duplicate message
including the snapshot, which is a huge waste.
This commit implement a timeout based wait mechanism. The timeout for normal msgApp is a
heartbeatTimeout and the timeout for snapshot is electionTimeout(snapshot is larger). We
can implement a piggyback mechanism(application notifies the msg lost) in the future
if necessary.
stableTo should only mark the index stable if the term is matched. After raft sends out unstable
entries to application, raft makes progress without waiting for reply. When the appliaction
calls the stableTo to notify the entries up to "index" are stable, raft might have truncated
some entries before "index" due to leader lost. raft must verify the (index,term) of stableTo,
before marking the entries as stable.