Each progress has a inflighs sliding window. When the progress
is in replicate state, inflights will control the sending speed
of the leader.
The leader can have at most maxInflight number of inflight
messages for each replicate progress. Receving a appResp moves
forward the sliding window. Heartbeat response free one
slot if the window is full.
limit the max size of entries sent per message.
Lower the cost at probing state as we limit the size per message;
lower the penalty when aggressively decrease to a too low next.
Follower should not reject the append message with a smaller index than its commit
index. Or it will trigger the leader's resending logic, which might have a high cost.
raft relies on the link layer to report the status of the sent snapshot.
If the snapshot is still sending, the replication to that remote peer will
be paused. If the snapshot finish sending, the replication will begin
optimistically after electionTimeout. If the snapshot fails, raft will
try to resend it.
This addresses a problem that comes up in the cockroach tests,
in which the order of messages may lead to deadlocks (due to
the fact that we don't have regular heartbeat timers in most
of our tests).
Now that heartbeats are distinct from MsgApp{,Resp}, the retries
currently performed in stepLeader's MsgAppResp section are only
performed on an actual MsgAppResp (or a new MsgProp). This means
that it may take a long time to recover from a dropped MsgAppResp
in a quiet cluster.
This commit adds a dedicated heartbeat response message. This message
does not convey the follower's current log position because the
MsgHeartbeat does not include the leaders term and index. Upon receipt
of a heartbeat response, the leader may retry the latest MsgApp if it
believes the follower to be behind.
In code outside the raft package, we cannot call raft.bcastHeartbeat
directly. Instead, to control heartbeats we set heartbeatInterval to 1
and call Tick().
It is reasonable for the leader to wait for the reply before sending out the next
msgApp or msgSnap for the follower in bad path. Or the leader will send out useless
messages if the previous message is rejected or the previous message is a snapshot.
Especially for the snapshot case, the leader will be 100% to send out duplicate message
including the snapshot, which is a huge waste.
This commit implement a timeout based wait mechanism. The timeout for normal msgApp is a
heartbeatTimeout and the timeout for snapshot is electionTimeout(snapshot is larger). We
can implement a piggyback mechanism(application notifies the msg lost) in the future
if necessary.
This panic can never be reached when using raft.Node, because we only
read from propc when there is a leader. However, it is possible to see
this error when using raft the raft object directly (as in MultiNode),
and in this case it is better to simply drop the proposal (as if we had
sent it to a leader that immediately vanished).
Add an error return to MemoryStorage.Append for consistency.