diff --git a/raft/doc.go b/raft/doc.go
index 4d80727ab..253c336f2 100644
--- a/raft/doc.go
+++ b/raft/doc.go
@@ -13,7 +13,8 @@
 // limitations under the License.
 
 /*
-Package raft provides an implementation of the raft consensus algorithm.
+Package raft sends and receives messages in the Protocol Buffer format
+defined in the raftpb package.
 
 Raft is a protocol by which a cluster of nodes can maintain a replicated state machine.
 The state machine is kept in sync through the use of a replicated log.
@@ -180,5 +181,101 @@ cannot be removed any more since the cluster cannot make progress. For this reason it is highly recommended to use three or more nodes in every cluster.
+MessageType
+
+Package raft sends and receives messages in Protocol Buffer format (defined
+in the raftpb package). Each state (follower, candidate, leader) implements
+its own 'step' method ('stepFollower', 'stepCandidate', 'stepLeader') for
+advancing with a given raftpb.Message. Which step is taken is determined by
+the message's raftpb.MessageType. Note that every message first passes
+through one common method, 'Step', which safety-checks the terms of the node
+and the incoming message to prevent stale log entries:
+
+ 'MsgHup' is used for elections. If a node is a follower or candidate, the
+ 'tick' function in the 'raft' struct is set to 'tickElection'. If a
+ follower or candidate has not received any heartbeat before the election
+ timeout, it passes 'MsgHup' to its Step method and becomes (or remains) a
+ candidate to start a new election.
+
+ 'MsgBeat' is an internal type that signals the leader to send a heartbeat
+ of the 'MsgHeartbeat' type. If a node is a leader, the 'tick' function in
+ the 'raft' struct is set to 'tickHeartbeat', which triggers the leader to
+ send periodic 'MsgHeartbeat' messages to its followers.
+
+ 'MsgProp' proposes appending data to the log. This is a special
+ type used to redirect proposals to the leader.
+ Therefore, the send method attaches the node's current term to every
+ outgoing message except 'MsgProp', so that a proposal carries no local
+ term and is treated as a local message by the leader. When 'MsgProp'
+ is passed to the leader's 'Step' method, the leader first calls the
+ 'appendEntry' method to append the entries to its log, and then calls the
+ 'bcastAppend' method to send those entries to its peers. When passed to a
+ candidate, 'MsgProp' is dropped. When passed to a follower, 'MsgProp' is
+ stored in the follower's mailbox (msgs) by the send method, together with
+ the sender's ID, and is later forwarded to the leader by the rafthttp package.
+
+ 'MsgApp' contains log entries to replicate. A leader calls bcastAppend,
+ which calls sendAppend, which sends soon-to-be-replicated logs as 'MsgApp'
+ messages. When 'MsgApp' is passed to a candidate's Step method, the
+ candidate reverts back to follower, because the message indicates that
+ there is a valid leader sending 'MsgApp' messages. Candidates and followers
+ respond to this message with the 'MsgAppResp' type.
+
+ 'MsgAppResp' is the response to a log replication request ('MsgApp'). When
+ 'MsgApp' is passed to a candidate's or follower's Step method, the node
+ responds by calling the 'handleAppendEntries' method, which sends
+ 'MsgAppResp' to the raft mailbox.
+
+ 'MsgVote' requests votes for an election. When a node is a follower or
+ candidate and 'MsgHup' is passed to its Step method, the node calls the
+ 'campaign' method to campaign to become the leader. Once 'campaign' is
+ called, the node becomes a candidate and sends 'MsgVote' to the peers in
+ the cluster to request votes. When a leader or candidate receives 'MsgVote'
+ with a Term lower than its own, the vote is rejected ('MsgVoteResp' is
+ returned with Reject true). If a leader or candidate receives 'MsgVote'
+ with a higher term, it reverts back to follower.
+ When 'MsgVote' is passed to a follower, it grants the vote only if it has
+ not voted for another candidate in this term and the sender's log is at
+ least as up-to-date as its own: the sender's last log term is greater than
+ the follower's, or the last log terms are equal and the sender's last log
+ index is greater than or equal to the follower's.
+
+ 'MsgVoteResp' contains responses to a vote request. When 'MsgVoteResp' is
+ passed to a candidate, the candidate counts how many votes it has won. If
+ it has won a majority (quorum), it becomes leader and calls 'bcastAppend'.
+ If the candidate receives a majority of rejections, it reverts back to
+ follower.
+
+ 'MsgSnap' requests the installation of a snapshot. When a node has just
+ become leader, or the leader receives a 'MsgProp' message, it calls the
+ 'bcastAppend' method, which then calls 'sendAppend' for each follower. In
+ 'sendAppend', if the leader fails to get the term or entries needed by a
+ follower, it sends that follower a snapshot in a 'MsgSnap' message.
+
+ 'MsgSnapStatus' reports the result of a snapshot installation. When it is
+ marked as rejected, sending the snapshot in 'MsgSnap' failed, typically
+ because a network issue prevented the network layer from delivering it,
+ and the leader sets the follower's progress to probe. When it is not
+ rejected, the snapshot succeeded; the leader likewise sets the follower's
+ progress to probe and resumes log replication.
+
+ 'MsgHeartbeat' sends a heartbeat from the leader. When a candidate receives
+ 'MsgHeartbeat' with a term higher than its own, it reverts back to follower,
+ updates its committed index from the one in this heartbeat, and places a
+ 'MsgHeartbeatResp' in its mailbox. When a follower receives 'MsgHeartbeat'
+ with a term higher than its own, it updates its leaderID with the ID from
+ the message.
+
+ 'MsgHeartbeatResp' is a response to 'MsgHeartbeat'.
+ When 'MsgHeartbeatResp' is passed to the leader's Step method, the leader
+ learns which follower responded, and it calls 'sendAppend' only when its
+ last log index is greater than that follower's Match index.
+
+ 'MsgUnreachable' indicates that a request (message) was not delivered.
+ When 'MsgUnreachable' is passed to the leader's Step method, the leader
+ learns that the follower named in this 'MsgUnreachable' is not reachable,
+ often indicating that 'MsgApp' was lost. If the follower's progress state
+ is replicate, the leader sets it back to probe.
+
 */
 package raft
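The MessageType section of the doc comment describes a two-level dispatch: the common 'Step' method checks terms first, then a role-specific step function handles the message. The Go sketch below illustrates that shape under stated assumptions: StateType, Message and raft are simplified, hypothetical stand-ins rather than the real raft and raftpb definitions, only 'MsgHup' and 'MsgApp' are modeled, and a zero Term is assumed to mark a local message.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the raft and raftpb types.
type StateType int

const (
	StateFollower StateType = iota
	StateCandidate
	StateLeader
)

type MessageType int

const (
	MsgHup MessageType = iota // local: start an election
	MsgApp                    // log entries from a leader
)

type Message struct {
	Type MessageType
	From uint64
	Term uint64 // 0 marks a local message in this sketch
}

type raft struct {
	state StateType
	term  uint64
	lead  uint64
	step  func(r *raft, m Message) // role-specific step function
}

func (r *raft) becomeFollower(term, lead uint64) {
	r.state, r.term, r.lead, r.step = StateFollower, term, lead, stepFollower
}

func (r *raft) becomeCandidate() {
	r.state, r.term, r.lead, r.step = StateCandidate, r.term+1, 0, stepCandidate
}

// Step performs the term checks shared by every role, then dispatches.
func (r *raft) Step(m Message) {
	switch {
	case m.Term == 0:
		// Local message: no term check.
	case m.Term > r.term:
		// A higher term always demotes the node to follower.
		r.becomeFollower(m.Term, 0)
	case m.Term < r.term:
		// Stale message from an older term: ignore it.
		return
	}
	if m.Type == MsgHup {
		if r.state != StateLeader {
			r.becomeCandidate() // a real node would now send MsgVote to peers
		}
		return
	}
	r.step(r, m)
}

func stepFollower(r *raft, m Message) {
	if m.Type == MsgApp {
		r.lead = m.From // a real node would also append entries and reply MsgAppResp
	}
}

func stepCandidate(r *raft, m Message) {
	if m.Type == MsgApp {
		// A MsgApp in our term means a valid leader exists: revert to follower.
		r.becomeFollower(m.Term, m.From)
		r.step(r, m)
	}
}

func main() {
	r := &raft{}
	r.becomeFollower(1, 0)
	r.Step(Message{Type: MsgHup})                  // election timeout fired
	fmt.Println(r.state == StateCandidate, r.term) // prints: true 2
	r.Step(Message{Type: MsgApp, From: 3, Term: 2})
	fmt.Println(r.state == StateFollower, r.lead) // prints: true 3
}
```

Keeping the term comparison in one place means each role-specific function can assume it only sees messages from its current term or local messages.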
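The 'MsgProp' paragraph turns on a detail of the send method: proposals are forwarded to the leader without a term. A minimal sketch of that rule, assuming a hypothetical trimmed-down Message and node type (not the real raftpb.Message or raft struct):

```go
package main

import "fmt"

// MessageType and Message are hypothetical, trimmed-down stand-ins for the
// raftpb types; only the fields this rule touches are kept.
type MessageType int

const (
	MsgProp MessageType = iota
	MsgApp
)

type Message struct {
	Type MessageType
	From uint64
	To   uint64
	Term uint64
}

// node is a hypothetical stand-in for the raft struct.
type node struct {
	id   uint64
	term uint64
	msgs []Message // outgoing mailbox, drained by the application
}

// send stamps the sender ID on every message, and the current term on every
// message except MsgProp, so a forwarded proposal keeps Term == 0.
func (n *node) send(m Message) {
	m.From = n.id
	if m.Type != MsgProp {
		m.Term = n.term
	}
	n.msgs = append(n.msgs, m)
}

func main() {
	n := &node{id: 2, term: 5}
	n.send(Message{Type: MsgProp, To: 1}) // follower forwarding a proposal
	n.send(Message{Type: MsgApp, To: 3})
	for _, m := range n.msgs {
		fmt.Printf("type=%d from=%d to=%d term=%d\n", m.Type, m.From, m.To, m.Term)
	}
}
```

Leaving Term at zero lets the receiving leader handle the proposal like a locally generated message instead of comparing it against its own term.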
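The 'MsgVote' and 'MsgVoteResp' paragraphs combine two checks: whether a follower may grant its vote, and whether a candidate has gathered a quorum. The sketch below assumes hypothetical voter, isUpToDate, handleVote and tally helpers that are not the raft package's API; it also assumes the term comparison has already happened in 'Step', and uses 0 to mean "has not voted".

```go
package main

import "fmt"

// voter is a hypothetical, minimal view of a follower's voting state.
type voter struct {
	term      uint64 // current term
	votedFor  uint64 // 0 means "has not voted in this term"
	lastIndex uint64 // index of the last entry in the follower's log
	lastTerm  uint64 // term of the last entry in the follower's log
}

// isUpToDate reports whether a candidate's log, described by its last log
// index and last log term, is at least as up-to-date as the voter's log.
func (v *voter) isUpToDate(candLastIndex, candLastTerm uint64) bool {
	return candLastTerm > v.lastTerm ||
		(candLastTerm == v.lastTerm && candLastIndex >= v.lastIndex)
}

// handleVote decides whether to grant a vote for a MsgVote in the voter's
// current term.
func (v *voter) handleVote(candidate, candLastIndex, candLastTerm uint64) bool {
	if (v.votedFor == 0 || v.votedFor == candidate) &&
		v.isUpToDate(candLastIndex, candLastTerm) {
		v.votedFor = candidate
		return true
	}
	return false
}

// tally counts MsgVoteResp results for a cluster of n voters. won is true once
// grants reach a quorum; lost is true once rejections reach a quorum.
func tally(votes map[uint64]bool, n int) (won, lost bool) {
	quorum := n/2 + 1
	granted, rejected := 0, 0
	for _, ok := range votes {
		if ok {
			granted++
		} else {
			rejected++
		}
	}
	return granted >= quorum, rejected >= quorum
}

func main() {
	f := &voter{term: 3, lastIndex: 10, lastTerm: 2}
	fmt.Println(f.handleVote(1, 8, 3))  // true: higher last log term is more up-to-date
	fmt.Println(f.handleVote(4, 12, 3)) // false: already voted for 1 in this term

	won, lost := tally(map[uint64]bool{1: true, 2: true, 3: false}, 5)
	fmt.Println(won, lost) // false false: 2 of 5 is not yet a quorum either way
}
```

Checking rejections against the same quorum lets a candidate give up as soon as winning becomes impossible, rather than waiting for every response.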
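The last paragraphs describe how the leader adjusts its view of a follower in response to 'MsgHeartbeatResp', 'MsgUnreachable' and 'MsgSnapStatus'. The sketch below is a simplified model of that bookkeeping, not the raft package's code: Progress, leader and the on* handlers are hypothetical names, sendAppend only records its target, the heartbeat check compares Match with the leader's last log index, and clearing PendingSnapshot on failure is an assumption of the sketch.

```go
package main

import "fmt"

// ProgressState models the follower states named in the doc comment.
// Hypothetical stand-in, not the real raft type.
type ProgressState int

const (
	StateProbe ProgressState = iota
	StateReplicate
	StateSnapshot
)

// Progress is a hypothetical, trimmed-down record the leader keeps per follower.
type Progress struct {
	Match           uint64 // highest log index known to be replicated on the follower
	State           ProgressState
	PendingSnapshot uint64 // index of the snapshot in flight, 0 if none
}

type leader struct {
	lastIndex uint64
	prs       map[uint64]*Progress
	sent      []uint64 // records sendAppend targets instead of sending MsgApp
}

func (l *leader) sendAppend(to uint64) { l.sent = append(l.sent, to) }

// onHeartbeatResp resends entries only to a follower that is behind.
func (l *leader) onHeartbeatResp(from uint64) {
	pr, ok := l.prs[from]
	if !ok {
		return
	}
	if pr.Match < l.lastIndex {
		l.sendAppend(from)
	}
}

// onUnreachable moves a replicating follower back to probe.
func (l *leader) onUnreachable(from uint64) {
	if pr, ok := l.prs[from]; ok && pr.State == StateReplicate {
		pr.State = StateProbe
	}
}

// onSnapStatus returns a follower in the snapshot state to probe; on a
// rejected (failed) snapshot it also clears the pending snapshot index.
func (l *leader) onSnapStatus(from uint64, reject bool) {
	pr, ok := l.prs[from]
	if !ok || pr.State != StateSnapshot {
		return
	}
	if reject {
		pr.PendingSnapshot = 0
	}
	pr.State = StateProbe
}

func main() {
	l := &leader{lastIndex: 10, prs: map[uint64]*Progress{
		2: {Match: 10, State: StateReplicate},
		3: {Match: 7, State: StateReplicate},
		4: {Match: 5, State: StateSnapshot, PendingSnapshot: 9},
	}}
	l.onHeartbeatResp(2) // caught up: nothing to send
	l.onHeartbeatResp(3) // behind: sendAppend(3)
	l.onUnreachable(3)   // replicate -> probe
	l.onSnapStatus(4, true)
	fmt.Println(l.sent, l.prs[3].State == StateProbe, l.prs[4].State == StateProbe, l.prs[4].PendingSnapshot)
	// prints: [3] true true 0
}
```

Falling back to probe in each case makes the leader rediscover the follower's log position with small appends before resuming pipelined replication.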