mirror of
https://github.com/etcd-io/etcd.git
synced 2024-09-27 06:25:44 +00:00
commit
c3f32504ec
123
raft/doc.go
@@ -15,6 +15,8 @@
|
|||||||
/*
|
/*
|
||||||
Package raft provides an implementation of the raft consensus algorithm.
|
Package raft provides an implementation of the raft consensus algorithm.
|
||||||
|
|
||||||
|
Usage
|
||||||
|
|
||||||
The primary object in raft is a Node. You either start a Node from scratch
|
The primary object in raft is a Node. You either start a Node from scratch
|
||||||
using raft.StartNode or start a Node from some initial state using raft.RestartNode.
|
using raft.StartNode or start a Node from some initial state using raft.RestartNode.
|
||||||
storage := raft.NewMemoryStorage()
|
storage := raft.NewMemoryStorage()
|
||||||
@@ -22,42 +24,71 @@ using raft.StartNode or start a Node from some initial state using raft.RestartN
|
|||||||
|
|
||||||
Now that you are holding onto a Node you have a few responsibilities:
|
Now that you are holding onto a Node you have a few responsibilities:
|
||||||
|
|
||||||
First, you need to push messages that you receive from other machines into the
|
First, you must read from the Node.Ready() channel and process the updates
|
||||||
Node with n.Step().
|
it contains. These steps may be performed in parallel, except as noted in step
|
||||||
|
2.
|
||||||
|
|
||||||
|
1. Write HardState, Entries, and Snapshot to persistent storage if they are
|
||||||
|
not empty. Note that when writing an Entry with Index i, any
|
||||||
|
previously-persisted entries with Index >= i must be discarded.
|
||||||
|
|
||||||
|
2. Send all Messages to the nodes named in the To field. It is important that
|
||||||
|
no messages be sent until after the latest HardState has been persisted to disk,
|
||||||
|
and all Entries written by any previous Ready batch (Messages may be sent while
|
||||||
|
entries from the same batch are being persisted).
|
||||||
|
|
||||||
|
3. Apply Snapshot (if any) and CommittedEntries to the state machine.
|
||||||
|
If any committed Entry has Type EntryConfChange, call Node.ApplyConfChange()
|
||||||
|
to apply it to the node. The configuration change may be cancelled at this point
|
||||||
|
by setting the NodeID field to zero before calling ApplyConfChange
|
||||||
|
(but ApplyConfChange must be called one way or the other, and the decision to cancel
|
||||||
|
must be based solely on the state machine and not external information such as
|
||||||
|
the observed health of the node).
|
||||||
|
|
||||||
|
4. Call Node.Advance() to signal readiness for the next batch of updates.
|
||||||
|
This may be done at any time after step 1, although all updates must be processed
|
||||||
|
in the order they were returned by Ready.
|
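The rule in step 1 (discard previously-persisted entries with Index >= i before writing an Entry with Index i) can be sketched as follows. This is a minimal illustration only: the Entry type is a simplified stand-in for raftpb.Entry, and appendEntries is a hypothetical helper, not part of the raft package. It assumes the persisted log is contiguous and held in a slice.

```go
package main

import "fmt"

// Entry is a simplified, hypothetical stand-in for raftpb.Entry.
type Entry struct {
	Index uint64
	Term  uint64
}

// appendEntries returns the log that results from persisting ents.
// Any previously-persisted entry whose Index is >= the Index of the
// first new entry is discarded before the new entries are appended.
func appendEntries(log, ents []Entry) []Entry {
	if len(ents) == 0 {
		return log
	}
	first := ents[0].Index
	keep := len(log)
	for i, e := range log {
		if e.Index >= first {
			keep = i
			break
		}
	}
	// The three-index slice forces append to copy, so the caller's
	// original slice is not overwritten.
	return append(log[:keep:keep], ents...)
}

func main() {
	log := []Entry{{Index: 1, Term: 1}, {Index: 2, Term: 1}, {Index: 3, Term: 1}}
	// A new entry at Index 2 replaces the old entries at Index 2 and 3.
	log = appendEntries(log, []Entry{{Index: 2, Term: 2}})
	for _, e := range log {
		fmt.Println(e.Index, e.Term)
	}
}
```

A disk-backed implementation would perform the same truncation on its own storage format; the slice here only makes the rule visible.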
||||||
|
|
||||||
|
Second, all persisted log entries must be made available via an
|
||||||
|
implementation of the Storage interface. The provided MemoryStorage
|
||||||
|
type can be used for this (if you repopulate its state upon a
|
||||||
|
restart), or you can supply your own disk-backed implementation.
|
||||||
|
|
||||||
|
Third, when you receive a message from another node, pass it to Node.Step:
|
||||||
|
|
||||||
func recvRaftRPC(ctx context.Context, m raftpb.Message) {
|
func recvRaftRPC(ctx context.Context, m raftpb.Message) {
|
||||||
n.Step(ctx, m)
|
n.Step(ctx, m)
|
||||||
}
|
}
|
||||||
|
|
||||||
Second, you need to save log entries to storage, process committed log entries
|
Finally, you need to call Node.Tick() at regular intervals (probably
|
||||||
through your application and then send pending messages to peers by reading the
|
via a time.Ticker). Raft has two important timeouts: heartbeat and the
|
||||||
channel returned by n.Ready(). It is important that the user persist any
|
election timeout. However, internally to the raft package time is
|
||||||
entries that require stable storage before sending messages to other peers to
|
represented by an abstract "tick".
|
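Because the package counts time in ticks, wall-clock timeouts have to be translated into tick counts by the caller. The sketch below shows one way to do that arithmetic. The 100ms tick interval and the timeout values are assumptions chosen for illustration, and ticksFor is a hypothetical helper, not part of the raft package.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed values for illustration only.
const (
	tickInterval     = 100 * time.Millisecond // how often Node.Tick() would be called
	heartbeatTimeout = 100 * time.Millisecond
	electionTimeout  = 1 * time.Second
)

// ticksFor converts a wall-clock timeout into a whole number of ticks,
// rounding up so the timeout is never shorter than requested.
// It returns at least 1.
func ticksFor(timeout, interval time.Duration) int {
	if interval <= 0 {
		return 1
	}
	n := int((timeout + interval - 1) / interval)
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println("heartbeat ticks:", ticksFor(heartbeatTimeout, tickInterval))
	fmt.Println("election ticks:", ticksFor(electionTimeout, tickInterval))
}
```

With these assumed values the election timeout is ten times the heartbeat timeout, so a follower would miss several heartbeats before starting an election.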
||||||
ensure fault-tolerance.
|
|
||||||
|
|
||||||
An example MemoryStorage is provided in the raft package.
|
|
||||||
|
|
||||||
And finally you need to service timeouts with Tick(). Raft has two important
|
|
||||||
timeouts: heartbeat and the election timeout. However, internally to the raft
|
|
||||||
package time is represented by an abstract "tick". The user is responsible for
|
|
||||||
calling Tick() on their raft.Node on a regular interval in order to service
|
|
||||||
these timeouts.
|
|
||||||
|
|
||||||
The total state machine handling loop will look something like this:
|
The total state machine handling loop will look something like this:
|
||||||
|
|
||||||
for {
|
for {
|
||||||
select {
|
select {
|
||||||
case <-s.Ticker:
|
case <-s.Ticker:
|
||||||
n.Tick()
|
n.Tick()
|
||||||
case rd := <-s.Node.Ready():
|
case rd := <-s.Node.Ready():
|
||||||
saveToStorage(rd.State, rd.Entries)
|
saveToStorage(rd.State, rd.Entries, rd.Snapshot)
|
||||||
send(rd.Messages)
|
send(rd.Messages)
|
||||||
process(rd.CommittedEntries)
|
if !raft.IsEmptySnap(rd.Snapshot) {
|
||||||
s.Node.Advance()
|
processSnapshot(rd.Snapshot)
|
||||||
case <-s.done:
|
}
|
||||||
return
|
for _, entry := range rd.CommittedEntries {
|
||||||
}
|
process(entry)
|
||||||
}
|
if entry.Type == raftpb.EntryConfChange {
|
||||||
|
var cc raftpb.ConfChange
|
||||||
|
cc.Unmarshal(entry.Data)
|
||||||
|
s.Node.ApplyConfChange(cc)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
s.Node.Advance()
|
||||||
|
case <-s.done:
|
||||||
|
return
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
To propose changes to the state machine from your node take your application
|
To propose changes to the state machine from your node take your application
|
||||||
data, serialize it into a byte slice and call:
|
data, serialize it into a byte slice and call:
|
||||||
@@ -65,21 +96,49 @@ data, serialize it into a byte slice and call:
|
|||||||
n.Propose(ctx, data)
|
n.Propose(ctx, data)
|
||||||
|
|
||||||
If the proposal is committed, data will appear in committed entries with type
|
If the proposal is committed, data will appear in committed entries with type
|
||||||
raftpb.EntryNormal.
|
raftpb.EntryNormal. There is no guarantee that a proposed command will be
|
||||||
|
committed; you may have to re-propose after a timeout.
|
||||||
|
|
||||||
To add or remove a node in a cluster, build a ConfChange struct 'cc' and call:
|
To add or remove a node in a cluster, build a ConfChange struct 'cc' and call:
|
||||||
|
|
||||||
n.ProposeConfChange(ctx, cc)
|
n.ProposeConfChange(ctx, cc)
|
||||||
|
|
||||||
After the config change is committed, a committed entry with type
|
After the config change is committed, a committed entry with type
|
||||||
raftpb.EntryConfChange will be returned. You should apply it to the node through:
|
raftpb.EntryConfChange will be returned. You must apply it to the node through:
|
||||||
|
|
||||||
var cc raftpb.ConfChange
|
var cc raftpb.ConfChange
|
||||||
cc.Unmarshal(data)
|
cc.Unmarshal(data)
|
||||||
n.ApplyConfChange(cc)
|
n.ApplyConfChange(cc)
|
||||||
|
|
||||||
Note: An ID represents a unique node in a cluster. A given ID MUST be used
|
Note: An ID represents a unique node in a cluster for all time. A
|
||||||
only once even if the old node has been removed.
|
given ID MUST be used only once even if the old node has been removed.
|
||||||
|
This means that for example IP addresses make poor node IDs since they
|
||||||
|
may be reused. Node IDs must be non-zero.
|
||||||
|
|
||||||
|
Implementation notes
|
||||||
|
|
||||||
|
This implementation is up to date with the final Raft thesis
|
||||||
|
(https://ramcloud.stanford.edu/~ongaro/thesis.pdf), although our
|
||||||
|
implementation of the membership change protocol differs somewhat from
|
||||||
|
that described in chapter 4. The key invariant that membership changes
|
||||||
|
happen one node at a time is preserved, but in our implementation the
|
||||||
|
membership change takes effect when its entry is applied, not when it
|
||||||
|
is added to the log (so the entry is committed under the old
|
||||||
|
membership instead of the new). This is equivalent in terms of safety,
|
||||||
|
since the old and new configurations are guaranteed to overlap.
|
||||||
|
|
||||||
|
To ensure that we do not attempt to commit two membership changes at
|
||||||
|
once by matching log positions (which would be unsafe since they
|
||||||
|
should have different quorum requirements), we simply disallow any
|
||||||
|
proposed membership change while any uncommitted change appears in
|
||||||
|
the leader's log.
|
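The restriction described above (refuse a new membership change while an uncommitted one is still in the leader's log) can be illustrated with a small model. This is a sketch only, not the package's actual bookkeeping: leaderLog, logEntry, propose, and commitTo are hypothetical names, and the log is modeled as a plain slice with indexes starting at 1.

```go
package main

import "fmt"

// logEntry is a hypothetical, simplified log entry.
type logEntry struct {
	Index      uint64
	ConfChange bool
}

// leaderLog models the leader's log and its commit index.
type leaderLog struct {
	entries   []logEntry
	committed uint64 // index of the highest committed entry
}

// hasUncommittedConfChange reports whether any membership change
// sits above the commit index.
func (l *leaderLog) hasUncommittedConfChange() bool {
	for _, e := range l.entries {
		if e.Index > l.committed && e.ConfChange {
			return true
		}
	}
	return false
}

// propose appends an entry and returns true, unless it is a membership
// change and another membership change is still uncommitted.
func (l *leaderLog) propose(confChange bool) bool {
	if confChange && l.hasUncommittedConfChange() {
		return false
	}
	next := uint64(len(l.entries)) + 1
	l.entries = append(l.entries, logEntry{Index: next, ConfChange: confChange})
	return true
}

// commitTo advances the commit index, never past the end of the log.
func (l *leaderLog) commitTo(index uint64) {
	if index > l.committed && index <= uint64(len(l.entries)) {
		l.committed = index
	}
}

func main() {
	l := &leaderLog{}
	fmt.Println(l.propose(true)) // first membership change is accepted
	fmt.Println(l.propose(true)) // second is refused while the first is uncommitted
	l.commitTo(1)
	fmt.Println(l.propose(true)) // accepted once the first has committed
}
```

Normal entries are never refused in this model; only a second concurrent membership change is.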
||||||
|
|
||||||
|
This approach introduces a problem when you try to remove a member
|
||||||
|
from a two-member cluster: If one of the members dies before the
|
||||||
|
other one receives the commit of the confchange entry, then the member
|
||||||
|
cannot be removed any more since the cluster cannot make progress.
|
||||||
|
For this reason it is highly recommended to use three or more nodes in
|
||||||
|
every cluster.
|
||||||
|
|
||||||
*/
|
*/
|
||||||
package raft
|
package raft
|
||||||
|