Currently the only way we know that a peer isn't getting a heartbeat is
an edge-triggered event from go-raft on every missed heartbeat. This
means that we need to do some bookkeeping in order to do exponential
backoff.
The upside is that instead of screaming thousands of log lines before a
machine hits the default removal delay of 30 minutes, we log only ~100.
Changed the LeaderInfo struct "start time" field from "startTime" to "StartTime" so that it is an exported identifier. This required adding the `json:"startTime"` struct field tag so that the encoding/json package still encodes the field under the original property name (startTime).
A peer might be removed during a network partition. When it comes back
it will not have received any of the log entries that would have
notified it of its removal, and it will go on to propose a vote. This
disrupts the cluster, so the cluster should give the machine feedback
that it is no longer a member.
The term of a denied vote is MaxUint64. The notification of the removal
is a raft event. These two modifications are quick hacks.
In reaction to this notification the machine should shut down. In this
case the shutdown just moves it towards becoming a standby server.
Change log:
1. PeerServer
- estimate initial mode from its log through removedInLog variable
- refactor FindCluster to return the estimation
- refactor Start to call FindCluster explicitly
- move raftServer start and cluster init from FindCluster to Start
- remove stopNotify from PeerServer because it is not used anymore
2. Etcd
- refactor Run logic to fit the specification
3. ClusterConfig
- rename promoteDelay to removeDelay for better naming
- add SyncClusterInterval field to ClusterConfig
- commit command to set default cluster config when cluster is created
- store cluster config info into key space for consistency
- reload cluster config on reboot
4. add StandbyServer
5. Error
- remove unused EcodePromoteError
In this design the peer server is started and stopped repeatedly.
This step ensures that stopping it doesn't affect the next start.
The patch stops its goroutines and removes its timer triggers.
- don't close ready channel until PeerServer is listening.
  This avoids a possible panic in Stop() if PeerServer is nil.
- avoid data race in Run() (err variable was shared between 2 goroutines)
- avoid data race in PeerServer Start/Stop (PeerServer.closeChan)
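One way to make Start/Stop both repeatable and race-free is sketched
below. This is an assumption about the general approach, not the code
from these patches: the `server` type, the mutex and the WaitGroup are
illustrative, and only the closeChan name is borrowed from the change
above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// server is a hypothetical stand-in for a restartable peer server.
type server struct {
	mu        sync.Mutex    // guards closeChan across Start and Stop
	closeChan chan struct{} // non-nil while the server is running
	wg        sync.WaitGroup
}

// Start launches the background goroutine with a fresh close channel.
func (s *server) Start() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closeChan != nil {
		return // already running
	}
	s.closeChan = make(chan struct{})
	s.wg.Add(1)
	// Pass the channel as an argument so the goroutine never reads
	// the shared field.
	go s.loop(s.closeChan)
}

func (s *server) loop(closeC <-chan struct{}) {
	defer s.wg.Done()
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop() // remove the timer trigger on exit
	for {
		select {
		case <-closeC:
			return
		case <-ticker.C:
			// periodic work would go here
		}
	}
}

// Stop signals the goroutine, waits for it to exit, and clears the
// state so that the next Start begins from scratch.
func (s *server) Stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closeChan == nil {
		return // not running
	}
	close(s.closeChan)
	s.wg.Wait()
	s.closeChan = nil
}

// Running reports whether the server is currently started.
func (s *server) Running() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.closeChan != nil
}

func main() {
	s := &server{}
	for i := 0; i < 3; i++ {
		s.Start()
		s.Stop()
	}
	fmt.Println("running after three cycles:", s.Running())
}
```

Because Stop waits for the goroutine before returning, nothing from one
run can leak into the next, and calling Stop on a server that was never
started is a no-op.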
1. We use a PUT request to do a V2 join, so we should redirect a PUT request rather than a POST.
2. /admin only accepts V2Join requests. Send out V2Join instead of V1Join.