119 Commits

Author SHA1 Message Date
Blake Mizerany
d77773acb3 server: ignore server in build/tests 2014-09-03 09:20:06 -07:00
Yicheng Qin
c00594e680 server: fix timer leak 2014-07-21 14:23:52 -07:00
Yicheng Qin
f387bf8464 chore(peer_server): improve log for auto removal 2014-06-12 10:02:56 -07:00
Brandon Philips
16c2bcf951 chore(server): go fmt
blame me for not running test first.
2014-06-07 18:03:22 -07:00
Brandon Philips
1c958f8fc3 fix(server): reduce the screaming heartbeat logs
Currently the only way we know that a peer isn't getting a heartbeat is
an edge triggered event from go raft on every missed heartbeat. This
means that we need to do some book keeping in order to do exponential
backoff.

The upside is that instead of screaming thousands of log lines before a
machine hits the default removal of 30 minutes it is only ~100.
2014-06-07 17:47:10 -07:00
Yicheng Qin
fbcfe8e1c4 Merge pull request #807 from Shopify/raft-server-stats-struct-field-tag-fix
style(server): changed a LeaderInfo struct field from "startTime" to "StartTime"
2014-06-05 12:45:34 -07:00
Yicheng Qin
d7768635fd fix(server): set joinIndex when recovered 2014-05-31 10:03:39 -07:00
marc.barry
673d90728e style(server): changed a LeaderInfo struct field from "startTime" to "StartTime"
Changed the LeaderInfo struct "start time" field from "startTime" to "StartTime" so that it is an exported identifier. This required adding the `json:"startTime"` structure field tag so that the encoding/json package correctly performs JSON encoding (i.e. the correct property name --> startTime).
2014-05-21 11:19:56 -04:00
Brandon Philips
62560f9959 fix(server): add user facing remove API
This was accidently removed as we refactored the standy stuff. Re-add this
user facing remove endpoint that matches the config endpoints.
2014-05-20 20:01:10 -07:00
Xiang Li
1e7a7b11dd Merge pull request #799 from xiangli-cmu/deny_unknow_peer
hack(server): notify removed peers when they try to become candidates
2014-05-20 13:37:14 -07:00
Yicheng Qin
934c28d498 fix(peer_server): set store and registry when setting raft server
New raft server needs new store and registry.
2014-05-20 13:12:12 -07:00
Xiang Li
189fece683 hack(server): notify removed peers when they try to become candidates
A peer might be removed during a network partiton. When it comes back it
will not have received any of the log entries that would have notified
it of its removal and go onto propose a vote. This will disrupt the
cluster and the cluster should give the machine feedback that it is no
longer a member.

The term of a denied vote is MaxUint64. The notification of the removal
is a raft event. These two modification are quick heck.

In reaction to this notification the machine should shutdown. In this
case the shutdown just moves it towards becoming a standby server.
2014-05-20 10:17:32 -07:00
Yicheng Qin
c0027bfc78 feat(cluster_config): change field from int to float64
This is modified for better flexibility, especially for testing.
2014-05-12 22:42:18 -04:00
Yicheng Qin
5367c1c998 chore(standby): minor changes based on comments 2014-05-09 15:38:03 -07:00
Yicheng Qin
c6b1a738c3 feat(option): add cluster config option
It will be used when creating a brand-new cluster.
2014-05-09 15:22:11 -07:00
Yicheng Qin
6d4f018887 chore(cluster_config): rename SyncClusterInterval to SyncInterval
for better naming
2014-05-09 13:28:21 -07:00
Yicheng Qin
765cd5d8b3 refactor(find_cluster): make it simpler 2014-05-09 02:27:04 -07:00
Yicheng Qin
baadf63912 feat: implement standby mode
Change log:
1. PeerServer
- estimate initial mode from its log through removedInLog variable
- refactor FindCluster to return the estimation
- refactor Start to call FindCluster explicitly
- move raftServer start and cluster init from FindCluster to Start
- remove stopNotify from PeerServer because it is not used anymore
2. Etcd
- refactor Run logic to fit the specification
3. ClusterConfig
- rename promoteDelay to removeDelay for better naming
- add SyncClusterInterval field to ClusterConfig
- commit command to set default cluster config when cluster is created
- store cluster config info into key space for consistency
- reload cluster config when reboot
4. add StandbyServer
5. Error
- remove unused EcodePromoteError
2014-05-09 01:56:55 -07:00
Yicheng Qin
f1c13e2d9d Merge pull request #774 from unihorn/83
feat(join): check cluster conditions before join
2014-05-08 14:08:38 -07:00
Yicheng Qin
6c950eaf97 Merge pull request #772 from unihorn/81
feat(peer_server): stop service when removed
2014-05-08 14:02:09 -07:00
Yicheng Qin
5c7a963cf0 chore(peer_server): adjust code to make it more clear 2014-05-08 13:20:46 -07:00
Yicheng Qin
c92231c91a Merge branch 'master' of github.com:coreos/etcd
Conflicts:
	server/peer_server_handlers.go
2014-05-08 13:17:51 -07:00
Yicheng Qin
e960a0e03c chore(client): minor changes based on comments
The changes are made on error handling, comments and constant.
2014-05-08 13:15:10 -07:00
Yicheng Qin
0558b546ff fix(registry): fetch peers from store instead of cache
The current cache implmentation may contain removed machines, so we
fetch peers from store for correctness.
2014-05-08 08:44:32 -07:00
Yicheng Qin
5465201292 chore(peer_server): more explanation for asyncRemove 2014-05-07 16:31:17 -07:00
Yicheng Qin
c9ce14c857 chore(peer_server): set client transporter separately
It also moves the hack on timeout from raft transporter to
client transporter.
2014-05-07 13:26:05 -07:00
Yicheng Qin
bed20b7837 chore(peer_server): add more function description 2014-05-07 12:51:41 -07:00
Yicheng Qin
206881bfec fix(peer_server): check running status before start/stop
This makes peer server more robust.
2014-05-07 12:44:48 -07:00
Yicheng Qin
001b1fcd46 feat(join): check cluster conditions before join 2014-05-07 11:46:21 -07:00
Yicheng Qin
4e14604e5c refactor(server): add Client struct
This is used to send request to web API.
It will do this behavior a lot in standby mode, so I abstract this
struct first.
2014-05-07 11:46:15 -07:00
Yicheng Qin
ba36a16bc5 feat(peer_server): stop service when removed
It doesn't modify the exit logic, but makes external code know
when removal happens and be able to determine what it should do.
2014-05-07 10:00:27 -07:00
Yicheng Qin
997e7d3bf4 Merge pull request #771 from unihorn/80
refactor(peer_server): remove standby mode in peer server
2014-05-07 09:57:02 -07:00
Yicheng Qin
17e299995c refactor(peer_server): remove standby mode in peer server 2014-05-07 09:10:09 -07:00
Yicheng Qin
d78116c35b Merge pull request #675 from unihorn/56
fix(peer_server): exit all server goroutines in Stop()
2014-05-07 08:09:14 -07:00
Yicheng Qin
6516cf854c chore(server): rename daemon to startRoutine
For better understanding.
2014-05-07 07:51:44 -07:00
Yicheng Qin
e55512f60b fix(peer_server): graceful stop for peer server run
Peer server will be started and stopped repeatedly in the design.
This step ensures its stop doesn't affect the next start.
The patch includes goroutine stop and timer trigger remove.
2014-05-07 07:43:27 -07:00
Yicheng Qin
0c95e1eabb feat(peer_server): forbid rejoining with different name
Or it will confuse the cluster, especially the heartbeat between nodes.
2014-04-17 15:46:33 -07:00
Yicheng Qin
82dee82bfd chore: gofmt go files 2014-04-17 08:47:48 -07:00
Yicheng Qin
67600603c5 chore: rename proxy mode to standby mode
It makes the name more reasonable.
2014-04-17 08:04:42 -07:00
Yicheng Qin
65b872c8b5 Merge pull request #725 from dougm/server-lifecycle-fixes
fix(server): avoid race conditions in Run/Stop
2014-04-15 11:54:35 -07:00
Yicheng Qin
adf4acf947 chore: gofmt go files 2014-04-15 09:42:25 -07:00
Doug MacEachern
d73390a674 fix(server): avoid race conditions in Run/Stop
- don't close ready channel until PeerServer is listening.
  avoids possible panic in Stop() if PeerServer is nil.

- avoid data race in Run() (err variable was shared between 2 goroutines)

- avoid data race in PeerServer Start/Stop (PeerServer.closeChan)
2014-04-15 09:24:54 -07:00
Yicheng Qin
8bcfb2ecaf Merge pull request #707 from unihorn/62
fix(peer_server): recover from outage with discovery
2014-04-14 13:58:43 -07:00
Yicheng Qin
03839ca806 fix(peer_server): recover from outage with discovery
This patch also contains the refactor of find cluster process.
It is changed based on @xiangli-cmu 's commits in 627 issue.
2014-04-14 13:56:47 -07:00
Xiang Li
0b790abd46 Merge pull request #705 from unihorn/61
feat: set NOCOW for log directory when in btrfs
2014-04-14 16:40:38 -04:00
Xiang Li
4fd9e627c0 fix(peer join) fix wrong join command redirection
1. We use PUT request to do a V2 join. So we should redirect a PUT request rather than a POST.
2. /admin only accept V2Join request. Send out V2Join instead of V1Join.
2014-04-13 21:33:02 -04:00
Yicheng Qin
56ef6fbcae make necessary changes 2014-04-11 17:00:14 -07:00
Yicheng Qin
79a89dcb82 Revert "Revert "fix(server): only set NOCOW for log file""
This reverts commit 9540575690db92a362f66d9aa4f2671265b87eb1.

Conflicts:
	etcd/etcd.go
2014-04-11 16:33:50 -07:00
Yicheng Qin
9540575690 Revert "fix(server): only set NOCOW for log file"
This reverts commit 1eff547af61c2453b106f65691f928ddf8088a6b.
2014-04-09 14:39:16 -07:00
Yicheng Qin
1eff547af6 fix(server): only set NOCOW for log file 2014-04-09 12:35:32 -07:00