94 Commits

Author SHA1 Message Date
Gyuho Lee
c6e3401255 etcdserver: make raft log configured by top level logger
To make it consistent

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2019-07-29 15:43:19 -07:00
Tobias Schottdorf
b9c051e7a7 raftpb: clean up naming in ConfChange 2019-07-23 10:40:03 +02:00
Tobias Schottdorf
eb4d9b640a etcdserver: fix createConfChangeEnts
It created a sequence of conf changes that could intermittently cause an
empty set of voters, which Raft asserts against as of #10889.

This fixes TestCtlV2BackupSnapshot and TestCtlV2BackupV3Snapshot, see:
https://github.com/etcd-io/etcd/issues/10700#issuecomment-512358126
2019-07-19 17:13:08 +02:00
Tobias Schottdorf
c9491d7861 raft: clean up bootstrap
This is the first (maybe not last) step in cleaning up the bootstrap
code around StartNode.

Initializing a Raft group for the first time is awkward, since a
configuration has to be pulled from thin air. The way this is solved
today is unclean: The app is supposed to pass peers to StartNode(),
we add configuration changes for them to the log, immediately pretend
that they are applied, but actually leave them unapplied (to give the
app a chance to observe them, though if the app did decide to not apply
them things would really go off the rails), and then return control to
the app. The app will then process the initial Readys and as a result
the configuration will be persisted to disk; restarts of the node then
use RestartNode which doesn't take any peers.

The code that did this lived awkwardly in two places fairly deep down
the callstack, though it was really only necessary in StartNode(). This
commit refactors things to make this more obvious: only StartNode does
this dance now. In particular, RawNode does not support this at all any
more; it expects the app to set up its Storage correctly.

Future work may provide helpers to make this "preseeding" of the Storage
more user-friendly. It isn't entirely straightforward to do so since
the Storage interface doesn't provide the right accessors for this
purpose. Briefly speaking, we want to make sure that a non-bootstrapped
node can never catch up via the log so that we can implicitly use one
of the "skipped" log entries to represent the configuration change into
the bootstrap configuration. This is an invasive change that affects
all consumers of raft, and it is of lower urgency since the code (post
this commit) already encapsulates the complexity sufficiently.
2019-07-19 10:02:02 +02:00
Tobias Schottdorf
f9c2d00fb3 raft: extract 'tracker' package
Mechanically extract `progressTracker`, `Progress`, and `inflights`
to their own package named `tracker`. Add lots of comments in the
progress, and take the opportunity to rename and clarify various
fields.
2019-06-21 22:15:00 +02:00
Gyuho Lee
34bd797e67 *: revert module import paths
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2019-05-28 15:39:35 -07:00
shivaramr
9150bf52d6 go modules: Fix module path version to include version number 2019-04-26 15:29:50 -07:00
Gyuho Lee
877f11bed8 etcdserver: improve heartbeat send failures logging
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2019-04-19 10:58:17 -07:00
James Shubin
368f70a37c etcdserver: Use panic instead of fatal on no space left error
When using the embed package to embed etcd, sometimes the storage prefix
being used might be full. In this case, this code path triggers, causing
an: `etcdserver: create wal error: no space left on device` error, which
causes a fatal. A fatal differs from a panic in that it also calls
os.Exit(1). In this situation, the calling program that embeds the etcd
server will be abruptly killed, which prevents it from cleaning up
safely, and giving a proper error message. Depending on what the calling
program is, this can cause corruption and data loss.

This patch switches the fatal to a panic. Ideally this would be a
regular error which would get propagated upwards to the StartEtcd
command, but in the meantime at least this can be caught with recover().

This fixes the most common fatal that I've experienced, but there are
surely more that need looking into. If possible, the errors should be
threaded down into the code path so that embedding etcd can be more
robust.

Fixes: https://github.com/etcd-io/etcd/issues/10588
2019-03-27 15:24:33 -04:00
Gyuho Lee
8d1a62e7ef *: use default log configuration for server
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2019-02-21 10:57:26 -08:00
Gyuho Lee
1399bc69ce etcdserver: update import paths to "go.etcd.io/etcd"
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-08-28 17:47:55 -07:00
Gyuho Lee
a1aade8c1b etcdserver: rename to "heartbeat_send_failures_total"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-05-23 13:11:08 -07:00
Gyuho Lee
896a5e4a2b etcdserver: add "etcd_server_heartbeat_failures_total"
{"level":"warn","ts":1527101858.4149103,"caller":"etcdserver/raft.go:370","msg":"failed to send out heartbeat; took too long, server is overloaded likely from slow disk","heartbeat-interval":0.1,"expected-duration":0.2,"exceeded-duration":0.025771662}
{"level":"warn","ts":1527101858.4149644,"caller":"etcdserver/raft.go:370","msg":"failed to send out heartbeat; took too long, server is overloaded likely from slow disk","heartbeat-interval":0.1,"expected-duration":0.2,"exceeded-duration":0.034015766}

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-05-23 13:09:42 -07:00
Gyuho Lee
7940113906 *: move internal "etcdserver/api/rafthttp"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-05-21 10:31:16 -07:00
Gyuho Lee
9149565cb3 *: move to "etcdserver/api/membership"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-05-21 10:31:16 -07:00
Gyuho Lee
955fd99bc9
Merge pull request #9746 from gyuho/raft-logger
etcdserver: set default Raft logger with zap.Logger
2018-05-18 16:32:48 -07:00
Gyuho Lee
58ae15bd29 etcdserver: set default Raft logger with zap.Logger
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-05-18 15:38:39 -07:00
Gyuho Lee
49d672ff9b etcdserver: rename "SnapshotCount", add "SnapshotCatchUpEntries"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-05-18 14:37:50 -07:00
Gyuho Lee
3ea7a5d0bd etcdserver: add "LoggerCore" field for Raft logger
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-04-25 10:16:54 -07:00
Maciej Borsz
46bc966aa7 etcdserver: add is_leader prometheus metric that is 1 on the leader.
Before this change, we had now way to find a leader using /metrics
endpoint. This commit adds a metric to do that.
2018-04-19 11:47:40 +02:00
Gyuho Lee
d0847f4f25 *: clean up/fix server structured logs
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-04-18 12:54:43 -07:00
Gyuho Lee
bdbed26f64 etcdserver: support structured logging
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-04-16 17:36:00 -07:00
Gyuho Lee
041b9069a2 *: configure server logger
- Add/Document "logger" to support structured logging.
  - This makes functional tests run easier, since zap logger
    provides built-in log redirect to files.
  - "etcd --logger-option=zap" to enable structured logging.
- Current "capnslog" will still be used as "default".
  - We may switch the default or deprecate "capnslog" in v3.5.
  - Either way, will clearly be documented.

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-04-16 17:36:00 -07:00
Gyuho Lee
4f754c1850 etcdserver: clean up with "RaftStatusGetter"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-03-15 19:30:08 -04:00
Gyuho Lee
9680b8a157 etcdserver: adjust election ticks on restart
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-03-10 19:09:38 -08:00
Gyuho Lee
edec229e10 etcdserver: make "advanceTicks" method
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-03-10 18:50:50 -08:00
Gyuho Lee
78918848bd etcdserver: support Raft Pre-Vote
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-03-06 09:55:55 -08:00
Gyuho Lee
69357adf33 etcdserver: enable "CheckQuorum" when starting with "ForceNewCluster"
We enable "raft.Config.CheckQuorum" by default in other
Raft initial starts. So should start with "ForceNewCluster".

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-02-23 00:26:42 -08:00
dvonthenen
25cdf4ed92 *: expose Raft Applied Index through to "etcdctl endpoint status"
Fixed based on feedback

Fixed spacing

Fix gofmt
2018-01-22 07:37:21 -08:00
Gyu-Ho Lee
75110dd839 *: fix naked returns
Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
2017-11-10 18:46:15 -08:00
Anthony Romano
dcf52bbfac etcdserver, embed, integration: don't use pointer for ServerConfig
ServerConfig is owned by etdcserver and unshared, so don't pass or store by
pointer. Also removes duplicated field 'snapCount'.
2017-06-15 13:02:13 -07:00
fanmin shi
8b7b7222dd etcdserver: renaming db happens after snapshot persists to wal and snap files
In the case that follower recieves a snapshot from leader
and crashes before renaming xxx.snap.db to db but after
snapshot has persisted to .wal and .snap, restarting
follower results loading old db, new .wal, and new .snap.
This will causes a index mismatch between snap metadata index
and consistent index from db.

This pr forces an ordering where saving/renaming db must
happen after snapshot is persisted to wal and snap file.
this guarantees wal and snap files are newer than db.
on server restart, etcd server checks if snap index > db consistent index.
if yes, etcd server attempts to load xxx.snap.db where xxx=snap index
if there is any and panic other wise.

FIXES #7628
2017-05-09 14:00:12 -07:00
Gyu-Ho Lee
327f09fcb4 etcdserver: do not block on raft stopping
Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
2017-04-25 13:35:43 -07:00
Gyu-Ho Lee
91f6aee4f2 etcdserver: ensure waitForApply sync with applyAll
Problem is:

`Step1`: `etcdserver/raft.go`'s `Ready` process routine sends config-change entries via `r.applyc <- ap` (https://github.com/coreos/etcd/blob/master/etcdserver/raft.go#L193-L203)

`Step2`: `etcdserver/server.go`'s `*EtcdServer.run` routine receives this via `ap := <-s.r.apply()` (https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L735-L738)

`StepA`: `Step1` proceeds without sync, right after sending `r.applyc <- ap`.

`StepB`: `Step2` proceeds without sync, right after `sched.Schedule(s.applyAll(&ep,&ap))`.

`StepC`: `etcdserver` tries to sync with `s.applyAll(&ep,&ap)` by calling `rh.waitForApply()`.

`rh.waitForApply()` waits for all pending jobs to finish in `pkg/schedule`
side. However, the order of `StepA`,`StepB`,`StepC` is not guaranteed. It
is possible that `StepC` happens first, and proceeds without waiting on
apply. And the restarting member comes back as a leader in single-node
cluster, when there is no synchronization between apply-layer and
config-change Raft entry apply. Confirmed with more debugging lines below,
only reproducible with slow CPU VM (~2 vCPU).

```
~:24.005397 I | etcdserver: starting server... [version: 3.2.0+git, cluster version: to_be_decided]
~:24.011136 I | etcdserver: [DEBUG] 29b2d24047a277df waitForApply before
~:24.011194 I | etcdserver: [DEBUG] 29b2d24047a277df starts wait for 0 pending jobs
~:24.011234 I | etcdserver: [DEBUG] 29b2d24047a277df finished wait for 0 pending jobs (current pending 0)
~:24.011268 I | etcdserver: [DEBUG] 29b2d24047a277df waitForApply after
~:24.011348 I | etcdserver: [DEBUG] [0] 29b2d24047a277df is scheduling conf change on 29b2d24047a277df
~:24.011396 I | etcdserver: [DEBUG] [1] 29b2d24047a277df is scheduling conf change on 5edf80e32a334cf0
~:24.011437 I | etcdserver: [DEBUG] [2] 29b2d24047a277df is scheduling conf change on e32e31e76c8d2678
~:24.011477 I | etcdserver: [DEBUG] 29b2d24047a277df scheduled conf change on 29b2d24047a277df
~:24.011509 I | etcdserver: [DEBUG] 29b2d24047a277df scheduled conf change on 5edf80e32a334cf0
~:24.011545 I | etcdserver: [DEBUG] 29b2d24047a277df scheduled conf change on e32e31e76c8d2678
~:24.012500 I | etcdserver: [DEBUG] 29b2d24047a277df applyConfChange on 29b2d24047a277df before
~:24.013014 I | etcdserver/membership: added member 29b2d24047a277df [unix://127.0.0.1:2100515039] to cluster 9250d4ae34216949
~:24.013066 I | etcdserver: [DEBUG] 29b2d24047a277df applyConfChange on 29b2d24047a277df after
~:24.013113 I | etcdserver: [DEBUG] 29b2d24047a277df applyConfChange on 29b2d24047a277df after trigger
~:24.013158 I | etcdserver: [DEBUG] 29b2d24047a277df applyConfChange on 5edf80e32a334cf0 before
~:24.013666 W | etcdserver: failed to send out heartbeat on time (exceeded the 10ms timeout for 11.964739ms)
~:24.013709 W | etcdserver: server is likely overloaded
~:24.013750 W | etcdserver: failed to send out heartbeat on time (exceeded the 10ms timeout for 12.057265ms)
~:24.013775 W | etcdserver: server is likely overloaded
~:24.013950 I | raft: 29b2d24047a277df is starting a new election at term 4
~:24.014012 I | raft: 29b2d24047a277df became candidate at term 5
~:24.014051 I | raft: 29b2d24047a277df received MsgVoteResp from 29b2d24047a277df at term 5
~:24.014107 I | raft: 29b2d24047a277df became leader at term 5
~:24.014146 I | raft: raft.node: 29b2d24047a277df elected leader 29b2d24047a277df at term 5
```

I am printing out the number of pending jobs before we call
`sched.WaitFinish(0)`, and there was no pending jobs, so it returned
immediately (before we schedule `applyAll`).

This is the root cause to:

- https://github.com/coreos/etcd/issues/7595
- https://github.com/coreos/etcd/issues/7739
- https://github.com/coreos/etcd/issues/7802

`sched.WaitFinish(0)` doesn't work when `len(f.pendings)==0` and
`f.finished==0`. Config-change is the first job to apply, so
`f.finished` is 0 in this case.

`f.finished` monotonically increases, so we need `WaitFinish(finished+1)`.
And `finished` must be the one before calling `Schedule`. This is safe
because `Schedule(applyAll)` is the only place adding jobs to `sched`.
Then scheduler waits on the single job of `applyAll`, by getting the
current number of finished jobs before sending `Schedule`.

Or just make it be blocked until `applyAll` routine triggers on the
config-change job. This patch just removes `waitForApply`, and
signal `raftDone` to wait until `applyAll` finishes applying entries.

Confirmed that it fixes the issue, as below:

```
~:43.198354 I | rafthttp: started streaming with peer 36cda5222aba364b (stream MsgApp v2 reader)
~:43.198740 I | etcdserver: [DEBUG] 3988bc20c2b2e40c waitForApply before
~:43.198836 I | etcdserver: [DEBUG] 3988bc20c2b2e40c starts wait for 0 pending jobs, 1 finished jobs
~:43.200696 I | integration: launched 3169361310155633349 ()
~:43.201784 I | etcdserver: [DEBUG] [0] 3988bc20c2b2e40c is scheduling conf change on 36cda5222aba364b
~:43.201884 I | etcdserver: [DEBUG] [1] 3988bc20c2b2e40c is scheduling conf change on 3988bc20c2b2e40c
~:43.201965 I | etcdserver: [DEBUG] [2] 3988bc20c2b2e40c is scheduling conf change on cf5d6cbc2a121727
~:43.202070 I | etcdserver: [DEBUG] 3988bc20c2b2e40c scheduled conf change on 36cda5222aba364b
~:43.202139 I | etcdserver: [DEBUG] 3988bc20c2b2e40c scheduled conf change on 3988bc20c2b2e40c
~:43.202204 I | etcdserver: [DEBUG] 3988bc20c2b2e40c scheduled conf change on cf5d6cbc2a121727
~:43.202444 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 36cda5222aba364b (request ID: 0) before
~:43.204486 I | etcdserver/membership: added member 36cda5222aba364b [unix://127.0.0.1:2100913646] to cluster 425d73f1b7b01674
~:43.204588 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 36cda5222aba364b (request ID: 0) after
~:43.204703 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 36cda5222aba364b (request ID: 0) after trigger
~:43.204791 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 3988bc20c2b2e40c (request ID: 0) before
~:43.205689 I | etcdserver/membership: added member 3988bc20c2b2e40c [unix://127.0.0.1:2101113646] to cluster 425d73f1b7b01674
~:43.205783 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 3988bc20c2b2e40c (request ID: 0) after
~:43.205929 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 3988bc20c2b2e40c (request ID: 0) after trigger
~:43.206056 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on cf5d6cbc2a121727 (request ID: 0) before
~:43.207353 I | etcdserver/membership: added member cf5d6cbc2a121727 [unix://127.0.0.1:2100713646] to cluster 425d73f1b7b01674
~:43.207516 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on cf5d6cbc2a121727 (request ID: 0) after
~:43.207619 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on cf5d6cbc2a121727 (request ID: 0) after trigger
~:43.207710 I | etcdserver: [DEBUG] 3988bc20c2b2e40c finished scheduled conf change on 36cda5222aba364b
~:43.207781 I | etcdserver: [DEBUG] 3988bc20c2b2e40c finished scheduled conf change on 3988bc20c2b2e40c
~:43.207843 I | etcdserver: [DEBUG] 3988bc20c2b2e40c finished scheduled conf change on cf5d6cbc2a121727
~:43.207951 I | etcdserver: [DEBUG] 3988bc20c2b2e40c finished wait for 0 pending jobs (current pending 0, finished 1)
~:43.208029 I | rafthttp: started HTTP pipelining with peer cf5d6cbc2a121727
~:43.210339 I | rafthttp: peer 3988bc20c2b2e40c became active
~:43.210435 I | rafthttp: established a TCP streaming connection with peer 3988bc20c2b2e40c (stream MsgApp v2 reader)
~:43.210861 I | rafthttp: started streaming with peer 3988bc20c2b2e40c (writer)
~:43.211732 I | etcdserver: [DEBUG] 3988bc20c2b2e40c waitForApply after
```

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
2017-04-25 10:22:27 -07:00
Anthony Romano
714b48a4b4 etcdserver: initialize raftNode with constructor
raftNode was being initialized in start(), which was causing
hangs when trying to stop the etcd server since the stop channel
would not be initialized in time for the stop call. Instead,
setup non-configurable bits in a constructor.

Fixes #7668
2017-04-18 09:33:59 -07:00
Gyu-Ho Lee
04354f32ab etcdserver: wait apply on conf change Raft entry
When apply-layer sees configuration change entry in
raft.Ready.CommittedEntries, the server should not proceed
until that entry is applied. Otherwise, follower's raft
layer advances, possibly election-timeouts, and becomes
the leader in single-node cluster, before add-node conf
change of other nodes is applied.

Fix https://github.com/coreos/etcd/issues/7595.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
2017-04-13 15:59:24 -07:00
Xiang
7f0733cf46 etcdserver: candidate should wait for applying all configuration changes 2017-03-14 17:20:20 -07:00
Gyu-Ho Lee
3d75395875 *: remove never-unused vars, minor lint fix
Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
2017-03-06 14:59:12 -08:00
fanmin shi
2a1bae0c2a etcdserver: consistent naming in raftReadyHandler 2016-12-29 11:27:16 -08:00
fanmin shi
2faf72f47c etcdserver: rework update committed index logic 2016-12-27 10:11:40 -08:00
Gyu-Ho Lee
3fd1d951f8
etcdserver: time out when readStateC is blocking
Otherwise, it will block forever when the server is overloaded.

Fix https://github.com/coreos/etcd/issues/6891.
2016-12-05 15:34:46 -08:00
Gyu-Ho Lee
6ec03d3f7c etcdserver: move 'EtcdServer.send' to raft.go
Clear 'TODO'
2016-10-26 16:26:00 -07:00
Gyu-Ho Lee
e011ea25ca etcdserver: separate EtcdServer from raftNode 2016-10-07 13:18:39 -07:00
Xiang Li
0f0c048e29 etcdserver: fix early lessor promotion issue
If we promote the lessor before finish applying all
entries from the last term, we might incorrectly renew
the already revoked leases.

Here is an example:

- Term 1: revoke lease A accepted by raft
- Old leader failed, new election happened
- Term 2: promote
- Term 2: keep alive A succeed. A now has 10 seconds TTL
- Term 2: revoke lease A from Term 1 got committed and applied
- Term 2: the lease A with 10 seconds TTL is revoked

To solve this, the new leader MUST apply all entries from old term
before promote its lessor to start accept renew requests.
2016-10-05 14:41:47 -07:00
Xiang Li
e3e3993022 etcdserver: support read index
Use read index to achieve l-read.
2016-09-27 13:41:40 +08:00
Anthony Romano
de68818f03 etcdserver: add some failpoints 2016-06-21 14:43:20 -07:00
Xiang Li
9c78cda088 etcdserver: save state before save snapshot 2016-06-15 22:00:33 -07:00
Gyu-Ho Lee
32d766d749 etcdserver: preallocate slice 2016-06-15 13:03:10 -07:00
Gyu-Ho Lee
abb4cd5646 etcdserver: update LICENSE header 2016-05-12 20:49:40 -07:00
Xiang Li
9c103dd0de *: cancel required leader streams when memeber lost its leader 2016-05-12 19:42:21 -07:00