This is the first (maybe not last) step in cleaning up the bootstrap
code around StartNode.
Initializing a Raft group for the first time is awkward, since a
configuration has to be pulled from thin air. The way this is solved
today is unclean: The app is supposed to pass peers to StartNode(),
we add configuration changes for them to the log, immediately pretend
that they are applied, but actually leave them unapplied (to give the
app a chance to observe them, though if the app did decide to not apply
them things would really go off the rails), and then return control to
the app. The app will then process the initial Readys and as a result
the configuration will be persisted to disk; restarts of the node then
use RestartNode which doesn't take any peers.
The code that did this lived awkwardly in two places fairly deep down
the callstack, though it was really only necessary in StartNode(). This
commit refactors things to make this more obvious: only StartNode does
this dance now. In particular, RawNode does not support this at all any
more; it expects the app to set up its Storage correctly.
Future work may provide helpers to make this "preseeding" of the Storage
more user-friendly. It isn't entirely straightforward to do so since
the Storage interface doesn't provide the right accessors for this
purpose. Briefly speaking, we want to make sure that a non-bootstrapped
node can never catch up via the log so that we can implicitly use one
of the "skipped" log entries to represent the configuration change into
the bootstrap configuration. This is an invasive change that affects
all consumers of raft, and it is of lower urgency since the code (post
this commit) already encapsulates the complexity sufficiently.
Mechanically extract `progressTracker`, `Progress`, and `inflights`
to their own package named `tracker`. Add lots of comments in the
progress, and take the opportunity to rename and clarify various
fields.
When using the embed package to embed etcd, sometimes the storage prefix
being used might be full. In this case, this code path triggers, causing
an: `etcdserver: create wal error: no space left on device` error, which
causes a fatal. A fatal differs from a panic in that it also calls
os.Exit(1). In this situation, the calling program that embeds the etcd
server will be abruptly killed, which prevents it from cleaning up
safely, and giving a proper error message. Depending on what the calling
program is, this can cause corruption and data loss.
This patch switches the fatal to a panic. Ideally this would be a
regular error which would get propagated upwards to the StartEtcd
command, but in the meantime at least this can be caught with recover().
This fixes the most common fatal that I've experienced, but there are
surely more that need looking into. If possible, the errors should be
threaded down into the code path so that embedding etcd can be more
robust.
Fixes: https://github.com/etcd-io/etcd/issues/10588
{"level":"warn","ts":1527101858.4149103,"caller":"etcdserver/raft.go:370","msg":"failed to send out heartbeat; took too long, server is overloaded likely from slow disk","heartbeat-interval":0.1,"expected-duration":0.2,"exceeded-duration":0.025771662}
{"level":"warn","ts":1527101858.4149644,"caller":"etcdserver/raft.go:370","msg":"failed to send out heartbeat; took too long, server is overloaded likely from slow disk","heartbeat-interval":0.1,"expected-duration":0.2,"exceeded-duration":0.034015766}
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
- Add/Document "logger" to support structured logging.
- This makes functional tests run easier, since zap logger
provides built-in log redirect to files.
- "etcd --logger-option=zap" to enable structured logging.
- Current "capnslog" will still be used as "default".
- We may switch the default or deprecate "capnslog" in v3.5.
- Either way, will clearly be documented.
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
We enable "raft.Config.CheckQuorum" by default in other
Raft initial starts. So should start with "ForceNewCluster".
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
In the case that follower recieves a snapshot from leader
and crashes before renaming xxx.snap.db to db but after
snapshot has persisted to .wal and .snap, restarting
follower results loading old db, new .wal, and new .snap.
This will causes a index mismatch between snap metadata index
and consistent index from db.
This pr forces an ordering where saving/renaming db must
happen after snapshot is persisted to wal and snap file.
this guarantees wal and snap files are newer than db.
on server restart, etcd server checks if snap index > db consistent index.
if yes, etcd server attempts to load xxx.snap.db where xxx=snap index
if there is any and panic other wise.
FIXES#7628
Problem is:
`Step1`: `etcdserver/raft.go`'s `Ready` process routine sends config-change entries via `r.applyc <- ap` (https://github.com/coreos/etcd/blob/master/etcdserver/raft.go#L193-L203)
`Step2`: `etcdserver/server.go`'s `*EtcdServer.run` routine receives this via `ap := <-s.r.apply()` (https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L735-L738)
`StepA`: `Step1` proceeds without sync, right after sending `r.applyc <- ap`.
`StepB`: `Step2` proceeds without sync, right after `sched.Schedule(s.applyAll(&ep,&ap))`.
`StepC`: `etcdserver` tries to sync with `s.applyAll(&ep,&ap)` by calling `rh.waitForApply()`.
`rh.waitForApply()` waits for all pending jobs to finish in `pkg/schedule`
side. However, the order of `StepA`,`StepB`,`StepC` is not guaranteed. It
is possible that `StepC` happens first, and proceeds without waiting on
apply. And the restarting member comes back as a leader in single-node
cluster, when there is no synchronization between apply-layer and
config-change Raft entry apply. Confirmed with more debugging lines below,
only reproducible with slow CPU VM (~2 vCPU).
```
~:24.005397 I | etcdserver: starting server... [version: 3.2.0+git, cluster version: to_be_decided]
~:24.011136 I | etcdserver: [DEBUG] 29b2d24047a277df waitForApply before
~:24.011194 I | etcdserver: [DEBUG] 29b2d24047a277df starts wait for 0 pending jobs
~:24.011234 I | etcdserver: [DEBUG] 29b2d24047a277df finished wait for 0 pending jobs (current pending 0)
~:24.011268 I | etcdserver: [DEBUG] 29b2d24047a277df waitForApply after
~:24.011348 I | etcdserver: [DEBUG] [0] 29b2d24047a277df is scheduling conf change on 29b2d24047a277df
~:24.011396 I | etcdserver: [DEBUG] [1] 29b2d24047a277df is scheduling conf change on 5edf80e32a334cf0
~:24.011437 I | etcdserver: [DEBUG] [2] 29b2d24047a277df is scheduling conf change on e32e31e76c8d2678
~:24.011477 I | etcdserver: [DEBUG] 29b2d24047a277df scheduled conf change on 29b2d24047a277df
~:24.011509 I | etcdserver: [DEBUG] 29b2d24047a277df scheduled conf change on 5edf80e32a334cf0
~:24.011545 I | etcdserver: [DEBUG] 29b2d24047a277df scheduled conf change on e32e31e76c8d2678
~:24.012500 I | etcdserver: [DEBUG] 29b2d24047a277df applyConfChange on 29b2d24047a277df before
~:24.013014 I | etcdserver/membership: added member 29b2d24047a277df [unix://127.0.0.1:2100515039] to cluster 9250d4ae34216949
~:24.013066 I | etcdserver: [DEBUG] 29b2d24047a277df applyConfChange on 29b2d24047a277df after
~:24.013113 I | etcdserver: [DEBUG] 29b2d24047a277df applyConfChange on 29b2d24047a277df after trigger
~:24.013158 I | etcdserver: [DEBUG] 29b2d24047a277df applyConfChange on 5edf80e32a334cf0 before
~:24.013666 W | etcdserver: failed to send out heartbeat on time (exceeded the 10ms timeout for 11.964739ms)
~:24.013709 W | etcdserver: server is likely overloaded
~:24.013750 W | etcdserver: failed to send out heartbeat on time (exceeded the 10ms timeout for 12.057265ms)
~:24.013775 W | etcdserver: server is likely overloaded
~:24.013950 I | raft: 29b2d24047a277df is starting a new election at term 4
~:24.014012 I | raft: 29b2d24047a277df became candidate at term 5
~:24.014051 I | raft: 29b2d24047a277df received MsgVoteResp from 29b2d24047a277df at term 5
~:24.014107 I | raft: 29b2d24047a277df became leader at term 5
~:24.014146 I | raft: raft.node: 29b2d24047a277df elected leader 29b2d24047a277df at term 5
```
I am printing out the number of pending jobs before we call
`sched.WaitFinish(0)`, and there was no pending jobs, so it returned
immediately (before we schedule `applyAll`).
This is the root cause to:
- https://github.com/coreos/etcd/issues/7595
- https://github.com/coreos/etcd/issues/7739
- https://github.com/coreos/etcd/issues/7802
`sched.WaitFinish(0)` doesn't work when `len(f.pendings)==0` and
`f.finished==0`. Config-change is the first job to apply, so
`f.finished` is 0 in this case.
`f.finished` monotonically increases, so we need `WaitFinish(finished+1)`.
And `finished` must be the one before calling `Schedule`. This is safe
because `Schedule(applyAll)` is the only place adding jobs to `sched`.
Then scheduler waits on the single job of `applyAll`, by getting the
current number of finished jobs before sending `Schedule`.
Or just make it be blocked until `applyAll` routine triggers on the
config-change job. This patch just removes `waitForApply`, and
signal `raftDone` to wait until `applyAll` finishes applying entries.
Confirmed that it fixes the issue, as below:
```
~:43.198354 I | rafthttp: started streaming with peer 36cda5222aba364b (stream MsgApp v2 reader)
~:43.198740 I | etcdserver: [DEBUG] 3988bc20c2b2e40c waitForApply before
~:43.198836 I | etcdserver: [DEBUG] 3988bc20c2b2e40c starts wait for 0 pending jobs, 1 finished jobs
~:43.200696 I | integration: launched 3169361310155633349 ()
~:43.201784 I | etcdserver: [DEBUG] [0] 3988bc20c2b2e40c is scheduling conf change on 36cda5222aba364b
~:43.201884 I | etcdserver: [DEBUG] [1] 3988bc20c2b2e40c is scheduling conf change on 3988bc20c2b2e40c
~:43.201965 I | etcdserver: [DEBUG] [2] 3988bc20c2b2e40c is scheduling conf change on cf5d6cbc2a121727
~:43.202070 I | etcdserver: [DEBUG] 3988bc20c2b2e40c scheduled conf change on 36cda5222aba364b
~:43.202139 I | etcdserver: [DEBUG] 3988bc20c2b2e40c scheduled conf change on 3988bc20c2b2e40c
~:43.202204 I | etcdserver: [DEBUG] 3988bc20c2b2e40c scheduled conf change on cf5d6cbc2a121727
~:43.202444 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 36cda5222aba364b (request ID: 0) before
~:43.204486 I | etcdserver/membership: added member 36cda5222aba364b [unix://127.0.0.1:2100913646] to cluster 425d73f1b7b01674
~:43.204588 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 36cda5222aba364b (request ID: 0) after
~:43.204703 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 36cda5222aba364b (request ID: 0) after trigger
~:43.204791 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 3988bc20c2b2e40c (request ID: 0) before
~:43.205689 I | etcdserver/membership: added member 3988bc20c2b2e40c [unix://127.0.0.1:2101113646] to cluster 425d73f1b7b01674
~:43.205783 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 3988bc20c2b2e40c (request ID: 0) after
~:43.205929 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on 3988bc20c2b2e40c (request ID: 0) after trigger
~:43.206056 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on cf5d6cbc2a121727 (request ID: 0) before
~:43.207353 I | etcdserver/membership: added member cf5d6cbc2a121727 [unix://127.0.0.1:2100713646] to cluster 425d73f1b7b01674
~:43.207516 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on cf5d6cbc2a121727 (request ID: 0) after
~:43.207619 I | etcdserver: [DEBUG] 3988bc20c2b2e40c applyConfChange on cf5d6cbc2a121727 (request ID: 0) after trigger
~:43.207710 I | etcdserver: [DEBUG] 3988bc20c2b2e40c finished scheduled conf change on 36cda5222aba364b
~:43.207781 I | etcdserver: [DEBUG] 3988bc20c2b2e40c finished scheduled conf change on 3988bc20c2b2e40c
~:43.207843 I | etcdserver: [DEBUG] 3988bc20c2b2e40c finished scheduled conf change on cf5d6cbc2a121727
~:43.207951 I | etcdserver: [DEBUG] 3988bc20c2b2e40c finished wait for 0 pending jobs (current pending 0, finished 1)
~:43.208029 I | rafthttp: started HTTP pipelining with peer cf5d6cbc2a121727
~:43.210339 I | rafthttp: peer 3988bc20c2b2e40c became active
~:43.210435 I | rafthttp: established a TCP streaming connection with peer 3988bc20c2b2e40c (stream MsgApp v2 reader)
~:43.210861 I | rafthttp: started streaming with peer 3988bc20c2b2e40c (writer)
~:43.211732 I | etcdserver: [DEBUG] 3988bc20c2b2e40c waitForApply after
```
Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
raftNode was being initialized in start(), which was causing
hangs when trying to stop the etcd server since the stop channel
would not be initialized in time for the stop call. Instead,
setup non-configurable bits in a constructor.
Fixes#7668
When apply-layer sees configuration change entry in
raft.Ready.CommittedEntries, the server should not proceed
until that entry is applied. Otherwise, follower's raft
layer advances, possibly election-timeouts, and becomes
the leader in single-node cluster, before add-node conf
change of other nodes is applied.
Fix https://github.com/coreos/etcd/issues/7595.
Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
If we promote the lessor before finish applying all
entries from the last term, we might incorrectly renew
the already revoked leases.
Here is an example:
- Term 1: revoke lease A accepted by raft
- Old leader failed, new election happened
- Term 2: promote
- Term 2: keep alive A succeed. A now has 10 seconds TTL
- Term 2: revoke lease A from Term 1 got committed and applied
- Term 2: the lease A with 10 seconds TTL is revoked
To solve this, the new leader MUST apply all entries from old term
before promote its lessor to start accept renew requests.