In the case that follower recieves a snapshot from leader
and crashes before renaming xxx.snap.db to db but after
snapshot has persisted to .wal and .snap, restarting
follower results loading old db, new .wal, and new .snap.
This will causes a index mismatch between snap metadata index
and consistent index from db.
This pr forces an ordering where saving/renaming db must
happen after snapshot is persisted to wal and snap file.
this guarantees wal and snap files are newer than db.
on server restart, etcd server checks if snap index > db consistent index.
if yes, etcd server attempts to load xxx.snap.db where xxx=snap index
if there is any and panic other wise.
FIXES#7628
500ms keepalive delay on proxy side causes client to sometimes send
a second keepalive since it waits more than 500ms for the first response.
Fixes#7658
Fix https://github.com/coreos/etcd/issues/7865.
It is also possible to have mis-matched key file
while renaming directories.
Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
Connection pausing added another exit condition in the listener
path, causing the bridge to leak connections instead of closing
them when signalled to close. Also adds some additional Close
paranoia.
Fixes#7823
If the balancer update notification loop starts with a downed
connection and endpoints are switched while the old connection is up,
the balancer can potentially wait forever for an up connection without
refreshing the connections to reflect the current endpoints.
Instead, fetch upc/downc together, only caring about a single transition
either from down->up or up->down for each iteration
Simple way to reproduce failures: add time.Sleep(time.Second) to the
beginning of the update notification loop.
previously, apply() doesn't set consistIndex for EntryConfChange type.
this causes a misalignment between consistIndex and applied index
where EntryConfChange entry results setting applied index but not consistIndex.
suppose that addMember() is called and leader reflects that change.
1. applied index and consistIndex is now misaligned.
2. a new follower node joined.
3. leader sends the snapshot to follower
where the applied index is the snapshot metadata index.
4. follower node saves the snapshot and database(includes consistIndex) from leader.
5. restarting follower loads snapshot and database.
6. follower checks snapshot metadata index(same as applied index) and database consistIndex,
finds them don't match, and then panic.
FIXES#7834
If 'StartEtcd' returns before starting gRPC server
(e.g. mismatch snapshot, misconfiguration),
receiving from grpcServerC blocks forever. This patch
just closes the channel to not block on grpcServerC,
and proceeds to next stop operations in Close.
This was masking the issues in https://github.com/coreos/etcd/issues/7834
Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>