This fix avoids the assumption of knowing the current version of the
binary. We can query the binary with the version flag to get the actual
version of the given binary we upgrade and downgrade to. The
respectively reported versions should match what is returned by the
version endpoint.
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
In the TestDowngradeUpgradeCluster case, the brand-new cluster is using
simple-config-changer, which means that entries has been committed
before leader election and these entries will be applied when etcdserver
starts to receive apply-requests. The simple-config-changer will mark
the `confState` dirty and the storage backend precommit hook will update
the `confState`.
For the new cluster, the storage version is nil at the beginning. And
it will be v3.5 if the `confState` record has been committed. And it
will be >v3.5 if the `storageVersion` record has been committed.
When the new cluster is ready, the leader will set init cluster version
with v3.6.x. And then it will trigger the `monitorStorageVersion` to
update the `storageVersion` to v3.6.x. If the `confState` record has
been updated before cluster version update, we will get storageVersion
record.
If the storage backend doesn't commit in time, the
`monitorStorageVersion` won't update the version because of `cannot
detect storage schema version: missing confstate information`.
And then we file the downgrade request before next round of
`monitorStorageVersion`(per 4 second), the cluster version will be
v3.5.0 which is equal to the `UnsafeDetectSchemaVersion`'s result.
And we won't see that `The server is ready to downgrade`.
It is easy to reproduce the issue if you use cpuset or taskset to limit
in two cpus.
So, we should wait for the new cluster's storage ready before downgrade
request.
Fixes: #14540
Signed-off-by: Wei Fu <fuweid89@gmail.com>
When we can't reach quorum, we were waiting forever and never sending
the systemd notify message. As a result, systemd would eventually time out
and restart the etcd process which likely would make the unhealthy cluster
in an even worse state
Improves #13785
Signed-off-by: Nicolai Moore <niconorsk@gmail.com>