A WAL object was closed by defer, but the WAL was rewritten afterwards,
so the deferred call closed the already-closed WAL rather than the new
one. This caused a data race between writing the file and cleaning up
the temporary test directory, which led to a non-deterministic bug.
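A minimal Go sketch of the pattern (illustrative only, not the actual
test code): defer w.Close() evaluates its receiver when the defer
statement runs, so reassigning w afterwards leaves the new WAL unclosed;
deferring a closure that reads w at function exit avoids that.

    // Stand-in WAL type used only to demonstrate the defer behaviour.
    package main

    import "fmt"

    type WAL struct{ name string }

    func (w *WAL) Close() { fmt.Println("closing", w.name) }

    func buggy() {
        w := &WAL{name: "first"}
        defer w.Close() // receiver captured now: always closes "first"

        w.Close()                // first WAL closed explicitly
        w = &WAL{name: "second"} // rewritten WAL; never closed by the defer
        _ = w
    }

    func fixed() {
        w := &WAL{name: "first"}
        defer func() { w.Close() }() // closure reads w at function exit

        w.Close()
        w = &WAL{name: "second"} // the defer now closes the current WAL
    }

    func main() {
        buggy()
        fixed()
    }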
Fixes #14332
Signed-off-by: Vladimir Sokolov <vsvastey@gmail.com>
Currently the max size of each WAL entry is hard-coded as 10MB. If users
set a value > 10MB for the flag --max-request-bytes, then etcd may run
into a situation where it successfully processes a big request, but fails
to decode it when replaying the WAL file on startup.
On the other hand, we can't just remove the limit, because if a WAL
entry is somehow corrupted and its recBytes is a huge value, then etcd
may run out of memory. So the solution is to restrict the max size of
each WAL entry to a dynamic value: the remaining size of the WAL file.
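A minimal Go sketch of the idea (illustrative only; the frame layout
and function names here are assumptions, not etcd's actual decoder):
a record whose declared length exceeds the bytes remaining in the file
can never be valid, so it is rejected before any large allocation.

    // walreplay shows the dynamic size check in isolation.
    package walreplay

    import (
        "encoding/binary"
        "fmt"
        "io"
        "os"
    )

    // readRecord reads a length-prefixed record and rejects any record
    // larger than the remaining file size instead of enforcing a fixed
    // 10MB ceiling.
    func readRecord(f *os.File) ([]byte, error) {
        // Bytes remaining from the current offset to the end of file.
        cur, err := f.Seek(0, io.SeekCurrent)
        if err != nil {
            return nil, err
        }
        info, err := f.Stat()
        if err != nil {
            return nil, err
        }
        remaining := info.Size() - cur

        var length int64
        if err := binary.Read(f, binary.LittleEndian, &length); err != nil {
            return nil, err
        }
        remaining -= 8 // the length prefix itself

        // Dynamic limit: a corrupted length cannot exceed what is left
        // in the file, so fail fast rather than running out of memory.
        if length < 0 || length > remaining {
            return nil, fmt.Errorf("record length %d exceeds remaining size %d", length, remaining)
        }

        buf := make([]byte, length)
        if _, err := io.ReadFull(f, buf); err != nil {
            return nil, err
        }
        return buf, nil
    }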
Signed-off-by: Benjamin Wang <wachao@vmware.com>
While it appears that etcd is not vulnerable to CVE-2021-3121,
it is a good idea to update to the new generator so that new
vulnerable code isn't generated in any future APIs. This also lays
to rest the question of whether there is any issue with etcd and
CVE-2021-3121.
There are situations where we don't wish to fsync but we do want to
write the data.
Typically this occurs in clusters where fsync latency (often the
result of firmware) transiently spikes. For Kubernetes clusters this
causes (many) leader elections, which have knock-on effects: the API
server transiently fails, causing other components to fail in turn.
By writing the data (buffered and asynchronously flushed, so in most
situations the write is fast) while skipping the fsync, we no longer
trigger these elections and still opportunistically persist the data.
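A minimal Go sketch of such a write path (illustrative only; the WAL
type, field names, and the flag wiring are assumptions, not etcd's
actual implementation): the entry is always written and the user-space
buffer flushed, but the fsync is skipped when the no-fsync mode is on.

    // walsketch demonstrates a conditional fsync on the save path.
    package walsketch

    import (
        "bufio"
        "os"
    )

    type WAL struct {
        f       *os.File
        bw      *bufio.Writer
        noFsync bool // hypothetical; e.g. set from an unsafe no-fsync flag
    }

    // Save appends an entry, flushes the buffered writer so the kernel
    // has the data, and only then decides whether to fsync.
    func (w *WAL) Save(entry []byte) error {
        if _, err := w.bw.Write(entry); err != nil {
            return err
        }
        // Flushing the buffer is cheap: it hands the bytes to the OS,
        // which writes them back asynchronously.
        if err := w.bw.Flush(); err != nil {
            return err
        }
        if w.noFsync {
            // Skip the expensive fsync and accept the durability risk
            // in exchange for avoiding latency spikes.
            return nil
        }
        return w.f.Sync() // wait for the data to reach stable storage
    }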
Anecdotally:
Because the fsync is missing, there is an argument that certain
types of failure events will cause data corruption or loss; in
testing this wasn't seen. If it were to occur, the expectation is
that the member can be re-added to the cluster or, worst case,
restored from a robust persisted snapshot.
The etcd members are deployed across isolated racks with different
power feeds. A simultaneous failure of all of them is unlikely.
Testing was usually of the form:
* create (Kubernetes) etcd write-churn by creating replicasets of
some 1000s of pods
* break/fail the leader
Failure testing included:
* hard node power-off events
* disk removal
* orderly reboots/shutdown
In all cases when the node recovered it was able to rejoin the
cluster and synchronize.