There are situations where we don't wish to fsync but we do want to
write the data.
Typically this occurs in clusters where fsync latency (often the
result of firmware) transiently spikes. For Kubernetes clusters this
causes (many) elections which have knock-on effects such that the API
server will transiently fail causing other components fail in turn.
By writing the data (buffered and asynchronously flushed, so in most
situations the write is fast) and avoiding the fsync we no longer
trigger this situation and opportunistically write out the data.
Anecdotally:
Because the fsync is missing there is the argument that certain
types of failure events will cause data corruption or loss, in
testing this wasn't seen. If this was to occur the expectation is
the member can be readded to a cluster or worst-case restored from a
robust persisted snapshot.
The etcd members are deployed across isolated racks with different
power feeds. An instantaneous failure of all of them simultaneously
is unlikely.
Testing was usually of the form:
* create (Kubernetes) etcd write-churn by creating replicasets of
some 1000s of pods
* break/fail the leader
Failure testing included:
* hard node power-off events
* disk removal
* orderly reboots/shutdown
In all cases when the node recovered it was able to rejoin the
cluster and synchronize.
This makes it possible to run an etcd node for testing and development
without placing lots of load on the file system.
Fixes#11930.
Signed-off-by: David Crawshaw <crawshaw@tailscale.com>
a. add comment of reopening file in cut function.
b. add const frameSizeBytes in decoder.
c. return directly if locked files empty in ReleaseLockTo function.
Use the path/filepath package instead of the path package. The
path package assumes slash-separated paths, which doesn't work
on Windows. But path/filepath manipulates filename paths in a way
that's compatible across OSes.
Windows requires this lock to be released before the directory is
renamed. But on unix-like operating systems, releasing the lock and
trying to reacquire it immediately can be flaky if a process is forked
around the same time. The file descriptors are marked as close-on-exec
by the Go runtime, but there is a window between the fork and exec where
another process will be holding the lock.
Whenever the WAL is opened for writes, it should write zeroes to its tail
starting from the first zero record. Otherwise, if there are entries past
the first zero record due to a torn write, any new writes that overlap the
old entries will lead to a garbage record on the tail and cause a CRC
mismatch.
In test situations, it's useful to create smaller than usual WAL files
to test rotation and to avoid the overhead of preallocation on old-style
filesystems that don't handle it efficiently. This commit changes
segmentSizeBytes to an exported variable so that tests can override it
from an init() function.