This PR resolves an issue where the `/metrics` endpoints exposed by the proxy returned metrics for the proxy itself rather than for the etcd member servers.
Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
When using the embed package to embed etcd, the storage prefix being used
may sometimes be full. In that case, this code path triggers an
`etcdserver: create wal error: no space left on device` error, which
causes a fatal. A fatal differs from a panic in that it also calls
os.Exit(1). In this situation, the calling program that embeds the etcd
server is abruptly killed, which prevents it from cleaning up safely
and from giving a proper error message. Depending on what the calling
program is, this can cause corruption and data loss.
This patch switches the fatal to a panic. Ideally this would be a
regular error propagated up to the StartEtcd call, but in the meantime
it can at least be caught with recover().
This fixes the most common fatal I've experienced, but there are
surely more that need looking into. If possible, the errors should be
threaded through the code paths so that embedding etcd can be made more
robust.
Fixes: https://github.com/etcd-io/etcd/issues/10588
This is a cherry-picked version of upstream: 368f70a37cf25b432f01921d3f05a3bc0357297a
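For illustration, here is a minimal sketch of how an embedding program could catch the panic with recover(), assuming the panic surfaces on the StartEtcd goroutine; the wrapper name and data directory are illustrative, and the import path follows the pre-module v3.4 layout:

```go
package main

import (
	"fmt"
	"log"

	"go.etcd.io/etcd/embed"
)

// startEmbedded wraps embed.StartEtcd so that a panic raised inside it
// (such as the "no space left on device" WAL-creation failure above) is
// converted back into an ordinary error instead of killing the whole
// process via os.Exit(1).
func startEmbedded(cfg *embed.Config) (e *embed.Etcd, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("embedded etcd panicked: %v", r)
		}
	}()
	return embed.StartEtcd(cfg)
}

func main() {
	cfg := embed.NewConfig()
	cfg.Dir = "default.etcd"
	e, err := startEmbedded(cfg)
	if err != nil {
		// The embedding program can now clean up and report a proper error.
		log.Fatalf("failed to start embedded etcd: %v", err)
	}
	defer e.Close()
	<-e.Server.ReadyNotify()
	log.Println("embedded etcd is ready")
}
```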
This PR adds another probing routine to monitor the connection
for Raft message transports. Previously, we only monitored
snapshot transports.
In our production cluster, we found one TCP connection had >8-sec
latencies to a remote peer, but the "etcd_network_peer_round_trip_time_seconds"
metric showed a <1-sec latency distribution, which means the etcd server
was not sampling enough while such latency spikes happened
outside of the snapshot pipeline connection.
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
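For context, the round-trip-time metric mentioned above is a Prometheus histogram that each probing routine feeds with its measured RTTs. A sketch of the general shape, not necessarily the exact etcd definition; the function name is hypothetical:

```go
package rafthttp

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// rttSec mirrors the shape of etcd's peer round-trip-time histogram:
// exponential buckets starting at 0.1ms, labeled by remote peer ID.
var rttSec = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "etcd",
	Subsystem: "network",
	Name:      "peer_round_trip_time_seconds",
	Help:      "Round-trip-time histograms between peers.",
	Buckets:   prometheus.ExponentialBuckets(0.0001, 2, 14),
},
	[]string{"To"},
)

func init() { prometheus.MustRegister(rttSec) }

// observeRTT would be called by each probing routine; with a prober on
// the Raft message transport as well as the snapshot transport, latency
// spikes on either connection now show up in the histogram.
func observeRTT(peerID string, rtt time.Duration) {
	rttSec.WithLabelValues(peerID).Observe(rtt.Seconds())
}
```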
This establishes a convenient invariant:
- if werr == nil, then the lock is held at that moment.
While we cannot be confident in the stronger invariant ('is exactly locked'),
it was inconvenient that the previous code could return `werr == nil` after
Mutex.Unlock.
That could happen when ctx was canceled or timed out exactly after waitDeletes
successfully returned werr == nil and before `<-ctx.Done()` was checked.
While such a situation is very rare, it is still possible.
Fixes #10111
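From the caller's side, the invariant means a nil error from Lock can always be paired with a deferred Unlock. A minimal usage sketch via the clientv3 concurrency package, where the endpoint and lock key are assumptions:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	s, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer s.Close()

	m := concurrency.NewMutex(s, "/my-lock/")

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The invariant fixed here: if Lock returns nil, the lock is held,
	// even if ctx is canceled concurrently with acquisition, so a
	// deferred Unlock is both safe and necessary on the nil-error path.
	if err := m.Lock(ctx); err != nil {
		log.Fatal(err) // not locked; nothing to unlock
	}
	defer m.Unlock(context.Background())

	log.Println("critical section")
}
```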
Distribution would be:
0.1 second or more
...
25.6 seconds or more
51.2 seconds or more
etcd_network_snapshot_send_success
etcd_network_snapshot_send_failures
etcd_network_snapshot_send_total_duration_seconds
etcd_network_snapshot_receive_success
etcd_network_snapshot_receive_failures
etcd_network_snapshot_receive_total_duration_seconds
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
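For reference, a bucket layout of 0.1s doubling up to 51.2s is exactly what prometheus.ExponentialBuckets(0.1, 2, 10) produces. A sketch of how one of the duration histograms above could be declared; the exact options in etcd may differ, and the counter pair is omitted for brevity:

```go
package rafthttp

import "github.com/prometheus/client_golang/prometheus"

// snapshotSendSeconds sketches the total-duration histogram named above.
// Buckets: 0.1, 0.2, 0.4, ..., 25.6, 51.2 seconds ("0.1 second or more"
// through "51.2 seconds or more"): lowest bucket 0.1s, doubled 10 times.
var snapshotSendSeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "etcd",
	Subsystem: "network",
	Name:      "snapshot_send_total_duration_seconds",
	Help:      "Total latency distributions of v3 snapshot sends.",
	Buckets:   prometheus.ExponentialBuckets(0.1, 2, 10),
},
	[]string{"To"},
)

func init() { prometheus.MustRegister(snapshotSendSeconds) }
```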
To improve the debuggability of etcd v3, this adds a grpc interceptor to
log info on incoming requests to the etcd server. The log output includes
remote client info, request content (with the value field redacted), request
handling latency, response size, etc. It uses the zap logger if available,
otherwise capnslog.
Also cleaned up the chaining of grpc interceptors on the server side.
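A minimal sketch of such a unary interceptor, assuming zap; the function name is hypothetical, and the value redaction and response-size fields are omitted for brevity:

```go
package v3rpc

import (
	"context"
	"time"

	"go.uber.org/zap"
	"google.golang.org/grpc"
	"google.golang.org/grpc/peer"
	"google.golang.org/grpc/status"
)

// newLogUnaryInterceptor returns a unary interceptor in the spirit of the
// one described above: it logs the remote client, method, handling
// latency, and error code of each incoming request.
func newLogUnaryInterceptor(lg *zap.Logger) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		start := time.Now()
		resp, err := handler(ctx, req)

		remote := "unknown"
		if p, ok := peer.FromContext(ctx); ok {
			remote = p.Addr.String()
		}
		lg.Info("request handled",
			zap.String("remote", remote),
			zap.String("method", info.FullMethod),
			zap.Duration("latency", time.Since(start)),
			zap.String("code", status.Code(err).String()),
		)
		return resp, err
	}
}
```

Such an interceptor is then combined with the others via chaining (e.g. grpc_middleware.ChainUnaryServer from go-grpc-middleware), which is the kind of cleanup the last sentence refers to.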