Problem Observed
----------------
When there is no etcd process behind the proxy,
clients repeat resending lease grant requests without delay.
This behavior can cause abnormal resource consumption on CPU/RAM and
network.
Problem Detail
--------------
`LeaseGrant()` uses a bare protobuf client to forward requests.
However, it doesn't use `grpc.FailFast(false)`, which means the method returns
an `Unavailable` error immediately when no etcd process is available.
In clientv3, `Unavailable` errors are not considered the "Halt" error,
and library retries the request without delay.
Both clients and the proxy consume much CPU cycles to process retry requests.
Resolution
----------
Add `grpc.FailFast(false))` to `LeaseGrant()` of the `leaseProxy`.
This makes the proxy not to return immediately when no etcd process is
available. Clients will simply timeout requests instead.
Since the current revision is 0, it'll always be less than the compaction
revision. If the proxy sees a compaction, it would always reject the
current revision requests since it's less than the compaction revision.
Instead, check if the revision is historical before trying to reject on
compaction revision.
Fixes#7599
Leadership timeout can sometimes take too long, such as in test cases.
However, it is possible to infer a leader is available based on RPCs
that must go through consensus. Therefore, have a way to update the
leadership status off the watch path.
groupcache needs a write lock and has no way to expire keys; ccache can
do this, though.
Also removes the key count metric, since there's no way to efficiently
calculate it using ccache.
If client closes but all watch streams are not canceled, the outstanding
watch will wait until it is canceled, causing watch server to potentially
wait forever to close.
Fixes#7102
Checking empty() wasn't grabbing the broadcasts lock so the race detector
flags it as a data race with coalesce(). Instead, just return the number
of remaining watches following delete() and get rid of empty().
The create header revision is the current etcd revision. For watches with
rev=0, the next revision is hdr.rev+1. For watches with rev=n, the next
revision should be n.
Fixes TestDoubleBarrier timeouts.
The single watcher / group watcher distinction limited and
complicated watcher coalescing more than necessary. Reworked:
Each server watcher is represented by a WatchBroadcast, each
client "Watcher" attaches to some WatchBroadcast. WatchBroadcasts
hold all WatchBroadcast instances for a range. WatchRanges holds
all WatchBroadcasts for the proxy.
WatchProxyStreams represent a grpc watch stream between the proxy and
a client. When a client requests a new watcher through its grpc stream,
the ProxyStream will allocate a Watcher and WatchRanges assigns it to
some WatchBroadcast based on its range.
Coalescing is done by WatchBroadcasts when it receives an update
notification from a WatchBroadcast.
Supports leader failure detection so watches on a bad member
can migrate to other members. Coincidentally, Fixes#6303.