Due to a duplicate call of clientConfigFromCmd, the move-leader command
would fail with "conflicting environment variable is shadowed by corresponding command-line flag",
even in scenarios where no command-line flag was supplied.
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
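A minimal sketch of the shape of the fix, with hypothetical stand-ins for the etcdctl internals: resolve the client configuration once and hand it to the command, instead of re-running the flag/environment resolution whose conflict check rejects a second pass.
```
package command

// clientConfig and the functions below are illustrative stand-ins,
// not the actual etcdctl types.
type clientConfig struct {
	endpoints []string
}

// resolveClientConfig merges command-line flags and environment
// variables; running it a second time trips the "shadowed by
// corresponding command-line flag" conflict check.
func resolveClientConfig() *clientConfig {
	// ... flag/env resolution happens exactly once here ...
	return &clientConfig{endpoints: []string{"127.0.0.1:2379"}}
}

// moveLeaderCommandFunc receives the already-resolved configuration
// rather than calling resolveClientConfig again.
func moveLeaderCommandFunc(cfg *clientConfig, transfereeID uint64) error {
	_ = cfg.endpoints // use cfg to build the client
	_ = transfereeID
	return nil
}
```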
prefixArgs uses os.Setenv in e2e tests instead of envMap.
This creates overwrites in some test cases and has an impact
on test quality and isolation between tests.
This PR uses the ctlcontext envMap in each test with high priority
and merges the OS environment variables with low priority.
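A minimal sketch of the merge order described above, assuming a hypothetical mergeEnv helper (the names are illustrative, not the e2e framework's API): envMap entries take precedence over the inherited OS variables, and the parent process environment is never mutated.
```
package e2e

import (
	"os"
	"strings"
)

// mergeEnv builds the environment for a spawned etcdctl process:
// OS variables form the low-priority base and the per-test envMap
// overrides them, without calling os.Setenv, so tests stay isolated.
func mergeEnv(envMap map[string]string) []string {
	merged := make(map[string]string)
	for _, kv := range os.Environ() {
		if i := strings.Index(kv, "="); i >= 0 {
			merged[kv[:i]] = kv[i+1:]
		}
	}
	for k, v := range envMap { // high priority: test-specific values win
		merged[k] = v
	}
	env := make([]string, 0, len(merged))
	for k, v := range merged {
		env = append(env, k+"="+v)
	}
	return env
}
```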
Narrowly prevent etcd from crashing when given a bad ACTIVATE payload, e.g.:
$ curl -d "{\"action\":\"ACTIVATE\"}" ${ETCD}/v3/maintenance/alarm
curl: (52) Empty reply from server
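A hedged sketch of the kind of guard involved: reject an ACTIVATE/DEACTIVATE request that does not name a concrete alarm before acting on it. The types below are illustrative stand-ins, not etcd's actual maintenance API.
```
package alarm

import "errors"

// Illustrative stand-ins for the alarm request fields.
type AlarmAction int
type AlarmType int

const (
	ActionGet AlarmAction = iota
	ActionActivate
	ActionDeactivate
)

const (
	AlarmNone AlarmType = iota
	AlarmNospace
	AlarmCorrupt
)

type AlarmRequest struct {
	Action AlarmAction
	Alarm  AlarmType
}

// validate rejects malformed requests instead of letting them reach
// code that assumes a concrete alarm type, which is what the bare
// {"action":"ACTIVATE"} payload above would otherwise hit.
func validate(r *AlarmRequest) error {
	if (r.Action == ActionActivate || r.Action == ActionDeactivate) && r.Alarm == AlarmNone {
		return errors.New("alarm: ACTIVATE/DEACTIVATE requires a specific alarm type")
	}
	return nil
}
```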
Motivation is as follows:
- in etcdctl we depend only on clientv3 APIs, with no dependencies on bolt, backend, mvcc, or the file layout
- etcdctl can be officially supported across a wide range of versions, while etcdutl is pretty specific to the file format at a particular version.
It's a step towards the desired module layout, documented in: https://etcd.io/docs/next/dev-internal/modules/
Two problems:
- spawnCmdWithLogger was not implemented (when built with the 'cov' tag)
- the logic depended on relative paths. We change them to absolute paths
to be able to run in the test-specific temporary directories (see the sketch below).
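A small sketch of the path change (names are illustrative), assuming the binary location starts out relative to the repository layout:
```
package e2e

import "path/filepath"

// binPath resolves a binary location such as "../../bin/etcd" to an
// absolute path once, so tests can later run inside per-test temporary
// directories without breaking the relative reference.
func binPath(rel string) (string, error) {
	return filepath.Abs(rel)
}
```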
In some environments, the CA is not able to sign certificates with both
'client auth' and 'server auth' extended usage parameters, so an operator
needs to be able to set a separate client certificate to use when making
requests, which is different from the certificate used for accepting requests.
This applies to both proxy and etcd member mode and is available as both a CLI
flag and a config file field for peer TLS.
Signed-off-by: Ben Meier <ben.meier@oracle.com>
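As a rough illustration of the idea using only the standard crypto/tls package (the actual etcd flag and config field names are not shown here), a peer can present one certificate when accepting connections and a different one when dialing out:
```
package peertls

import "crypto/tls"

// newPeerTLSConfig returns a TLS config where serverCert (signed for
// 'server auth') is presented to incoming peer connections, while
// clientCert (signed for 'client auth') is presented when this member
// acts as a TLS client toward other peers.
func newPeerTLSConfig(serverCert, clientCert tls.Certificate) *tls.Config {
	return &tls.Config{
		// Used when accepting peer connections.
		Certificates: []tls.Certificate{serverCert},
		// Used when making outbound peer requests.
		GetClientCertificate: func(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
			return &clientCert, nil
		},
	}
}
```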
Prior to this PR, the e2e tests were creating dirs like:
```
/tmp/testname1.etcd030299846
/tmp/testname0.etcd039445123
/tmp/testname0.etcd206372065
```
and not cleaning them up, which led to disk-space-exceeded flakes.
After the PR, the testing.TB tempdir mechanism is used: the directories are
cleaned up automatically and their names are more meaningful:
```
../../bin/etcd --name test-TestCtlV3EndpointHashKV-2 --listen-client-urls http://localhost:20010 --advertise-client-urls http://localhost:20010 --listen-peer-urls https://localhost:20011 --initial-advertise-peer-urls https://localhost:20011 --initial-cluster-token new --data-dir /tmp/TestCtlV3EndpointHashKV429176179/003 --snapshot-count 100000 --experimental-initial-corrupt-check --peer-auto-tls --initial-cluster test-TestCtlV3EndpointHashKV-0=https://localhost:20001,test-TestCtlV3EndpointHashKV-1=https://localhost:20006,test-TestCtlV3EndpointHashKV-2=https://localhost:20011
```
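A minimal sketch of the mechanism: testing.TB's TempDir creates a uniquely named directory per test and removes it when the test finishes (the newEtcdProcess helper referenced in the comment is hypothetical).
```
package e2e

import "testing"

func TestCtlV3Example(t *testing.T) {
	// t.TempDir returns a fresh directory named after the test
	// (e.g. /tmp/TestCtlV3Example.../001) and removes it, together
	// with its contents, when the test and its subtests finish.
	dataDir := t.TempDir()

	// Hypothetical helper: start an etcd process with --data-dir
	// pointing into the per-test temporary directory.
	_ = dataDir // e.g. newEtcdProcess(t, "--data-dir", dataDir)
}
```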
The root cause of the flakes was that the server was considered ready too
early. In particular, the following step:
```
../../bin/etcd-2456648: {"level":"info","ts":"2021-01-11T09:56:44.474+0100","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"ed5f620d34a8e61b","remote-peer-id":"ca50e9357181d758"}
../../bin/etcd-2456648: {"level":"warn","ts":"2021-01-11T09:56:49.040+0100","caller":"etcdserver/server.go:1942","msg":"failed to publish local member to cluster through raft","local-member-id":"ed5f620d34a8e61b","local-member-attributes":"{Name:infra2 ClientURLs:[http://localhost:20030]}","request-path":"/0/members/ed5f620d34a8e61b/attributes","publish-timeout":"7s","error":"etcdserver: request timed out, possibly due to connection lost"}
../../bin/etcd-2456648: {"level":"info","ts":"2021-01-11T09:56:49.049+0100","caller":"etcdserver/server.go:1921","msg":"published local member to cluster through raft","local-member-id":"ed5f620d34a8e61b","local-member-attributes":"{Name:infra2 ClientURLs:[http://localhost:20030]}","request-path":"/0/members/ed5f620d34a8e61b/attributes","cluster-id":"34f27e83b3bc2ff","publish-timeout":"7s"}
```
was taking 5s. If this happened concurrently with etcdctl, the
etcdctl command could time out.
The fix requires servers to report 'ready to serve client requests' before they are considered up.
Also fixed some whitelisted 'goroutines'.
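A rough sketch of the readiness check mentioned above (the real harness uses its own expect-style helpers, so the function here is illustrative): scan the server's log output for the 'ready to serve client requests' message before treating the member as up.
```
package e2e

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// waitReady blocks until the etcd process log emits the readiness
// message, so etcdctl is only invoked against members that can
// actually serve client requests.
func waitReady(logs io.Reader) error {
	scanner := bufio.NewScanner(logs)
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), "ready to serve client requests") {
			return nil
		}
	}
	if err := scanner.Err(); err != nil {
		return err
	}
	return fmt.Errorf("log stream ended before server became ready")
}
```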
The commit ensures that spawned etcdctl processes are "closed",
so they go through proper OS wait processing.
Not doing so might have contributed to the file-descriptor/open-files limit being
exceeded.
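A minimal illustration with os/exec (the e2e harness uses its own expect wrapper, so this only shows the underlying idea): always Wait on a finished child so the OS can reap it and its pipes are released.
```
package e2e

import "os/exec"

// runCtl spawns an etcdctl invocation and always waits for it,
// releasing the process table entry and any associated file
// descriptors even when the command fails.
func runCtl(bin string, args ...string) error {
	cmd := exec.Command(bin, args...)
	if err := cmd.Start(); err != nil {
		return err
	}
	// Wait reaps the child and closes the stdin/stdout/stderr pipes,
	// preventing zombie processes and leaked descriptors.
	return cmd.Wait()
}
```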
Before this patch, a client that cancelled the context for a watch caused the
server to generate a `rpctypes.ErrGRPCNoLeader` error, which led to the recording of
a gRPC `Unavailable` metric in association with the client watch cancellation.
The metric looks like this:
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}
So the watch server misidentified the error as a server error and then
propagated the mistake to metrics, creating a false indicator that the leader
had been lost. This false signal then led to false alerting.
The commit 9c103dd0dedfc723cd4f33b6a5e81343d8a6bae7 introduced an interceptor which wraps
watch streams requiring a leader, causing those streams to be actively canceled
when leader loss is detected.
However, the error handling code assumes all stream context cancellations are
from the interceptor. This assumption breaks when the context was canceled
because of a client stream cancellation.
The core challenge is lack of information conveyed via `context.Context` which
is shared by both the send and receive sides of the stream handling and is
subject to cancellation by all paths (including the gRPC library itself). If any
piece of the system cancels the shared context, there's no way for a context
consumer to understand who cancelled the context or why.
To solve the ambiguity of the stream interceptor code specifically, this patch
introduces a custom context struct which the interceptor uses to expose a custom
error through the context when the interceptor decides to actively cancel a
stream. Now the consuming side can more safely assume a generic context
cancellation can be propagated as a cancellation, and the server generated
leader error is preserved and propagated normally without any special inference.
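A hedged sketch of the pattern (the identifiers below are hypothetical, not the ones used in the patch): wrap the stream context so that when the interceptor cancels it, it also records a reason the error-handling code can inspect; a cancellation with no recorded reason can then be treated as client-initiated.
```
package v3rpc

import (
	"context"
	"sync"
)

// cancelReasonCtx wraps a stream context so the component that cancels
// it can attach the reason for doing so.
type cancelReasonCtx struct {
	context.Context
	cancel context.CancelFunc
	mu     sync.Mutex
	reason error
}

func newCancelReasonCtx(parent context.Context) *cancelReasonCtx {
	ctx, cancel := context.WithCancel(parent)
	return &cancelReasonCtx{Context: ctx, cancel: cancel}
}

// cancelWith records why the interceptor is tearing the stream down
// (e.g. a "no leader" error) before cancelling the wrapped context.
func (c *cancelReasonCtx) cancelWith(reason error) {
	c.mu.Lock()
	c.reason = reason
	c.mu.Unlock()
	c.cancel()
}

// cancelReason returns the recorded reason, or nil if the cancellation
// came from somewhere else (the client, the gRPC library, ...).
func (c *cancelReasonCtx) cancelReason() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.reason
}
```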
When a client cancels the stream, there remains a race in the error handling
code between the send and receive goroutines whereby the underlying gRPC error
is lost in the case where the send path returns and is handled first, but this
issue can be addressed separately since, no matter which path wins, we can detect a
generic cancellation.
This is a replacement of https://github.com/etcd-io/etcd/pull/11375.
Fixes #10289, #9725, #9576, #9166