client: Move client specific code (protos, version) to the api/
directory. Thanks to this change /client directory will not need to depend on
the server code. In next commits we make "/api" a module on its own.
Mechanical consequences of execution:
% git mv version/version.go api/version
% git mv etcdserver/api/v3rpc/rpctypes api/v3rpc
% git mv mvcc/mvccpb api/
% git mv etcdserver/etcdserverpb api/
% git mv auth/authpb api/
% git mv etcdserver/api/membership/membershippb api/
```
```
The flakes manifested as:
```
--- FAIL: TestV3WatchRestoreSnapshotUnsync (3.59s)
v3_watch_restore_test.go:82: inflight snapshot sends expected 0 or 1, got ""
FAIL
coverage: 55.2% of statements
FAIL go.etcd.io/etcd/v3/integration 3.646s
FAIL
```
The root reason is that all the SnapMsg processing happends on both ends
(leader, follower) assynchronously in goroutines, e.g. on Fifo schedule
within EtcdServer.run, so when we observe through metrics, we don't
know whether it finised (or even got started).
Idally we should have EtcdServer.Drain() call that exits when the server
processed or internal 'queues' and is idle.
The races was manifesting as following flakes:
```
```
See:
https://github.com/etcd-io/etcd/issues/12336
I'm taking the locks for short-duration of time (instead of the whole
duriation of Restore) to allow metrics being gather when the server
restoration is in progress.
```
{"level":"warn","ts":"2020-09-26T13:33:13.010Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-c9c21e47-2013-4776-8e83-e331b2caa9ae/localhost:14422410081761184170","attempt":0,"error":"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix localhost:14422410081761184170: connect: no such file or directory\""}
{"level":"warn","ts":"2020-09-26T13:33:13.011Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-c9c21e47-2013-4776-8e83-e331b2caa9ae/localhost:14422410081761184170","attempt":0,"error":"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix localhost:14422410081761184170: connect: no such file or directory\""}
{"level":"warn","ts":"2020-09-26T13:33:16.285Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-b504e954-e000-42a4-aa4f-70ded8dbef39/localhost:55672762955698614610","attempt":0,"error":"rpc error: code = NotFound desc = etcdserver: requested lease not found"}
{"level":"warn","ts":"2020-09-26T13:33:21.434Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-7945004b-f67e-42aa-af11-a7b40fbbe6fc/localhost:49623072144007561240","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
==================
WARNING: DATA RACE
Write at 0x00c000905f78 by goroutine 764:
go.etcd.io/etcd/v3/mvcc.(*store).restore()
/go/src/go.etcd.io/etcd/mvcc/kvstore.go:397 +0x773
go.etcd.io/etcd/v3/mvcc.(*store).Restore()
/go/src/go.etcd.io/etcd/mvcc/kvstore.go:343 +0x5f1
go.etcd.io/etcd/v3/mvcc.(*watchableStore).Restore()
/go/src/go.etcd.io/etcd/mvcc/watchable_store.go:199 +0xe2
go.etcd.io/etcd/v3/etcdserver.(*EtcdServer).applySnapshot()
/go/src/go.etcd.io/etcd/etcdserver/server.go:1107 +0xa49
go.etcd.io/etcd/v3/etcdserver.(*EtcdServer).applyAll()
/go/src/go.etcd.io/etcd/etcdserver/server.go:1031 +0x6d
go.etcd.io/etcd/v3/etcdserver.(*EtcdServer).run.func8()
/go/src/go.etcd.io/etcd/etcdserver/server.go:986 +0x53
go.etcd.io/etcd/v3/pkg/schedule.(*fifo).run()
/go/src/go.etcd.io/etcd/pkg/schedule/schedule.go:157 +0x11e
Previous read at 0x00c000905f78 by goroutine 180:
[failed to restore the stack]
Goroutine 764 (running) created at:
go.etcd.io/etcd/v3/pkg/schedule.NewFIFOScheduler()
/go/src/go.etcd.io/etcd/pkg/schedule/schedule.go:70 +0x2b1
go.etcd.io/etcd/v3/etcdserver.(*EtcdServer).run()
/go/src/go.etcd.io/etcd/etcdserver/server.go:871 +0x32c
Goroutine 180 (running) created at:
net/http.(*Server).Serve()
/usr/local/go/src/net/http/server.go:2933 +0x5b6
net/http/httptest.(*Server).goServe.func1()
/usr/local/go/src/net/http/httptest/server.go:308 +0xd3
==================
--- FAIL: TestV3WatchRestoreSnapshotUnsync (6.74s)
testing.go:906: race detected during execution of test
FAIL
coverage: 83.5% of statements
FAIL go.etcd.io/etcd/v3/integration 231.272s
FAIL
Command 'go test -timeout=30m -cpu=1 --race --cover=true go.etcd.io/etcd/v3/integration' failed.
```
- We were leaking goroutines in auth-test
- The go-routines were depending / modifying global test environment
variables (simpleTokenTTLDefault) leading to races
Removed the leaked go-routines, and expanded 'auth' package to
be covered we leaked go-routines detection.
1. setting environment variable cannot be in quote
2. "--race" testing for unit tests is supposed to be part of linux-amd64-unit-4-cpu-race config.
3. 'run' function in test script should log_error in case of failed
command (wrong operator for ints comparison in bash).
* ./tests: Remove legacy coverage collection code
The legacy tests/cover.test.bash script was not ./test script
compatible for a long time.
The following method of coverage collection works (also across
packages) and does not make all the test execution slower.
```
COVERDIR=coverage PASSES="build build_cov cov" ./test
go tool cover -html ./coverage/cover.out
```
* CI: Reduce duplicated coverage between different variants on Travis
We used to execute unit tests in 3 different jobs,
every time with --race detection and every time in 3 variants:1,2,4
CPUS.
The proposed change makes each of the jobs use different variant of
CPUS, and only 4-cpu variant is running with --race detection
(as the more-parallel variant is more likely to experience races),
Commit inspired by this failure:
https://travis-ci.com/github/etcd-io/etcd/jobs/391164537
This is not happanning locally - but can be forced by removal of go.sum
file. Let's watch how frequently we will need to refresh go.sum.
This refactoring offers following benefits:
- Unified way how go test commands are being called (in terms of flags intepretation)
- Uses standard go mechanisms (like go lists) to find files/packages that are subject for test. The mechanism are module aware.
- Added instruction how to install tools needed for the tests/checkers.
- Added colors to the output to make it easier to spot any failure.
Confirmed to work using:
- COVERDIR="./coverage" CPU="4" RACE=false COVER=false PASSES="build build_cov cov" ./test
- CPU="4" RACE=false COVER=false PASSES="e2e functional integration" ./test
- COVERDIR="./coverage" COVER="false" CPU="4" RACE="false" PASSES="fmt build unit build_cov integration e2e integration_e2e grpcproxy cov" ./test
- PASSES=unit PKG=./wal TIMEOUT=1m ./test
- PASSES=integration PKG=./clientv3 TIMEOUT=1m ./test
- PASSES=unit PKG=./wal TESTCASE=TestNew TIMEOUT=1m ./test
- PASSES=unit PKG=./wal TESTCASE="\bTestNew\b" TIMEOUT=1m ./test
- PASSES=integration PKG=./client/integration TESTCASE="\bTestV2NoRetryEOF\b" TIMEOUT=1m ./test
- COVERDIR=coverage PASSES="build_cov cov" ./test
To improve debuggability of `agreement among raft nodes before
linearized reading`, we added some tracing inside
`linearizableReadLoop`.
This will allow us to know the timing of `s.r.ReadIndex` vs
`s.applyWait.Wait(rs.Index)`.
Examplar flake: https://travis-ci.com/github/etcd-io/etcd/jobs/388806782
```
go test -timeout=5m -cpu=1 --run=Example ./client/...
ok go.etcd.io/etcd/v3/client 0.085s
testing: warning: no tests to run
PASS
Unexpected goroutines running after all test(s).
1 instances of:
text/template/parse.(*lexer).emit(...)
/usr/local/go/src/text/template/parse/lex.go:157
text/template/parse.lexText(...)
/usr/local/go/src/text/template/parse/lex.go:269 +0x4f0
text/template/parse.(*lexer).run(...)
/usr/local/go/src/text/template/parse/lex.go:230 +0x37
created by text/template/parse.lex
/usr/local/go/src/text/template/parse/lex.go:223 +0x190
FAIL go.etcd.io/etcd/v3/client/integration 0.013s
```
The grpc-proxy test logic was assuming that the context associated to client is closed,
while in practice all tests called client.Close() without explicit context close.
The current testing strategy is complicated 2 fold:
- grpc proxy works like man-in-the middle of each Connection issues
from integration tests and its lifetime is bound to the connection.
- both connections (client -> proxy, and proxy -> etcd-server) are
represented by the same ClientV3 object instance (with substituted
implementations of KV or watcher).
The fix splits context representing proxy from context representing proxy -> etcd-server connection,
thus allowing cancelation of the proxy context.
SubTraceStart and SubTraceEnd steps are only placeholders, not really
steps, we should skip them when logging the long duration steps,
otherwise these steps will lead to incorrect start time and duration
of subsequent steps.
This commit:
- Fires a critical alert when the etcd database quota is 95% full
at any given point of time to alert the user to defrag or increase
the quota in order to avoid the alarm getting triggered which blocks
all the writes to etcd meaning there can't be any new objects created.
This is needed to make sure the cluster supports running large number
of nodes and objects.
- Fires a warning when there is a sudden surge in etcd writes leading to
increase in the etcd database quota size at an alarming rate as it
is disruptive. It might be because of a rougue process and it's
important to alert the admin.
* etcdserver: check authinfo if it is not InternalAuthenticateRequest.
* credentials: let GetRequestMetadata() return nil when authToken isn't initialized.
* clientv3: get AuthToken gracefully without extra connection.
Direct syscalls using syscall.Syscall(SYS_*, ...) should no longer be
used on darwin, see [1]. Instead, use the fcntl libSystem wrappers
provided by the golang.org/x/sys/unix package which implement the same
functionality.
[1] https://golang.org/doc/go1.12#darwin
Fix test case errors in etcd-dump-logs and the error is the time zone.
When the GOST time zone is CST or PST and the test case will be failed.
So we should set UTC as the standard time zone.
Fixes:
go test -tags cluster_proxy ./clientv3/integration -v -run TestWatchRequestProgress
Does not fail the grpc-server (completely) by a not implemented RPC.
Failing whole server by remote request is anti-pattern and security
risk.
Prior to the fix, the command line above was failing with:
```
=== RUN TestWatchRequestProgress/0-watcher
panic: not implemented
goroutine 602 [running]:
go.etcd.io/etcd/v3/proxy/grpcproxy.(*watchProxyStream).recvLoop(0xc0004779d0, 0x0, 0x0)
/home/ptab/corp/etcd/proxy/grpcproxy/watch.go:275 +0xac5
go.etcd.io/etcd/v3/proxy/grpcproxy.(*watchProxy).Watch.func1(0xc0034f94a0, 0xc0004779d0)
/home/ptab/corp/etcd/proxy/grpcproxy/watch.go:129 +0x53
created by go.etcd.io/etcd/v3/proxy/grpcproxy.(*watchProxy).Watch
/home/ptab/corp/etcd/proxy/grpcproxy/watch.go:127 +0x3c8
FAIL go.etcd.io/etcd/v3/clientv3/integration 0.215s
FAIL
```
The flake happened e.g. in:
https://travis-ci.com/github/etcd-io/etcd/jobs/386607570
```
--- PASS: TestWatchClose (0.37s)
PASS
Unexpected goroutines running after all test(s).
1 instances of:
testing.runTests.func1.1(...)
/usr/local/go/src/testing/testing.go:1289 +0x60
created by testing.runTests.func1
/usr/local/go/src/testing/testing.go:1289 +0xdb
FAIL go.etcd.io/etcd/v3/clientv3/integration 344.389s
FAIL
```
This is implementation detail of Go testing.lib and we should not worry.
Marked all 'integrational, e2e' as skipped in the --short mode.
Thanks to this we will be able to significantly simplify ./test script.
The run currently takes ~23s.
With (follow up) move of ~clientv3/snapshot to integration tests (as
part of modularization), we can expect this to fall to 5-10s.
```
% time go test --short ./... --count=1
ok go.etcd.io/etcd/v3 0.098s
? go.etcd.io/etcd/v3/Documentation/learning/lock/client [no test files]
? go.etcd.io/etcd/v3/Documentation/learning/lock/storage [no test files]
ok go.etcd.io/etcd/v3/auth 0.724s
? go.etcd.io/etcd/v3/auth/authpb [no test files]
ok go.etcd.io/etcd/v3/client 0.166s
ok go.etcd.io/etcd/v3/client/integration 0.166s
ok go.etcd.io/etcd/v3/clientv3 3.219s
ok go.etcd.io/etcd/v3/clientv3/balancer 1.102s
? go.etcd.io/etcd/v3/clientv3/balancer/connectivity [no test files]
? go.etcd.io/etcd/v3/clientv3/balancer/picker [no test files]
? go.etcd.io/etcd/v3/clientv3/balancer/resolver/endpoint [no test files]
ok go.etcd.io/etcd/v3/clientv3/clientv3util 0.096s [no tests to run]
ok go.etcd.io/etcd/v3/clientv3/concurrency 3.323s
? go.etcd.io/etcd/v3/clientv3/credentials [no test files]
ok go.etcd.io/etcd/v3/clientv3/integration 0.131s
? go.etcd.io/etcd/v3/clientv3/leasing [no test files]
? go.etcd.io/etcd/v3/clientv3/mirror [no test files]
ok go.etcd.io/etcd/v3/clientv3/namespace 0.041s
ok go.etcd.io/etcd/v3/clientv3/naming 0.115s
ok go.etcd.io/etcd/v3/clientv3/ordering 0.121s
ok go.etcd.io/etcd/v3/clientv3/snapshot 19.325s
ok go.etcd.io/etcd/v3/clientv3/yaml 0.090s
ok go.etcd.io/etcd/v3/contrib/raftexample 7.572s
? go.etcd.io/etcd/v3/contrib/recipes [no test files]
ok go.etcd.io/etcd/v3/embed 0.282s
ok go.etcd.io/etcd/v3/etcdctl 0.054s
? go.etcd.io/etcd/v3/etcdctl/ctlv2 [no test files]
ok go.etcd.io/etcd/v3/etcdctl/ctlv2/command 0.117s
? go.etcd.io/etcd/v3/etcdctl/ctlv3 [no test files]
ok go.etcd.io/etcd/v3/etcdctl/ctlv3/command 0.070s
ok go.etcd.io/etcd/v3/etcdmain 0.172s
ok go.etcd.io/etcd/v3/etcdserver 1.698s
? go.etcd.io/etcd/v3/etcdserver/api [no test files]
ok go.etcd.io/etcd/v3/etcdserver/api/etcdhttp 0.075s
ok go.etcd.io/etcd/v3/etcdserver/api/membership 0.104s
? go.etcd.io/etcd/v3/etcdserver/api/membership/membershippb [no test files]
ok go.etcd.io/etcd/v3/etcdserver/api/rafthttp 0.181s
ok go.etcd.io/etcd/v3/etcdserver/api/snap 0.078s
? go.etcd.io/etcd/v3/etcdserver/api/snap/snappb [no test files]
ok go.etcd.io/etcd/v3/etcdserver/api/v2auth 0.142s
ok go.etcd.io/etcd/v3/etcdserver/api/v2discovery 0.035s
ok go.etcd.io/etcd/v3/etcdserver/api/v2error 0.043s
ok go.etcd.io/etcd/v3/etcdserver/api/v2http 0.070s
ok go.etcd.io/etcd/v3/etcdserver/api/v2http/httptypes 0.031s
? go.etcd.io/etcd/v3/etcdserver/api/v2stats [no test files]
ok go.etcd.io/etcd/v3/etcdserver/api/v2store 0.645s
ok go.etcd.io/etcd/v3/etcdserver/api/v2v3 0.218s
? go.etcd.io/etcd/v3/etcdserver/api/v3alarm [no test files]
? go.etcd.io/etcd/v3/etcdserver/api/v3client [no test files]
ok go.etcd.io/etcd/v3/etcdserver/api/v3compactor 1.765s
? go.etcd.io/etcd/v3/etcdserver/api/v3election [no test files]
? go.etcd.io/etcd/v3/etcdserver/api/v3election/v3electionpb [no test files]
? go.etcd.io/etcd/v3/etcdserver/api/v3election/v3electionpb/gw [no test files]
? go.etcd.io/etcd/v3/etcdserver/api/v3lock [no test files]
? go.etcd.io/etcd/v3/etcdserver/api/v3lock/v3lockpb [no test files]
? go.etcd.io/etcd/v3/etcdserver/api/v3lock/v3lockpb/gw [no test files]
ok go.etcd.io/etcd/v3/etcdserver/api/v3rpc 0.091s
ok go.etcd.io/etcd/v3/etcdserver/api/v3rpc/rpctypes 0.012s
ok go.etcd.io/etcd/v3/etcdserver/cindex 0.054s
ok go.etcd.io/etcd/v3/etcdserver/etcdserverpb 0.039s
? go.etcd.io/etcd/v3/etcdserver/etcdserverpb/gw [no test files]
ok go.etcd.io/etcd/v3/functional/agent 0.094s
? go.etcd.io/etcd/v3/functional/cmd/etcd-agent [no test files]
? go.etcd.io/etcd/v3/functional/cmd/etcd-proxy [no test files]
? go.etcd.io/etcd/v3/functional/cmd/etcd-runner [no test files]
? go.etcd.io/etcd/v3/functional/cmd/etcd-tester [no test files]
ok go.etcd.io/etcd/v3/functional/rpcpb 0.060s
? go.etcd.io/etcd/v3/functional/runner [no test files]
ok go.etcd.io/etcd/v3/functional/tester 0.079s
ok go.etcd.io/etcd/v3/integration 0.684s
ok go.etcd.io/etcd/v3/integration/embed 0.101s
ok go.etcd.io/etcd/v3/lease 3.455s
ok go.etcd.io/etcd/v3/lease/leasehttp 2.185s
? go.etcd.io/etcd/v3/lease/leasepb [no test files]
ok go.etcd.io/etcd/v3/mvcc 7.246s
ok go.etcd.io/etcd/v3/mvcc/backend 0.354s
? go.etcd.io/etcd/v3/mvcc/mvccpb [no test files]
ok go.etcd.io/etcd/v3/pkg/adt 0.025s
? go.etcd.io/etcd/v3/pkg/contention [no test files]
? go.etcd.io/etcd/v3/pkg/cpuutil [no test files]
ok go.etcd.io/etcd/v3/pkg/crc 0.008s
? go.etcd.io/etcd/v3/pkg/debugutil [no test files]
ok go.etcd.io/etcd/v3/pkg/expect 0.015s
ok go.etcd.io/etcd/v3/pkg/fileutil 0.268s
ok go.etcd.io/etcd/v3/pkg/flags 0.021s
ok go.etcd.io/etcd/v3/pkg/httputil 0.020s
ok go.etcd.io/etcd/v3/pkg/idutil 0.008s
ok go.etcd.io/etcd/v3/pkg/ioutil 0.025s
ok go.etcd.io/etcd/v3/pkg/logutil 0.047s
? go.etcd.io/etcd/v3/pkg/mock/mockserver [no test files]
? go.etcd.io/etcd/v3/pkg/mock/mockstorage [no test files]
? go.etcd.io/etcd/v3/pkg/mock/mockstore [no test files]
? go.etcd.io/etcd/v3/pkg/mock/mockwait [no test files]
ok go.etcd.io/etcd/v3/pkg/netutil 1.024s
ok go.etcd.io/etcd/v3/pkg/osutil 0.021s
ok go.etcd.io/etcd/v3/pkg/pathutil 0.008s
ok go.etcd.io/etcd/v3/pkg/pbutil 0.008s
ok go.etcd.io/etcd/v3/pkg/proxy 4.081s
ok go.etcd.io/etcd/v3/pkg/report 0.008s
? go.etcd.io/etcd/v3/pkg/runtime [no test files]
ok go.etcd.io/etcd/v3/pkg/schedule 0.009s
ok go.etcd.io/etcd/v3/pkg/srv 0.019s
ok go.etcd.io/etcd/v3/pkg/stringutil 0.008s
? go.etcd.io/etcd/v3/pkg/systemd [no test files]
ok go.etcd.io/etcd/v3/pkg/testutil 0.023s
ok go.etcd.io/etcd/v3/pkg/tlsutil 3.965s
ok go.etcd.io/etcd/v3/pkg/traceutil 0.034s
ok go.etcd.io/etcd/v3/pkg/transport 0.532s
ok go.etcd.io/etcd/v3/pkg/types 0.028s
ok go.etcd.io/etcd/v3/pkg/wait 0.023s
ok go.etcd.io/etcd/v3/proxy/grpcproxy 0.101s
? go.etcd.io/etcd/v3/proxy/grpcproxy/adapter [no test files]
? go.etcd.io/etcd/v3/proxy/grpcproxy/cache [no test files]
ok go.etcd.io/etcd/v3/proxy/httpproxy 0.044s
ok go.etcd.io/etcd/v3/proxy/tcpproxy 0.047s
ok go.etcd.io/etcd/v3/raft 0.312s
ok go.etcd.io/etcd/v3/raft/confchange 0.183s
ok go.etcd.io/etcd/v3/raft/quorum 0.316s
ok go.etcd.io/etcd/v3/raft/raftpb 0.024s
ok go.etcd.io/etcd/v3/raft/rafttest 0.640s
ok go.etcd.io/etcd/v3/raft/tracker 0.026s
ok go.etcd.io/etcd/v3/tests/e2e 0.077s
? go.etcd.io/etcd/v3/tools/benchmark [no test files]
? go.etcd.io/etcd/v3/tools/benchmark/cmd [no test files]
? go.etcd.io/etcd/v3/tools/etcd-dump-db [no test files]
ok go.etcd.io/etcd/v3/tools/etcd-dump-logs 0.088s
? go.etcd.io/etcd/v3/tools/etcd-dump-metrics [no test files]
? go.etcd.io/etcd/v3/tools/local-tester/bridge [no test files]
? go.etcd.io/etcd/v3/version [no test files]
ok go.etcd.io/etcd/v3/wal 1.517s
? go.etcd.io/etcd/v3/wal/walpb [no test files]
go test --short ./... --count=1 76.12s user 12.57s system 375% cpu 23.635 total
```
As we get rid of ./vendor there is no need to update the dependencies.
If anyone needs up-to-date vendor directory locally, getting it is as
simple as:
```go mod vendor```
Fixes:
go test --tags cluster_proxy --timeout=30m -run TestLeasingTxnOwnerGet -v ./clientv3/integration/...
The explicit code to close client is needed due to:
76e769ce95/clientv3/watch.go (L72)
as just ctx close by LeasingKeyValue store does not interrupts opened Watches.
The only way to interrupt open Watch is to close the 'whole' Watcher / Client.
The code used to:
- report time since previous compaction success as a 'duration' of compaction process itself.
- mix real clock with 'injected' clock. This led to strange log-lines
in tests in order of 234543543h of duration.