There was an undetected 'conflict' between
11ba1a610939218c6779a960f764b2bcfdd7fb83
and 2c66612e0ecb42abde6a4761cd708f5d285e0635.
Moving the file to the proper location.
The raft.status expvar is added at init time.
This change ensures that evaluating that expvar
doesn't panic, even when there is
no server running.
* namespace: check IsWithFromKey if keyLen equals 0.
Rename functions isWithFromKey/isWithPrefix to IsOptsWithFromKey/IsOptsWithPrefix.
fixes: #12282
* integration: add test for WithFromKey/WithPrefix being passed in opts.
We introduce a LazyCluster abstraction (instead of copy-pasted logic)
that creates clusters only if there are runnable tests
that need the infrastructure.
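The lazy-creation idea can be sketched as follows. This is only an illustration built on `sync.Once`; the `cluster` and `lazyCluster` types and their methods are hypothetical and do not reproduce the actual LazyCluster API:

```go
package main

import (
	"fmt"
	"sync"
)

// cluster is a hypothetical stand-in for an expensive test cluster.
type cluster struct{ id int }

// lazyCluster defers cluster creation until the first test asks for it.
type lazyCluster struct {
	once    sync.Once
	c       *cluster
	created int // how many times the factory ran, for illustration only
}

// get creates the cluster on first use and returns the same instance afterwards.
func (l *lazyCluster) get() *cluster {
	l.once.Do(func() {
		l.created++
		l.c = &cluster{id: 1}
	})
	return l.c
}

// terminate releases the cluster and reports whether there was anything to
// release. It is meant to be called once, after all tests have finished.
func (l *lazyCluster) terminate() bool {
	if l.c == nil {
		return false
	}
	l.c = nil
	return true
}

func main() {
	unused := &lazyCluster{}
	fmt.Println(unused.terminate()) // false: no test needed it, nothing was created

	used := &lazyCluster{}
	a, b := used.get(), used.get()
	fmt.Println(a == b, used.created) // true 1
	fmt.Println(used.terminate())     // true
}
```

When no runnable test calls `get`, the expensive setup never happens and teardown is a no-op.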
This CL tries to reconcile 2 objectives:
- Examples should be close (in the same package) to the original code,
such that they can participate in documentation.
- Examples should be runnable, such that they do not get out of
sync with the underlying API/implementation.
In the case of etcd-client, the examples assume an 'integration'
style of running, i.e. they connect to a fully functional etcd-server.
That would lead to a cyclic dependency between modules:
- server depends on client (as the client needs to be lightweight)
- client (for test purposes) depends on server.
Go modules do not allow distinguishing a testing dependency from
a prod-code dependency.
Thus, to meet the objectives:
- The examples are executed within testing/integration packages against a real etcd.
- The examples are symlinked to 'unit' tests, such that they are included in documentation.
- Long-term, the unit examples should be rewritten to use 'mocks' instead of real integration tests.
Should fix the following goword complaints:
```
clientv3/config.go.48: // ("--max-request-bytes" flag to etcd or "embed.Config.MaxRequestBytes"). (spell: MaxRequestBytes -> ?)
clientv3/config.go.55: // ("--max-request-bytes" flag to etcd or "embed.Config.MaxRequestBytes"). (spell: MaxRequestBytes -> ?)
clientv3/leasing/doc.go.15: // Package leasing serves linearizable reads from a local cache by acquiring (spell: linearizable -> infeasible?)
clientv3/op.go.413: // it's linearizable. Serializable requests are better for lower latency (spell: linearizable -> infeasible?)
clientv3/retry.go.49: // an obvious server-side error (e.g. rpctypes.ErrRequestTooLarge). (spell: ErrRequestTooLarge -> Erectile?)
```
The module is supposed to contain a minimal set of files that establish
the public etcd server API. In particular, client libraries for etcd built in
different languages might want to depend on this file.
client: Move client-specific code (protos, version) to the api/
directory. Thanks to this change, the /client directory will not need to depend on
the server code. In the next commits we make "/api" a module on its own.
Mechanical consequence of executing:
```
% git mv version/version.go api/version
% git mv etcdserver/api/v3rpc/rpctypes api/v3rpc
% git mv mvcc/mvccpb api/
% git mv etcdserver/etcdserverpb api/
% git mv auth/authpb api/
% git mv etcdserver/api/membership/membershippb api/
```
The flakes manifested as:
```
--- FAIL: TestV3WatchRestoreSnapshotUnsync (3.59s)
v3_watch_restore_test.go:82: inflight snapshot sends expected 0 or 1, got ""
FAIL
coverage: 55.2% of statements
FAIL go.etcd.io/etcd/v3/integration 3.646s
FAIL
```
The root cause is that all the SnapMsg processing happens on both ends
(leader, follower) asynchronously in goroutines, e.g. on the FIFO scheduler
within EtcdServer.run, so when we observe it through metrics, we don't
know whether it finished (or even got started).
Ideally we should have an EtcdServer.Drain() call that exits when the server
has processed all internal 'queues' and is idle.
The races manifested as flakes; a sample race report is included further below.
See:
https://github.com/etcd-io/etcd/issues/12336
I'm taking the locks for a short duration of time (instead of the whole
duration of Restore) to allow metrics to be gathered while the server
restoration is in progress.
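A simplified sketch of that locking pattern, assuming a hypothetical `store` type with a single guarded field (the real mvcc store is considerably more involved). The lock is held only around each field update, so a concurrent metrics reader can interleave with the restore loop:

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical, much simplified stand-in for a restorable store.
type store struct {
	mu         sync.RWMutex
	currentRev int64
}

// restore replays revisions. The lock is taken per update rather than for
// the whole loop, so readers are never blocked for the full restore.
func (s *store) restore(revs []int64) {
	for _, r := range revs {
		// Expensive per-entry work would happen here, outside the lock.
		s.mu.Lock()
		s.currentRev = r
		s.mu.Unlock()
	}
}

// revMetric is what a metrics handler would call concurrently.
func (s *store) revMetric() int64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.currentRev
}

func main() {
	s := &store{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		s.restore([]int64{1, 2, 3})
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < 100; i++ {
			_ = s.revMetric() // safe to read while restore is running
		}
	}()
	wg.Wait()
	fmt.Println(s.revMetric()) // 3
}
```

The trade-off is that readers may observe intermediate values during the restore, which is acceptable for metrics but would not be for consumers that need a consistent snapshot.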
```
{"level":"warn","ts":"2020-09-26T13:33:13.010Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-c9c21e47-2013-4776-8e83-e331b2caa9ae/localhost:14422410081761184170","attempt":0,"error":"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix localhost:14422410081761184170: connect: no such file or directory\""}
{"level":"warn","ts":"2020-09-26T13:33:13.011Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-c9c21e47-2013-4776-8e83-e331b2caa9ae/localhost:14422410081761184170","attempt":0,"error":"rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix localhost:14422410081761184170: connect: no such file or directory\""}
{"level":"warn","ts":"2020-09-26T13:33:16.285Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-b504e954-e000-42a4-aa4f-70ded8dbef39/localhost:55672762955698614610","attempt":0,"error":"rpc error: code = NotFound desc = etcdserver: requested lease not found"}
{"level":"warn","ts":"2020-09-26T13:33:21.434Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-7945004b-f67e-42aa-af11-a7b40fbbe6fc/localhost:49623072144007561240","attempt":0,"error":"rpc error: code = Canceled desc = context canceled"}
==================
WARNING: DATA RACE
Write at 0x00c000905f78 by goroutine 764:
go.etcd.io/etcd/v3/mvcc.(*store).restore()
/go/src/go.etcd.io/etcd/mvcc/kvstore.go:397 +0x773
go.etcd.io/etcd/v3/mvcc.(*store).Restore()
/go/src/go.etcd.io/etcd/mvcc/kvstore.go:343 +0x5f1
go.etcd.io/etcd/v3/mvcc.(*watchableStore).Restore()
/go/src/go.etcd.io/etcd/mvcc/watchable_store.go:199 +0xe2
go.etcd.io/etcd/v3/etcdserver.(*EtcdServer).applySnapshot()
/go/src/go.etcd.io/etcd/etcdserver/server.go:1107 +0xa49
go.etcd.io/etcd/v3/etcdserver.(*EtcdServer).applyAll()
/go/src/go.etcd.io/etcd/etcdserver/server.go:1031 +0x6d
go.etcd.io/etcd/v3/etcdserver.(*EtcdServer).run.func8()
/go/src/go.etcd.io/etcd/etcdserver/server.go:986 +0x53
go.etcd.io/etcd/v3/pkg/schedule.(*fifo).run()
/go/src/go.etcd.io/etcd/pkg/schedule/schedule.go:157 +0x11e
Previous read at 0x00c000905f78 by goroutine 180:
[failed to restore the stack]
Goroutine 764 (running) created at:
go.etcd.io/etcd/v3/pkg/schedule.NewFIFOScheduler()
/go/src/go.etcd.io/etcd/pkg/schedule/schedule.go:70 +0x2b1
go.etcd.io/etcd/v3/etcdserver.(*EtcdServer).run()
/go/src/go.etcd.io/etcd/etcdserver/server.go:871 +0x32c
Goroutine 180 (running) created at:
net/http.(*Server).Serve()
/usr/local/go/src/net/http/server.go:2933 +0x5b6
net/http/httptest.(*Server).goServe.func1()
/usr/local/go/src/net/http/httptest/server.go:308 +0xd3
==================
--- FAIL: TestV3WatchRestoreSnapshotUnsync (6.74s)
testing.go:906: race detected during execution of test
FAIL
coverage: 83.5% of statements
FAIL go.etcd.io/etcd/v3/integration 231.272s
FAIL
Command 'go test -timeout=30m -cpu=1 --race --cover=true go.etcd.io/etcd/v3/integration' failed.
```
- We were leaking goroutines in auth-test
- The goroutines were depending on / modifying global test environment
variables (simpleTokenTTLDefault), leading to races
Removed the leaked goroutines, and expanded the 'auth' package to
be covered by leaked goroutine detection.
1. An environment variable assignment cannot be inside quotes.
2. "--race" testing for unit tests is supposed to be part of the linux-amd64-unit-4-cpu-race config.
3. The 'run' function in the test script should call log_error when a command
fails (the wrong operator was used for integer comparison in bash).
* ./tests: Remove legacy coverage collection code
The legacy tests/cover.test.bash script has not been compatible with the
./test script for a long time.
The following method of coverage collection works (also across
packages) and does not slow down the whole test execution.
```
COVERDIR=coverage PASSES="build build_cov cov" ./test
go tool cover -html ./coverage/cover.out
```
* CI: Reduce duplicated coverage between different variants on Travis
We used to execute unit tests in 3 different jobs,
every time with --race detection and every time in 3 variants: 1, 2, and 4
CPUs.
The proposed change makes each of the jobs use a different variant of
CPUs, and only the 4-cpu variant runs with --race detection
(as the more-parallel variant is more likely to experience races).
Commit inspired by this failure:
https://travis-ci.com/github/etcd-io/etcd/jobs/391164537
This is not happening locally, but can be forced by removing the go.sum
file. Let's watch how frequently we will need to refresh go.sum.
This refactoring offers the following benefits:
- A unified way of calling go test commands (in terms of flag interpretation)
- Uses standard go mechanisms (like go list) to find files/packages that are subject to testing. These mechanisms are module-aware.
- Added instructions on how to install the tools needed for the tests/checkers.
- Added colors to the output to make it easier to spot any failure.
Confirmed to work using:
- COVERDIR="./coverage" CPU="4" RACE=false COVER=false PASSES="build build_cov cov" ./test
- CPU="4" RACE=false COVER=false PASSES="e2e functional integration" ./test
- COVERDIR="./coverage" COVER="false" CPU="4" RACE="false" PASSES="fmt build unit build_cov integration e2e integration_e2e grpcproxy cov" ./test
- PASSES=unit PKG=./wal TIMEOUT=1m ./test
- PASSES=integration PKG=./clientv3 TIMEOUT=1m ./test
- PASSES=unit PKG=./wal TESTCASE=TestNew TIMEOUT=1m ./test
- PASSES=unit PKG=./wal TESTCASE="\bTestNew\b" TIMEOUT=1m ./test
- PASSES=integration PKG=./client/integration TESTCASE="\bTestV2NoRetryEOF\b" TIMEOUT=1m ./test
- COVERDIR=coverage PASSES="build_cov cov" ./test
To improve debuggability of `agreement among raft nodes before
linearized reading`, we added some tracing inside
`linearizableReadLoop`.
This will allow us to know the timing of `s.r.ReadIndex` vs
`s.applyWait.Wait(rs.Index)`.
Exemplar flake: https://travis-ci.com/github/etcd-io/etcd/jobs/388806782
```
go test -timeout=5m -cpu=1 --run=Example ./client/...
ok go.etcd.io/etcd/v3/client 0.085s
testing: warning: no tests to run
PASS
Unexpected goroutines running after all test(s).
1 instances of:
text/template/parse.(*lexer).emit(...)
/usr/local/go/src/text/template/parse/lex.go:157
text/template/parse.lexText(...)
/usr/local/go/src/text/template/parse/lex.go:269 +0x4f0
text/template/parse.(*lexer).run(...)
/usr/local/go/src/text/template/parse/lex.go:230 +0x37
created by text/template/parse.lex
/usr/local/go/src/text/template/parse/lex.go:223 +0x190
FAIL go.etcd.io/etcd/v3/client/integration 0.013s
```
The grpc-proxy test logic was assuming that the context associated with the client is closed,
while in practice all tests called client.Close() without an explicit context close.
The current testing strategy is complicated in two ways:
- the grpc proxy works like a man-in-the-middle of each connection issued
from integration tests, and its lifetime is bound to the connection.
- both connections (client -> proxy, and proxy -> etcd-server) are
represented by the same ClientV3 object instance (with substituted
implementations of KV or watcher).
The fix splits the context representing the proxy from the context representing the proxy -> etcd-server connection,
thus allowing cancellation of the proxy context.
SubTraceStart and SubTraceEnd steps are only placeholders, not real
steps. We should skip them when logging long-duration steps;
otherwise they lead to incorrect start times and durations
for subsequent steps.
This commit:
- Fires a critical alert when the etcd database quota is 95% full
at any given point in time, to alert the user to defrag or increase
the quota. This avoids triggering the alarm that blocks
all writes to etcd, meaning no new objects can be created.
This is needed to make sure the cluster supports running a large number
of nodes and objects.
- Fires a warning when a sudden surge in etcd writes leads to
an increase in the etcd database quota size at an alarming rate, as it
is disruptive. It might be caused by a rogue process, and it's
important to alert the admin.