1638 Commits

Author SHA1 Message Date
Gyuho Lee
e5c2dff346 etcdserver: detect leader change on reads
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2019-08-14 09:32:10 -07:00
Gyuho Lee
5a678bb4e3 etcdserver/api/v3rpc: support watch fragmentation
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2019-08-14 01:22:29 -07:00
Gyuho Lee
d167714b36 *: regenerate proto
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2019-08-14 01:22:23 -07:00
Gyuho Lee
9f7294f1e0 etcdserver/etcdserverpb/rpc.proto: add watch progress/fragment
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2019-08-14 01:17:29 -07:00
Gyuho Lee
08124105ad *: use new adt.IntervalTree interface
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2019-08-09 11:15:49 -07:00
Gyuho Lee
4527f4c4b0 etcdserver: add "etcd_server_snapshot_apply_inflights_total"
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2019-08-08 15:13:14 -07:00
Gyuho Lee
f179d4d6a3 etcdserver: improve heartbeat send failures logging
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2019-05-02 10:02:28 -07:00
Sam Batschelet
43386ac29b *: Change gRPC proxy to expose etcd server endpoint /metrics
This PR resolves an issue where the `/metrics` endpoints exposed by the proxy were not returning metrics of the etcd members servers but of the proxy itself.

Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
2019-04-11 17:07:40 -04:00
James Shubin
7814718c73 etcdserver: Use panic instead of fatal on no space left error
When using the embed package to embed etcd, sometimes the storage prefix
being used might be full. In this case, this code path triggers, causing
an: `etcdserver: create wal error: no space left on device` error, which
causes a fatal. A fatal differs from a panic in that it also calls
os.Exit(1). In this situation, the calling program that embeds the etcd
server will be abruptly killed, which prevents it from cleaning up
safely, and giving a proper error message. Depending on what the calling
program is, this can cause corruption and data loss.

This patch switches the fatal to a panic. Ideally this would be a
regular error which would get propagated upwards to the StartEtcd
command, but in the meantime at least this can be caught with recover().

This fixes the most common fatal that I've experienced, but there are
surely more that need looking into. If possible, the errors should be
threaded down into the code path so that embedding etcd can be more
robust.

Fixes: https://github.com/etcd-io/etcd/issues/10588

This is a cherry-picked version of upstream: 368f70a37cf25b432f01921d3f05a3bc0357297a
2019-03-29 17:45:48 -04:00
Gyuho Lee
957700f444 etcdserver: add "etcd_server_read_indexes_failed_total"
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-10-09 18:22:02 -07:00
Gyuho Lee
8491137b55 etcdserver: add "etcd_server_health_success/failures"
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-10-09 17:54:30 -07:00
Jingyi Hu
9eee0b078e etcdserver: remove duplicated imports
Removed duplicated imports of package 'context' in server.go
2018-09-13 20:44:03 -07:00
Gyuho Lee
d1acb5a5c8 etcdserver: add "etcd_server_id"
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-08-29 14:50:17 -07:00
Gyuho Lee
73c1100b04 etcdserver: clarify read index wait timeout warnings
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-08-29 14:38:59 -07:00
Gyuho Lee
0dc4632e28 Merge pull request #9861 from gyuho/race
etcdserver/api/v3rpc: remove duplicate gRPC logger set
2018-08-17 22:32:10 -04:00
Jingyi Hu
264bb51a9a etcdserver: code clean up
Code clean up in interceptor.go
2018-08-14 17:08:45 -07:00
Jingyi Hu
c6c0d03522 vendor: add go-grpc-middleware
Rebased to master PR #9994.  Fixed a Go format issue in
v3rpc/interceptor.go.  Updated vendor to include go-grpc-middleware.
2018-08-14 17:08:45 -07:00
Jingyi Hu
94f81368ae etcdserver: add grpc interceptor to log info on incoming requests to etcd server
To improve debuggability of etcd v3. Added a grpc interceptor to log
info on incoming requests to etcd server. The log output includes
remote client info, request content (with value field redacted), request
handling latency, response size, etc. Uses zap logger if available,
otherwise uses capnslog.

Also did some clean up on the chaining of grpc interceptors on server
side.
2018-08-14 16:20:13 -07:00
Gyuho Lee
ea40e9f059 etcdserver: add "etcd_server_go_version" metric
Currently, one has to look at server logs manually,
to see what Go version was used to build etcd server.

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-23 16:39:24 -07:00
Wenjia
7f421efe48
remove "github.com/gogo/protobuf/plugin/stringer" 2018-07-19 17:15:32 -07:00
Gyuho Lee
d509620793 etcdserver: rename to "heartbeat_send_failures_total"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-19 16:58:14 -07:00
Gyuho Lee
e43224c3b6 etcdserver: add "etcd_server_slow_apply_total"
{"level":"warn","ts":1527101858.6985068,"caller":"etcdserver/util.go:115","msg":"apply request took too long","took":0.114101529,"expected-duration":0.1,"prefix":"","request":"header:<ID:1029181977902852337> put:<key:\"\\000\\000...

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-19 16:52:37 -07:00
Gyuho Lee
4c7bf51030 etcdserver: add "etcd_server_heartbeat_failures_total"
{"level":"warn","ts":1527101858.4149103,"caller":"etcdserver/raft.go:370","msg":"failed to send out heartbeat; took too long, server is overloaded likely from slow disk","heartbeat-interval":0.1,"expected-duration":0.2,"exceeded-duration":0.025771662}
{"level":"warn","ts":1527101858.4149644,"caller":"etcdserver/raft.go:370","msg":"failed to send out heartbeat; took too long, server is overloaded likely from slow disk","heartbeat-interval":0.1,"expected-duration":0.2,"exceeded-duration":0.034015766}

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-19 16:51:08 -07:00
Gyuho Lee
72c51d3e12 etcdserver: add "etcd_server_quota_backend_bytes"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-03 13:26:49 -07:00
Gyuho Lee
4481238224 etcdserver: add "etcd_server_slow_read_indexes_total"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-03 13:00:08 -07:00
Gyuho Lee
82e670766a etcdserver: clarify read index warnings
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-03 12:53:21 -07:00
Joe Betz
b7c19232bc etcdserver: Fix txn request 'took too long' warnings to use loggable request stringer 2018-06-12 09:33:33 -07:00
Joe Betz
07f833ae3e etcdserver: Add response byte size and range response count to took too long warning 2018-06-11 11:26:26 -07:00
Joe Betz
ef154094b3 etcdserver: Replace value contents with value_size in request took too long warning 2018-06-08 09:49:43 -07:00
Gyuho Lee
870138accb etcdserver: log skipping initial election tick
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-04-23 10:59:01 -07:00
Gyuho Lee
b923c74fe5 etcdserver: add "InitialElectionTickAdvance"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-04-23 10:21:51 -07:00
Maciej Borsz
7cbc2f1068 etcdserver: add is_leader prometheus metric that is 1 on the leader.
Before this change, we had now way to find a leader using /metrics
endpoint. This commit adds a metric to do that.
2018-04-19 14:59:31 -07:00
disksing
095fc0b411 etcdserver/stats: make all fields guarded by mutex. 2018-04-11 19:49:00 -07:00
disksing
d40abbb502 etcdserver/stats: fix stats data race. 2018-04-11 19:49:00 -07:00
Gyuho Lee
cdbb8ffdc1 etcdserver: fix "lease_expired_total" metrics
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-04-10 17:57:35 -07:00
Gyuho Lee
3282d90707 etcdserver: adjust election ticks on restart
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-03-10 20:05:56 -08:00
Gyuho Lee
b2d5c6c7bd etcdserver: make "advanceTicks" method
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-03-10 20:05:50 -08:00
Gyuho Lee
6fe7316ec4 rafthttp: add "ActivePeers" to "Transport"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-03-10 20:05:35 -08:00
Iwasaki Yudai
eaa0050d4d *: enforce max lease TTL with 9,000,000,000 seconds
math.MaxInt64 / time.Second is 9,223,372,036. 9,000,000,000 is easier to
remember/document.
2018-03-08 10:34:12 -08:00
Gyuho Lee
bb8a5377ce api/v3election: error on missing "leader" field
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-03-02 10:40:34 -08:00
Gyuho Lee
a5b31087e8 etcdserver: enable "CheckQuorum" when starting with "ForceNewCluster"
We enable "raft.Config.CheckQuorum" by default in other
Raft initial starts. So should start with "ForceNewCluster".

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-03-02 10:40:08 -08:00
Xiang
9942f904fb etcdserver: improve request took too long warning 2018-02-06 16:58:04 -08:00
Gyuho Lee
50d2a00f01 etcdserver: clarify warnings on backend open taking >10 seconds
If db file is 10 GiB, it can take more than 1-second.

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-01-26 10:55:16 -08:00
Gyuho Lee
c5bba152ee etcdserver: add detailed errors in "ValidateClusterAndAssignIDs"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-01-25 12:00:21 -08:00
Gyuho Lee
f9b7fccf1b etcdserver: add error details on DNS resolution failure on advertise URLs
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-01-25 12:00:07 -08:00
Gyuho Lee
8a18cc96d0 etcdserver/api/v3rpc: debug-log client disconnect on TLS, http/2 stream CANCEL
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-01-19 12:50:20 -08:00
Gyuho Lee
02d362ccde etcdserver/api/etcdhttp: remove "errors" field in /health
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-01-17 15:22:54 -08:00
Jordan Liggitt
d292337d14 api/etcdhttp: change /health type back to string for backwards compatibility 2018-01-17 12:44:38 -08:00
Sahdev P. Zala
ec43197344 etcdserver/api/v3rpc: debug user cancellation and log warning for rest
The context error with cancel code is typically for user cancellation which
should be at debug level. For other error codes we should display a warning.

Fixes #9085
2018-01-08 10:14:37 -08:00
Gyuho Lee
325913d6fb etcdserver/api/v3rpc: set grpclog once
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-01-02 11:02:17 -08:00