10686 Commits

Author SHA1 Message Date
Joe Betz
992dbd4d1e version: bump up to 3.1.20 v3.1.20 2018-10-10 11:02:11 -07:00
Gyuho Lee
b39c0f9471 etcdserver: add "etcd_server_read_indexes_failed_total"
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-10-09 18:21:06 -07:00
Gyuho Lee
3381ef1602 rafthttp: probe all raft transports
This PR adds another probing routine to monitor the connection
for Raft message transports. Previously, we only monitored
snapshot transports.

In our production cluster, we found that one TCP connection had >8-sec
latencies to a remote peer, but the "etcd_network_peer_round_trip_time_seconds"
metric showed a <1-sec latency distribution. This means the etcd server
was not sampling enough when such latency spikes happened
outside of the snapshot pipeline connection.

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-10-09 18:16:08 -07:00
Gyuho Lee
c096dc2cc5 etcdserver: add "etcd_server_health_success/failures"
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-10-09 18:06:37 -07:00
Jingyi Hu
3e99b42612
Merge pull request #10163 from jingyih/automated-cherry-pick-of-#10153-origin-release-3.1
clientv3: automated cherry pick of #10153 to release 3.1
2018-10-08 18:46:55 -07:00
yura
cf7be488a9 clientv3: concurrency.Mutex.Lock() - preserve invariant
Convenient invariant:
- if werr == nil, then the lock is supposed to be held at that moment.

While we cannot be confident in a stronger invariant ('is exactly locked'),
it was inconvenient that the previous code could return `werr == nil` after
Mutex.Unlock.

This could happen when ctx is canceled or times out right after waitDeletes
successfully returns werr == nil and before `<-ctx.Done()` is checked.
While such a situation is very rare, it is still possible.

fixes #10111
2018-10-08 16:53:39 -07:00
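
The invariant described in cf7be488a9 can be illustrated with a small, self-contained sketch. It is not the clientv3 code: lockState, waitFunc, lockOld, lockNew, and simulate are hypothetical names, and the held flag only stands in for the lock key and for Mutex.Unlock. The sketch assumes the problematic pattern was "check ctx after the wait returns" and that the fix decides on the wait error instead.

```go
package main

import (
	"context"
	"fmt"
)

// lockState is a hypothetical stand-in for the lock key held in etcd.
type lockState struct{ held bool }

// waitFunc is a hypothetical stand-in for waitDeletes; it reports only an error.
type waitFunc func(ctx context.Context) error

// lockOld decides whether to release by looking at ctx after the wait returns.
// If ctx is canceled between the successful wait and the check, the lock is
// released although the returned error is nil.
func lockOld(ctx context.Context, s *lockState, wait waitFunc) error {
	s.held = true
	werr := wait(ctx)
	select {
	case <-ctx.Done():
		s.held = false // stands in for Mutex.Unlock
	default:
	}
	return werr
}

// lockNew releases only when the wait itself failed, so a nil error always
// means Lock has not released the lock.
func lockNew(ctx context.Context, s *lockState, wait waitFunc) error {
	s.held = true
	werr := wait(ctx)
	if werr != nil {
		s.held = false // stands in for Mutex.Unlock
	}
	return werr
}

// simulate runs lock with a wait that succeeds while cancellation lands at the
// same moment, which is the rare interleaving from the commit message.
func simulate(lock func(context.Context, *lockState, waitFunc) error) (bool, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	s := &lockState{}
	err := lock(ctx, s, func(context.Context) error {
		cancel() // ctx is canceled just as the wait succeeds
		return nil
	})
	return s.held, err
}

func main() {
	held, err := simulate(lockOld)
	fmt.Println("old pattern: held =", held, "err =", err)
	held, err = simulate(lockNew)
	fmt.Println("new pattern: held =", held, "err =", err)
}
```

With the old pattern the caller sees a nil error but no longer holds the lock; with the new pattern a nil error and a held lock go together.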
Gyuho Lee
65fff06adc
Merge pull request #10124 from jingyih/cherry-pick-of-#10109-origin-release-3.1
etcdctl: cherry pick of #10109 to release-3.1
2018-09-25 19:55:23 -07:00
Jingyi Hu
87b4e08c29 etcdctl: cherry pick of #10109 to release-3.1
Add snapshot file integrity verification when querying snapshot status.
2018-09-25 17:51:12 -07:00
Gyuho Lee
216be8b79b etcdserver: add "etcd_server_id"
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-08-29 14:49:01 -07:00
Gyuho Lee
dfcf82b6ff etcdserver: clarify read index wait timeout warnings
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-08-29 14:38:22 -07:00
Gyuho Lee
9197907515 rafthttp: clarify "became inactive" warning
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-08-29 14:33:17 -07:00
Gyuho Lee
14883cad78
Merge pull request #10043 from wenjiaswe/automated-cherry-pick-of-#9997-upstream-release-3.1
Automated cherry pick of #9997
2018-08-29 12:42:05 -07:00
Wenjia
4e7691ddcc
remove automatic added imports 2018-08-28 15:00:44 -07:00
Gyuho Lee
8a68ae95ec etcdserver/api/rafthttp: add v3 snapshot send/receive metrics
Distribution would be:
0.1 second or more
...
25.6 seconds or more
51.2 seconds or more

etcd_network_snapshot_send_success
etcd_network_snapshot_send_failures
etcd_network_snapshot_send_total_duration_seconds
etcd_network_snapshot_receive_success
etcd_network_snapshot_receive_failures
etcd_network_snapshot_receive_total_duration_seconds

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-08-28 14:34:21 -07:00
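
The distribution listed in 8a68ae95ec (0.1 ... 25.6, 51.2 seconds) is consistent with exponential buckets that start at 0.1 and double each step. The sketch below reproduces that series with the standard library only; the factor of 2 and the count of 10 are inferred from the listed values, not taken from the code, and exponentialBuckets and bucketIndex are hypothetical helpers. Bounds are treated as upper bounds here, as in Prometheus-style histograms.

```go
package main

import "fmt"

// exponentialBuckets returns count bounds, starting at start and multiplying
// by factor at each step.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	bound := start
	for i := 0; i < count; i++ {
		buckets = append(buckets, bound)
		bound *= factor
	}
	return buckets
}

// bucketIndex returns the index of the first bucket whose bound is at least
// seconds, or len(buckets) if the value exceeds every bound.
func bucketIndex(buckets []float64, seconds float64) int {
	for i, upper := range buckets {
		if seconds <= upper {
			return i
		}
	}
	return len(buckets)
}

func main() {
	// Assumed parameters: start 0.1s, factor 2, 10 buckets.
	buckets := exponentialBuckets(0.1, 2, 10)
	fmt.Println(buckets)
	// A 30-second snapshot transfer lands in the last (51.2s) bucket.
	fmt.Println(bucketIndex(buckets, 30))
}
```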
Gyuho Lee
ef1d332298 etcdserver/api/snap: add v3 snapshot fsync metrics
etcd_snap_db_fsync_duration_seconds_count
etcd_snap_db_save_total_duration_seconds_bucket

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-08-28 14:16:05 -07:00
Xiang Li
116c442615
Merge pull request #10034 from gyuho/init-metrics-3.1
etcdserver/api/v3rpc: display all registered gRPC metrics at start (v3.1)
2018-08-24 18:52:40 -07:00
Gyuho Lee
e07fb41140 etcdserver/api/v3rpc: display all registered gRPC metrics at start
Previously, only metrics that had been requested at least once were displayed.
Now all metrics are shown, as in v3.3 and v3.4+.

grpc_server_started_total{grpc_method="Alarm",grpc_service="etcdserverpb.Maintenance",grpc_type="unary"} 0
grpc_server_started_total{grpc_method="AuthDisable",grpc_service="etcdserverpb.Auth",grpc_type="unary"} 0
grpc_server_started_total{grpc_method="AuthEnable",grpc_service="etcdserverpb.Auth",grpc_type="unary"} 0
grpc_server_started_total{grpc_method="Authenticate",grpc_service="etcdserverpb.Auth",grpc_type="unary"} 0
grpc_server_started_total{grpc_method="Compact",grpc_service="etcdserverpb.KV",grpc_type="unary"} 0
grpc_server_started_total{grpc_method="Defragment",grpc_service="etcdserverpb.Maintenance",grpc_type="unary"} 0
grpc_server_started_total{grpc_method="DeleteRange",grpc_service="etcdserverpb.KV",grpc_type="unary"} 0

Should help document metrics.

Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-08-22 19:14:58 -07:00
Joe Betz
2c616b0c3d
Merge pull request #10030 from jingyih/cherry-pick-of-#9990-origin-release-3.1
etcdserver: cherry pick of #9990 to release 3.1
2018-08-20 15:33:27 -07:00
Jingyi Hu
dd2803c4a6 etcdserver: add grpc interceptor to log info on incoming request to etcdserver

To improve debuggability of etcd v3.1, add a gRPC interceptor to log
info on incoming requests to the etcd server. The log output includes remote
client info, request content (with the value field redacted), request
handling latency, response size, etc.

The dependency on the zap logger and grpc_middleware was removed during
backporting.

Added a check in the logging interceptor: if the debug level is disabled, skip
logUnaryRequestStats() to avoid potential performance degradation. (PR #10021)
2018-08-20 14:32:48 -07:00
Jingyi Hu
4855ca62b5 etcdserver: add grpc interceptor to log info on incoming request to etcdserver.
To improve debuggability of etcd v3, add a gRPC interceptor to log
info on incoming requests to the etcd server. The log output includes remote
client info, request content (with the value field redacted), request
handling latency, response size, etc.

The dependency on the zap logger and grpc_middleware was removed during
backporting.

Added a check in the logging interceptor: if the debug level is disabled, skip
logUnaryRequestStats() to avoid potential performance degradation. (PR #10021)
2018-08-20 13:54:24 -07:00
Joe Betz
bb205caa68 version: bump up to 3.1.19+git 2018-07-24 10:07:31 -07:00
Joe Betz
a1d6802da2 version: bump up to 3.1.19 v3.1.19 2018-07-24 10:04:37 -07:00
Gyuho Lee
79d80bd259 etcdserver: add "etcd_server_go_version" metric
Currently, one has to look at the server logs manually
to see which Go version was used to build the etcd server.

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-23 16:38:10 -07:00
Gyuho Lee
081519c323 clientv3: fix keepalive send interval when response queue is full
The client should update the next keepalive send time
even when the lease keepalive response queue becomes full.

Otherwise, the client sends a keepalive request every 500ms
regardless of TTL, when sends are only expected to happen
at intervals of at least TTL / 3.

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-23 08:50:07 -07:00
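
The behavior described in 081519c323 can be sketched as follows. This is a simplified illustration, not the clientv3 lease code: keepAlive, newKeepAlive, and onResponse are hypothetical names, and the response is reduced to a TTL in seconds. The point it shows is that the next send time advances by TTL / 3 whether or not the response could be queued.

```go
package main

import (
	"fmt"
	"time"
)

// keepAlive is a hypothetical, simplified stand-in for per-lease client state.
type keepAlive struct {
	responses     chan int64 // buffered queue of TTLs (seconds) for the caller
	nextKeepAlive time.Time  // earliest time the next keepalive should be sent
}

func newKeepAlive(queueSize int) *keepAlive {
	return &keepAlive{responses: make(chan int64, queueSize)}
}

// onResponse handles a keepalive response received at time now. It reports
// whether the response was queued. The next send time is advanced in both
// cases, so a full queue does not cause sends faster than TTL / 3.
func (ka *keepAlive) onResponse(ttlSeconds int64, now time.Time) bool {
	next := now.Add(time.Duration(ttlSeconds) * time.Second / 3)
	delivered := false
	select {
	case ka.responses <- ttlSeconds:
		delivered = true
	default:
		// Queue is full: the response is dropped for this receiver.
	}
	ka.nextKeepAlive = next // advance even when the response was dropped
	return delivered
}

func main() {
	ka := newKeepAlive(1)
	now := time.Now()
	fmt.Println("queued:", ka.onResponse(30, now)) // fills the queue
	fmt.Println("queued:", ka.onResponse(30, now)) // queue is full
	fmt.Println("next send in:", ka.nextKeepAlive.Sub(now))
}
```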
Gyuho Lee
e0d5a028d5
Merge pull request #9944 from wenjiaswe/automated-cherry-pick-of-#9761-upstream-release-3.1
Automated cherry pick of #9761
2018-07-20 14:51:20 -07:00
Wenjia
a421a604d6
remove hashRevDurations 2018-07-20 13:49:58 -07:00
Gyuho Lee
0fbf49df11 etcdserver: rename to "heartbeat_send_failures_total"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-20 11:40:37 -07:00
Gyuho Lee
fb5080b306 mvcc: add "etcd_mvcc_hash_(rev)_duration_seconds"
etcd_mvcc_hash_duration_seconds
etcd_mvcc_hash_rev_duration_seconds

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-20 11:37:06 -07:00
Gyuho Lee
cac6ce756d mvcc/backend: fix defrag duration scale
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-20 10:53:26 -07:00
Gyuho Lee
9f58e57a3c mvcc/backend: add "etcd_disk_backend_defrag_duration_seconds"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-20 10:53:26 -07:00
Gyuho Lee
22c25dd4e7 mvcc/backend: document metrics ExponentialBuckets
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-20 10:44:52 -07:00
Gyuho Lee
92a7b5df80 mvcc/backend: clean up mutex, logging
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-20 10:35:39 -07:00
Gyuho Lee
3f1fe618ad etcdserver: add "etcd_server_slow_apply_total"
{"level":"warn","ts":1527101858.6985068,"caller":"etcdserver/util.go:115","msg":"apply request took too long","took":0.114101529,"expected-duration":0.1,"prefix":"","request":"header:<ID:1029181977902852337> put:<key:\"\\000\\000...

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-20 10:25:16 -07:00
Gyuho Lee
b8547734ae etcdserver: add "etcd_server_heartbeat_failures_total"
{"level":"warn","ts":1527101858.4149103,"caller":"etcdserver/raft.go:370","msg":"failed to send out heartbeat; took too long, server is overloaded likely from slow disk","heartbeat-interval":0.1,"expected-duration":0.2,"exceeded-duration":0.025771662}
{"level":"warn","ts":1527101858.4149644,"caller":"etcdserver/raft.go:370","msg":"failed to send out heartbeat; took too long, server is overloaded likely from slow disk","heartbeat-interval":0.1,"expected-duration":0.2,"exceeded-duration":0.034015766}

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-20 10:24:40 -07:00
Gyuho Lee
78a13e67a0 mvcc/backend: avoid unnecessary metrics update
https://github.com/coreos/etcd/pull/9300

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-03 14:53:20 -07:00
Gyuho Lee
84d11a51c1 mvcc: use "t.tx.DB()" to fetch DB
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-03 14:34:20 -07:00
Gyuho Lee
a9c4b98756 mvcc: add "etcd_mvcc_db_total_size_in_use_in_bytes"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-03 14:21:11 -07:00
Gyuho Lee
5531e3b0f5 mvcc: add "etcd_mvcc_db_total_size_in_bytes"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-03 13:51:06 -07:00
Gyuho Lee
c2623bb840 etcdserver: add "etcd_server_quota_backend_bytes"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-03 13:30:10 -07:00
Gyuho Lee
f46b4677c0 etcdserver: add "etcd_server_slow_read_indexes_total"
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-03 12:58:29 -07:00
Gyuho Lee
09843d5d90 etcdserver: clarify read index warnings
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-07-03 12:55:31 -07:00
Gyuho Lee
be3e6f6ed5 tests: update test scripts
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-06-18 14:15:52 -07:00
Joe Betz
d84dd18637 version: bump up to 3.1.18+git 2018-06-15 09:51:30 -07:00
Joe Betz
b7ff47f9d5 version: bump up to 3.1.18 v3.1.18 2018-06-15 09:47:04 -07:00
Gyuho Lee
fab24fbdab
Merge pull request #9848 from wenjiaswe/automated-cherry-pick-of-#8960-upstream-release-3.1
Automated cherry pick of #8960
2018-06-13 16:49:48 -07:00
Joe Betz
b3ee996629 metrics: Add server_version metric 2018-06-13 16:31:18 -07:00
Gyuho Lee
06da6cf983 tests/semaphore.test.bash: update
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-06-13 14:42:45 -07:00
Gyuho Lee
9c00100550 Makefile: update
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-06-13 14:42:10 -07:00
Gyuho Lee
1d7a2ca520
Merge pull request #9838 from jpbetz/automated-cherry-pick-of-#9821-origin-release-3.1-1528833932
etcdserver: Automated cherry pick of detailed "took too long" warnings to release-3.1
2018-06-12 13:54:40 -07:00
Joe Betz
e90934ec71 etcdserver: Fix txn request 'took too long' warnings to use loggable request stringer 2018-06-12 13:22:45 -07:00