From 47d5ae4971ccff2528d552e14243ad1ce5d25b38 Mon Sep 17 00:00:00 2001
From: Anthony Romano
Date: Fri, 18 Aug 2017 17:57:55 -0700
Subject: [PATCH] op-guide: add /debug details

Fixes #8418
---
 Documentation/op-guide/monitoring.md | 50 +++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/Documentation/op-guide/monitoring.md b/Documentation/op-guide/monitoring.md
index cbd4356d4..fc3dd0e56 100644
--- a/Documentation/op-guide/monitoring.md
+++ b/Documentation/op-guide/monitoring.md
@@ -1,6 +1,49 @@
 # Monitoring etcd
 
-Each etcd server exports metrics under the `/metrics` path on its client port.
+Each etcd server provides local monitoring information on its client port through http endpoints. The monitoring data is useful for both system health checking and cluster debugging.
+
+## Debug endpoint
+
+If `--debug` is set, the etcd server exports debugging information on its client port under the `/debug` path. Take care when setting `--debug`, since there will be degraded performance and verbose logging.
+
+The `/debug/pprof` endpoint is the standard go runtime profiling endpoint. This can be used to profile CPU, heap, mutex, and goroutine utilization. For example, here `go tool pprof` gets the top 10 functions where etcd spends its time:
+
+```sh
+$ go tool pprof http://localhost:2379/debug/pprof/profile
+Fetching profile from http://localhost:2379/debug/pprof/profile
+Please wait... (30s)
+Saved profile in /home/etcd/pprof/pprof.etcd.localhost:2379.samples.cpu.001.pb.gz
+Entering interactive mode (type "help" for commands)
+(pprof) top10
+310ms of 480ms total (64.58%)
+Showing top 10 nodes out of 157 (cum >= 10ms)
+    flat  flat%   sum%        cum   cum%
+   130ms 27.08% 27.08%      130ms 27.08%  runtime.futex
+    70ms 14.58% 41.67%       70ms 14.58%  syscall.Syscall
+    20ms  4.17% 45.83%       20ms  4.17%  github.com/coreos/etcd/cmd/vendor/golang.org/x/net/http2/hpack.huffmanDecode
+    20ms  4.17% 50.00%       30ms  6.25%  runtime.pcvalue
+    20ms  4.17% 54.17%       50ms 10.42%  runtime.schedule
+    10ms  2.08% 56.25%       10ms  2.08%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).AuthInfoFromCtx
+    10ms  2.08% 58.33%       10ms  2.08%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).Lead
+    10ms  2.08% 60.42%       10ms  2.08%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/wait.(*timeList).Trigger
+    10ms  2.08% 62.50%       10ms  2.08%  github.com/coreos/etcd/cmd/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).hashLabelValues
+    10ms  2.08% 64.58%       10ms  2.08%  github.com/coreos/etcd/cmd/vendor/golang.org/x/net/http2.(*Framer).WriteHeaders
+```
+
+The `/debug/requests` endpoint gives gRPC traces and performance statistics through a web browser. For example, here is a `Range` request for the key `abc`:
+
+```
+When                        Elapsed (s)
+2017/08/18 17:34:51.999317  0.000244    /etcdserverpb.KV/Range
+17:34:51.999382             .    65     ... RPC: from 127.0.0.1:47204 deadline:4.999377747s
+17:34:51.999395             .    13     ... recv: key:"abc"
+17:34:51.999499             .   104     ... OK
+17:34:51.999535             .    36     ... sent: header: kvs: count:1
+```
+
+## Metrics endpoint
+
+Each etcd server exports metrics under the `/metrics` path on its client port and optionally on interfaces given by `--listen-metrics-urls`.
 
 The metrics can be fetched with `curl`:
 
@@ -16,7 +59,6 @@ etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 406464
 ...
 ```
 
-
 ## Prometheus
 
 Running a [Prometheus][prometheus] monitoring service is the easiest way to ingest and record etcd's metrics.
@@ -56,13 +98,13 @@ nohup /tmp/prometheus \
 
 Now Prometheus will scrape etcd metrics every 10 seconds.
 
-## Alerting
+### Alerting
 
 There is a [set of default alerts for etcd v3 clusters](./etcd3_alert.rules).
 
 > Note: `job` labels may need to be adjusted to fit a particular need. The rules were written to apply to a single cluster so it is recommended to choose labels unique to a cluster.
 
-## Grafana
+### Grafana
 
 [Grafana][grafana] has built-in Prometheus support; just add a Prometheus data source: