From 47d5ae4971ccff2528d552e14243ad1ce5d25b38 Mon Sep 17 00:00:00 2001
From: Anthony Romano <anthony.romano@coreos.com>
Date: Fri, 18 Aug 2017 17:57:55 -0700
Subject: [PATCH] op-guide: add /debug details

Fixes #8418
---
 Documentation/op-guide/monitoring.md | 50 +++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/Documentation/op-guide/monitoring.md b/Documentation/op-guide/monitoring.md
index cbd4356d4..fc3dd0e56 100644
--- a/Documentation/op-guide/monitoring.md
+++ b/Documentation/op-guide/monitoring.md
@@ -1,6 +1,49 @@
 # Monitoring etcd
 
-Each etcd server exports metrics under the `/metrics` path on its client port.
+Each etcd server provides local monitoring information on its client port through HTTP endpoints. The monitoring data is useful for both system health checking and cluster debugging.
+
+## Debug endpoint
+
+If `--debug` is set, the etcd server exports debugging information on its client port under the `/debug` path. Take care when setting `--debug`, since it degrades performance and makes logging verbose.
+
+The `/debug/pprof` endpoint is the standard Go runtime profiling endpoint. It can be used to profile CPU, heap, mutex, and goroutine utilization. For example, here `go tool pprof` gets the top 10 functions where etcd spends its time:
+
+```sh
+$ go tool pprof http://localhost:2379/debug/pprof/profile
+Fetching profile from http://localhost:2379/debug/pprof/profile
+Please wait... (30s)
+Saved profile in /home/etcd/pprof/pprof.etcd.localhost:2379.samples.cpu.001.pb.gz
+Entering interactive mode (type "help" for commands)
+(pprof) top10
+310ms of 480ms total (64.58%)
+Showing top 10 nodes out of 157 (cum >= 10ms)
+    flat  flat%   sum%        cum   cum%
+   130ms 27.08% 27.08%      130ms 27.08%  runtime.futex
+    70ms 14.58% 41.67%       70ms 14.58%  syscall.Syscall
+    20ms  4.17% 45.83%       20ms  4.17%  github.com/coreos/etcd/cmd/vendor/golang.org/x/net/http2/hpack.huffmanDecode
+    20ms  4.17% 50.00%       30ms  6.25%  runtime.pcvalue
+    20ms  4.17% 54.17%       50ms 10.42%  runtime.schedule
+    10ms  2.08% 56.25%       10ms  2.08%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).AuthInfoFromCtx
+    10ms  2.08% 58.33%       10ms  2.08%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.(*EtcdServer).Lead
+    10ms  2.08% 60.42%       10ms  2.08%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/wait.(*timeList).Trigger
+    10ms  2.08% 62.50%       10ms  2.08%  github.com/coreos/etcd/cmd/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).hashLabelValues
+    10ms  2.08% 64.58%       10ms  2.08%  github.com/coreos/etcd/cmd/vendor/golang.org/x/net/http2.(*Framer).WriteHeaders
+```
+
+The `/debug/requests` endpoint serves gRPC traces and performance statistics that can be viewed in a web browser. For example, here is a `Range` request for the key `abc`:
+
+```
+When	Elapsed (s)
+2017/08/18 17:34:51.999317 	0.000244 	/etcdserverpb.KV/Range
+17:34:51.999382 	 .    65 	... RPC: from 127.0.0.1:47204 deadline:4.999377747s
+17:34:51.999395 	 .    13 	... recv: key:"abc"
+17:34:51.999499 	 .   104 	... OK
+17:34:51.999535 	 .    36 	... sent: header:<cluster_id:14841639068965178418 member_id:10276657743932975437 revision:15 raft_term:17 > kvs:<key:"abc" create_revision:6 mod_revision:14 version:9 value:"asda" > count:1 
+```
+
+## Metrics endpoint
+
+Each etcd server exports metrics under the `/metrics` path on its client port and optionally on interfaces given by `--listen-metrics-urls`.
 
 The metrics can be fetched with `curl`:
 
@@ -16,7 +59,6 @@ etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 406464
 ...
 ```
 
-
 ## Prometheus
 
 Running a [Prometheus][prometheus] monitoring service is the easiest way to ingest and record etcd's metrics.
@@ -56,13 +98,13 @@ nohup /tmp/prometheus \
 Now Prometheus will scrape etcd metrics every 10 seconds.
 
 
-## Alerting
+### Alerting
 
 There is a [set of default alerts for etcd v3 clusters](./etcd3_alert.rules).
 
 > Note: `job` labels may need to be adjusted to fit a particular need. The rules were written to apply to a single cluster so it is recommended to choose labels unique to a cluster.
 
-## Grafana
+### Grafana
 
 [Grafana][grafana] has built-in Prometheus support; just add a Prometheus data source: