etcd/metrics.md at d0f6432b51e37c402450182ce01203dca8a40108

mirror of https://github.com/etcd-io/etcd.git synced 2024-09-27 06:25:44 +00:00

Xiang Li 5c1d4544fc doc: add doc for metrics feature

2015-06-16 14:18:22 -07:00

Metrics

NOTE: The metrics feature is considered experimental. We might add, change, or remove metrics without warning in future releases.

etcd uses Prometheus for metrics reporting in the server. The metrics can be used for real-time monitoring and debugging.

The simplest way to see the available metrics is to cURL the metrics endpoint /metrics of etcd. The format is described here.
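As a minimal sketch of reading that endpoint programmatically, the Python below fetches the text and parses simple sample lines into a dictionary. The address `http://127.0.0.1:2379` is an assumption (substitute the client URL your member listens on), the function names are hypothetical, and the parser handles only basic `name{labels} value` lines, not the full exposition format.

```python
import urllib.request


def fetch_metrics(base_url="http://127.0.0.1:2379"):
    """Return the raw text served at /metrics (base_url is an assumed address)."""
    with urllib.request.urlopen(base_url + "/metrics", timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text):
    """Parse simple sample lines into {name_with_labels: float}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    Optional trailing timestamps are not handled by this sketch.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field on the line.
        key, _, value = line.rpartition(" ")
        try:
            samples[key.strip()] = float(value)
        except ValueError:
            # Skip anything that does not end in a number.
            continue
    return samples
```

With a running member, `parse_metrics(fetch_metrics())` would give a dictionary keyed by metric name (including any label set).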

You can also follow the doc here to start a Prometheus server and monitor etcd metrics.

The naming of metrics follows the suggested best practice of Prometheus. A metric name has an etcd prefix as its namespace and a subsystem prefix (for example wal and etcdserver).
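To illustrate the naming scheme, here is a small sketch that joins the namespace, subsystem, and short name with underscores, which is how Prometheus client libraries build fully qualified names. The helper `full_metric_name` is hypothetical, and the exact subsystem strings etcd exposes are best confirmed against the actual /metrics output.

```python
def full_metric_name(subsystem, name, namespace="etcd"):
    """Join namespace, subsystem, and name with underscores, skipping empty parts."""
    parts = [namespace, subsystem, name]
    return "_".join(p for p in parts if p)


# Short name taken from the wal table in this document.
print(full_metric_name("wal", "fsync_durations_microseconds"))
# prints: etcd_wal_fsync_durations_microseconds
```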

etcd now exposes the following metrics:

etcdserver

Name	Description	Type
file_descriptors_used_total	The total number of file descriptors used	Gauge
proposal_durations_milliseconds	The latency distributions of committing proposal	Summary
pending_proposal_total	The total number of pending proposals	Gauge
proposal_failed_total	The total number of failed proposals	Counter

High file descriptor usage (file_descriptors_used_total) near the process's file descriptor limit indicates that etcd may run out of file descriptors. When that happens, etcd might fail to create new WAL files and panic.
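One way to act on this gauge is to compare it with the descriptor limit of the etcd process. The sketch below assumes you obtain that limit yourself (for example from the process's ulimit settings), since it is not among the metrics listed here; the 0.8 threshold and the function names are illustrative assumptions, not etcd defaults.

```python
def fd_usage_ratio(used, limit):
    """Return the fraction of the file descriptor limit currently in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return used / limit


def fd_near_limit(used, limit, threshold=0.8):
    """True when usage is at or above the (assumed) warning threshold."""
    return fd_usage_ratio(used, limit) >= threshold
```

For example, `fd_near_limit(900, 1024)` is true because 900 of 1024 descriptors is roughly 88% of the limit.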

Proposal durations (proposal_durations_milliseconds) give you a summary of the proposal commit latency. Latency can be introduced into this process by network and disk IO.

Pending proposals (pending_proposal_total) give you an idea of how many proposals are in the queue waiting to be committed. An increasing number of pending proposals indicates a high client load or an unstable cluster.
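A rough way to spot an "increasing pending number" is to look at several consecutive readings of the gauge. This is a minimal sketch: the function name and the choice of requiring at least three strictly rising samples are assumptions, and a real monitoring setup would more likely express this as a Prometheus query or alert rule.

```python
def is_steadily_increasing(samples, min_points=3):
    """True if there are at least min_points readings and each exceeds the previous one."""
    if len(samples) < min_points:
        return False
    return all(b > a for a, b in zip(samples, samples[1:]))
```

Here `samples` is a list of pending_proposal_total readings in the order they were scraped.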

Failed proposals (proposal_failed_total) are normally related to two issues: temporary failures related to a leader election, or longer downtime caused by a loss of quorum in the cluster.
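Because proposal_failed_total is a counter, its absolute value matters less than how fast it grows. The sketch below computes a per-second rate from two scrapes; the function name is hypothetical, and treating a decrease as a counter reset (for example after a restart) is an assumption modeled on how Prometheus handles counters.

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Per-second increase of a counter between two scrapes."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if curr_value < prev_value:
        # Assume the counter was reset to zero between the scrapes.
        return curr_value / interval_seconds
    return (curr_value - prev_value) / interval_seconds
```

A short burst of failures around a leader election would show up as a brief nonzero rate, while a loss of quorum would keep the rate elevated for as long as the outage lasts.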

wal

Name	Description	Type
fsync_durations_microseconds	The latency distributions of fsync called by wal	Summary
last_index_saved	The index of the last entry saved by wal	Gauge

Abnormally high fsync duration (fsync_durations_microseconds) indicates disk issues and might cause the cluster to be unstable.

snapshot

Name	Description	Type
snapshot_save_total_durations_microseconds	The total latency distributions of save called by snapshot	Summary

Abnormally high snapshot duration (snapshot_save_total_durations_microseconds) indicates disk issues and might cause the cluster to be unstable.
