3.4 KiB
Metrics
NOTE: The metrics feature is considered as an experimental. We might add/change/remove metrics without warning in the future releases.
etcd uses Prometheus for metrics reporting in the server. The metrics can be used for real-time monitoring and debugging.
The simplest way to see the available metrics is to cURL the metrics endpoint /metrics
of etcd. The format is described here.
You can also follow the doc here to start a Promethus server and monitor etcd metrics.
The naming of metrics follows the suggested best practice of Promethus. A metric name has an etcd
prefix as its namespace and a subsystem prefix (for example wal
and etcdserver
).
etcd now exposes the following metrics:
etcdserver
Name | Description | Type |
---|---|---|
file_descriptors_used_total | The total number of file descriptors used | Gauge |
proposal_durations_milliseconds | The latency distributions of committing proposal | Summary |
pending_proposal_total | The total number of pending proposals | Gauge |
proposal_failed_total | The total number of failed proposals | Counter |
High file descriptors (file_descriptors_used_total
) usage (near the file descriptors limitation of the process) indicates a potential out of file descriptors issue. That might cause etcd fails to create new WAL files and panics.
Proposal durations (proposal_durations_milliseconds
) give you an summary about the proposal commit latency. Latency can be introduced into this process by network and disk IO.
Pending proposal (pending_proposal_total
) gives you an idea about how many proposal are in the queue and waiting for commit. An increasing pending number indicates a high client load or an unstable cluster.
Failed proposals (proposal_failed_total
) are normally related to two issues: temporary failures related to a leader election or longer duration downtime caused by a loss of quorum in the cluster.
wal
Name | Description | Type |
---|---|---|
fsync_durations_microseconds | The latency distributions of fsync called by wal | Summary |
last_index_saved | The index of the last entry saved by wal | Gauge |
Abnormally high fsync duration (fsync_durations_microseconds
) indicates disk issues and might cause the cluster to be unstable.
snapshot
Name | Description | Type |
---|---|---|
snapshot_save_total_durations_microseconds | The total latency distributions of save called by snapshot | Summary |
Abnormally high snapshot duration (snapshot_save_total_durations_microseconds
) indicates disk issues and might cause the cluster to be unstable.