Consul server metrics

Transaction timing

| Metric Name | Description |
| --- | --- |
| consul.kvs.apply | This measures the time it takes to complete an update to the KV store. |
| consul.txn.apply | This measures the time spent applying a transaction operation. |
| consul.raft.apply | This counts the number of Raft transactions occurring over the interval. |
| consul.raft.commitTime | This measures the time it takes to commit a new entry to the Raft log on the leader. |

Why they're important: Taken together, these metrics indicate how long it takes to complete write operations in various parts of the Consul cluster. Generally these should all be fairly consistent and no more than a few milliseconds. Sudden changes in any of the timing values could be due to unexpected load on the Consul servers, or due to problems on the servers themselves.

What to look for: Deviations (in any of these metrics) of more than 50% from baseline over the previous hour.
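
As a rough illustration of this check, the sketch below (Python, with hypothetical sample data) compares the most recent value of a timing metric such as consul.kvs.apply against its mean over the previous hour and flags deviations of more than 50%. How you actually fetch the samples depends on your metrics pipeline.

```python
from statistics import mean

def deviates_from_baseline(samples_last_hour, latest, threshold=0.5):
    """Return True if `latest` deviates by more than `threshold` (50% by
    default) from the mean of the previous hour's samples."""
    if not samples_last_hour:
        return False  # no baseline to compare against yet
    baseline = mean(samples_last_hour)
    if baseline == 0:
        return latest != 0
    return abs(latest - baseline) / baseline > threshold

# Hypothetical consul.kvs.apply timings (ms) from the last hour, plus a new sample.
history = [2.1, 1.9, 2.4, 2.0, 2.2]
print(deviates_from_baseline(history, 4.8))  # True -- worth alerting on
```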

Leadership changes

| Metric Name | Description |
| --- | --- |
| consul.raft.leader.lastContact | Measures the time since the leader was last able to contact the follower nodes when checking its leader lease. |
| consul.raft.state.candidate | This increments whenever a Consul server starts an election. |
| consul.raft.state.leader | This increments whenever a Consul server becomes a leader. |

Why they're important: Normally, your Consul cluster should have a stable leader. If there are frequent elections or leadership changes, it would likely indicate network issues between the Consul servers, or that the Consul servers themselves are unable to keep up with the load.

What to look for: If candidate or leader is greater than 0, or if lastContact exceeds 200ms.
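
One way to watch these values outside a full metrics pipeline is to scrape the agent's telemetry endpoint directly. The sketch below assumes a local agent on 127.0.0.1:8500 and that the JSON layout (Counters/Samples lists with Name, Count, and Max fields) matches what recent Consul versions return from /v1/agent/metrics; depending on your telemetry settings (e.g. disable_hostname), metric names may also carry a hostname segment, so treat the lookups as a starting point rather than a drop-in check.

```python
import json
from urllib.request import urlopen

CONSUL_ADDR = "http://127.0.0.1:8500"  # assumed local agent address

def raft_leadership_snapshot():
    """Fetch one scrape of agent telemetry and extract the Raft leadership metrics."""
    with urlopen(f"{CONSUL_ADDR}/v1/agent/metrics") as resp:
        data = json.load(resp)

    # Election/leadership events show up as counters; any non-zero count is notable.
    counters = {c["Name"]: c.get("Count", 0) for c in data.get("Counters", [])}
    # lastContact is a timer (reported under Samples); use its max for the interval.
    samples = {s["Name"]: s.get("Max", 0) for s in data.get("Samples", [])}

    return {
        "candidate": counters.get("consul.raft.state.candidate", 0),
        "leader": counters.get("consul.raft.state.leader", 0),
        "last_contact_ms": samples.get("consul.raft.leader.lastContact", 0),
    }

snap = raft_leadership_snapshot()
if snap["candidate"] > 0 or snap["leader"] > 0 or snap["last_contact_ms"] > 200:
    print("Leadership instability:", snap)
```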

Autopilot

| Metric Name | Description |
| --- | --- |
| consul.autopilot.healthy | This tracks the overall health of the local server cluster. If all servers are considered healthy by Autopilot, this will be set to 1. If any are unhealthy, this will be 0. |

Why it's important: Obviously, you want your cluster to be healthy.

What to look for: Alert if healthy is 0.
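
Besides watching the metric, Autopilot health can also be polled from the operator API. A minimal sketch, assuming a local agent on 127.0.0.1:8500:

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen

def autopilot_healthy(consul_addr="http://127.0.0.1:8500"):
    """Return True if Autopilot considers every server healthy; this mirrors
    alerting when consul.autopilot.healthy drops to 0."""
    try:
        with urlopen(f"{consul_addr}/v1/operator/autopilot/health") as resp:
            return json.load(resp).get("Healthy", False)
    except HTTPError:
        # Consul returns an error status code when the cluster is unhealthy.
        return False

if not autopilot_healthy():
    print("ALERT: Autopilot reports one or more unhealthy Consul servers")
```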

Memory usage

| Metric Name | Description |
| --- | --- |
| consul.runtime.alloc_bytes | This measures the number of bytes allocated by the Consul process. |
| consul.runtime.sys_bytes | This is the total number of bytes of memory obtained from the OS. |
| mem.total | Total amount of physical memory (RAM) available on the server. |
| mem.used_percent | Percentage of physical memory in use. |
| swap.used_percent | Percentage of swap space in use. |

Why they're important: Consul keeps all of its data in memory. If Consul consumes all available memory, it will crash. You should also monitor total available RAM to make sure some RAM is available for other processes, and swap usage should remain at 0% for best performance.

What to look for: If consul.runtime.sys_bytes exceeds 90% of mem.total, if mem.used_percent is over 90%, or if swap.used_percent is greater than 0.
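
Expressed as code, these thresholds amount to a few comparisons. The sketch below takes the latest values from whatever pipeline you use; the argument names mirror the metrics, but the function itself is hypothetical.

```python
def memory_alerts(sys_bytes, mem_total, mem_used_percent, swap_used_percent):
    """Evaluate the memory thresholds above. The arguments mirror
    consul.runtime.sys_bytes, mem.total, mem.used_percent and swap.used_percent."""
    alerts = []
    if sys_bytes > 0.9 * mem_total:
        alerts.append("Consul is using more than 90% of physical memory")
    if mem_used_percent > 90:
        alerts.append("host memory usage is above 90%")
    if swap_used_percent > 0:
        alerts.append("swap space is in use")
    return alerts

# Hypothetical sample: an 8 GB host where the Consul process has grown to 7.5 GB.
print(memory_alerts(sys_bytes=7.5e9, mem_total=8e9,
                    mem_used_percent=94, swap_used_percent=0))
```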

Garbage collection

| Metric Name | Description |
| --- | --- |
| consul.runtime.total_gc_pause_ns | Number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Consul started. |

Why it's important: As mentioned above, GC pause is a "stop-the-world" event, meaning that all runtime threads are blocked until GC completes. Normally these pauses are very brief. But if memory usage is high, the Go runtime may GC so frequently that it starts to slow down Consul.

What to look for: Warning if total_gc_pause_ns exceeds 2 seconds/minute, critical if it exceeds 5 seconds/minute.

NOTE: total_gc_pause_ns is a cumulative counter, so in order to calculate rates (such as GC/minute), you will need to apply a function such as non_negative_difference.
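
If your tooling does not ship a non_negative_difference function, the transformation is straightforward to reproduce. The sketch below differences cumulative total_gc_pause_ns samples taken one minute apart (dropping negative deltas caused by restarts) and applies the warning/critical thresholds above; the sample values are made up.

```python
def non_negative_difference(samples):
    """Difference consecutive samples of a cumulative counter, dropping
    negative deltas (which appear when the counter resets, e.g. after a
    Consul restart)."""
    diffs = []
    for prev, cur in zip(samples, samples[1:]):
        delta = cur - prev
        if delta >= 0:
            diffs.append(delta)
    return diffs

# Hypothetical total_gc_pause_ns sampled once per minute (cumulative nanoseconds).
gc_pause_ns = [1.2e9, 1.9e9, 4.5e9, 0.3e9]  # the last sample follows a restart
for pause_ns in non_negative_difference(gc_pause_ns):
    seconds = pause_ns / 1e9
    if seconds > 5:
        print(f"CRITICAL: {seconds:.1f}s of GC pause in the last minute")
    elif seconds > 2:
        print(f"WARNING: {seconds:.1f}s of GC pause in the last minute")
```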

File descriptors

| Metric Name | Description |
| --- | --- |
| linux_sysctl_fs.file-nr | Number of file handles being used across all processes on the host. |
| linux_sysctl_fs.file-max | Total number of available file handles. |

Why it's important: Practically anything Consul does -- receiving a connection from another host, sending data between servers, writing snapshots to disk -- requires a file descriptor handle. If Consul runs out of handles, it will stop accepting connections. See the Consul FAQ for more details.

By default, process and kernel limits are fairly conservative. You will want to increase these beyond the defaults.

What to look for: If file-nr exceeds 80% of file-max.
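
These metrics typically come from Telegraf's linux_sysctl_fs input, but the underlying numbers are just kernel counters. A minimal standalone check (Linux only) might read them directly:

```python
def file_handle_usage():
    """Read the kernel's file-handle counters directly.
    /proc/sys/fs/file-nr holds three numbers: allocated, free, and maximum;
    /proc/sys/fs/file-max holds the maximum on its own."""
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _free, _maximum = (int(x) for x in f.read().split())
    with open("/proc/sys/fs/file-max") as f:
        file_max = int(f.read())
    return allocated, file_max

used, limit = file_handle_usage()
if used > 0.8 * limit:
    print(f"ALERT: {used} of {limit} file handles in use (over 80%)")
```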

CPU usage

| Metric Name | Description |
| --- | --- |
| cpu.user_cpu | Percentage of CPU being used by user processes (such as Vault or Consul). |
| cpu.iowait_cpu | Percentage of CPU time spent waiting for I/O tasks to complete. |

Why they're important: Consul is not particularly demanding of CPU time, but a spike in CPU usage might indicate too many operations taking place at once. High iowait_cpu is especially important: it means Consul is waiting for data to be written to disk, a sign that Raft might be writing snapshots to disk too often.

What to look for: cpu.iowait_cpu greater than 10%.

Network activity

| Metric Name | Description |
| --- | --- |
| net.bytes_recv | Bytes received on each network interface. |
| net.bytes_sent | Bytes transmitted on each network interface. |

Why they're important: A sudden spike in network traffic to Consul might be the result of a misconfigured Vault client causing too many requests.

What to look for: Sudden large changes to the net metrics (greater than 50% deviation from baseline).

NOTE: The net metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as non_negative_difference.

Disk activity

| Metric Name | Description |
| --- | --- |
| diskio.read_bytes | Bytes read from each block device. |
| diskio.write_bytes | Bytes written to each block device. |

Why they're important: Since Consul keeps everything in memory, there normally isn't much disk activity. If the Consul host is writing a lot of data to disk, it probably means that Consul is under heavy write load, and consequently is checkpointing Raft snapshots to disk frequently. It could also mean that debug/trace logging has accidentally been enabled in production, which can impact performance. Too much disk I/O can cause the rest of the system to slow down or become unavailable, as the kernel spends all its time waiting for I/O to complete.

What to look for: Sudden large changes to the diskio metrics (greater than 50% deviation from baseline, or more than 3 standard deviations from baseline).

NOTE: The diskio metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as non_negative_difference.