Consul server metrics

Transaction timing

| Metric Name | Description |
| --- | --- |
| consul.kvs.apply | This measures the time it takes to complete an update to the KV store. |
| consul.txn.apply | This measures the time spent applying a transaction operation. |
| consul.raft.apply | This counts the number of Raft transactions occurring over the interval. |
| consul.raft.commitTime | This measures the time it takes to commit a new entry to the Raft log on the leader. |

Why they're important: Taken together, these metrics indicate how long it takes to complete write operations in various parts of the Consul cluster. Generally these should all be fairly consistent and no more than a few milliseconds. Sudden changes in any of the timing values could be due to unexpected load on the Consul servers, or due to problems on the servers themselves.

What to look for: Deviations (in any of these metrics) of more than 50% from baseline over the previous hour.
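
As a rough illustration of this check, the sketch below (Python, with hypothetical sample data) compares the most recent value of a timing metric such as consul.kvs.apply against its mean over the previous hour and flags deviations of more than 50%. How you actually fetch the samples depends on your metrics pipeline.

```python
from statistics import mean

def deviates_from_baseline(samples_last_hour, latest, threshold=0.5):
    """Return True if `latest` deviates by more than `threshold` (50% by
    default) from the mean of the previous hour's samples."""
    if not samples_last_hour:
        return False  # no baseline to compare against yet
    baseline = mean(samples_last_hour)
    if baseline == 0:
        return latest != 0
    return abs(latest - baseline) / baseline > threshold

# Hypothetical consul.kvs.apply timings (ms) from the last hour, plus a new sample.
history = [2.1, 1.9, 2.4, 2.0, 2.2]
print(deviates_from_baseline(history, 4.8))  # True -- worth alerting on
```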

Leadership changes

| Metric Name | Description |
| --- | --- |
| consul.raft.leader.lastContact | Measures the time since the leader was last able to contact the follower nodes when checking its leader lease. |
| consul.raft.state.candidate | This increments whenever a Consul server starts an election. |
| consul.raft.state.leader | This increments whenever a Consul server becomes a leader. |

Why they're important: Normally, your Consul cluster should have a stable leader. If there are frequent elections or leadership changes, it would likely indicate network issues between the Consul servers, or that the Consul servers themselves are unable to keep up with the load.

What to look for: If candidate or leader is greater than 0, or if lastContact exceeds 200ms.
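
One way to watch these values outside a full metrics pipeline is to scrape the agent's telemetry endpoint directly. The sketch below assumes a local agent on 127.0.0.1:8500 and that the JSON layout (Counters/Samples lists with Name, Count, and Max fields) matches what recent Consul versions return from /v1/agent/metrics; depending on your telemetry settings (e.g. disable_hostname), metric names may also carry a hostname segment, so treat the lookups as a starting point rather than a drop-in check.

```python
import json
from urllib.request import urlopen

CONSUL_ADDR = "http://127.0.0.1:8500"  # assumed local agent address

def raft_leadership_snapshot():
    """Fetch one scrape of agent telemetry and extract the Raft leadership metrics."""
    with urlopen(f"{CONSUL_ADDR}/v1/agent/metrics") as resp:
        data = json.load(resp)

    # Election/leadership events show up as counters; any non-zero count is notable.
    counters = {c["Name"]: c.get("Count", 0) for c in data.get("Counters", [])}
    # lastContact is a timer (reported under Samples); use its max for the interval.
    samples = {s["Name"]: s.get("Max", 0) for s in data.get("Samples", [])}

    return {
        "candidate": counters.get("consul.raft.state.candidate", 0),
        "leader": counters.get("consul.raft.state.leader", 0),
        "last_contact_ms": samples.get("consul.raft.leader.lastContact", 0),
    }

snap = raft_leadership_snapshot()
if snap["candidate"] > 0 or snap["leader"] > 0 or snap["last_contact_ms"] > 200:
    print("Leadership instability:", snap)
```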

Autopilot

| Metric Name | Description |
| --- | --- |
| consul.autopilot.healthy | This tracks the overall health of the local server cluster. If all servers are considered healthy by Autopilot, this will be set to 1. If any are unhealthy, this will be 0. |

Why it's important: Obviously, you want your cluster to be healthy.

What to look for: Alert if healthy is 0.
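
Besides watching the metric, Autopilot health can also be polled from the operator API. A minimal sketch, assuming a local agent on 127.0.0.1:8500:

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen

def autopilot_healthy(consul_addr="http://127.0.0.1:8500"):
    """Return True if Autopilot considers every server healthy; this mirrors
    alerting when consul.autopilot.healthy drops to 0."""
    try:
        with urlopen(f"{consul_addr}/v1/operator/autopilot/health") as resp:
            return json.load(resp).get("Healthy", False)
    except HTTPError:
        # Consul returns an error status code when the cluster is unhealthy.
        return False

if not autopilot_healthy():
    print("ALERT: Autopilot reports one or more unhealthy Consul servers")
```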

Memory usage

| Metric Name | Description |
| --- | --- |
| consul.runtime.alloc_bytes | This measures the number of bytes allocated by the Consul process. |
| consul.runtime.sys_bytes | This is the total number of bytes of memory obtained from the OS. |
| mem.total | Total amount of physical memory (RAM) available on the server. |
| mem.used_percent | Percentage of physical memory in use. |
| swap.used_percent | Percentage of swap space in use. |

Why they're important: Consul keeps all of its data in memory. If Consul consumes all available memory, it will crash. You should also monitor total available RAM to make sure some RAM is available for other processes, and swap usage should remain at 0% for best performance.

What to look for: If consul.runtime.sys_bytes exceeds 90% of mem.total, if mem.used_percent is over 90%, or if swap.used_percent is greater than 0.
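
Expressed as code, these thresholds amount to a few comparisons. The sketch below takes the latest values from whatever pipeline you use; the argument names mirror the metrics, but the function itself is hypothetical.

```python
def memory_alerts(sys_bytes, mem_total, mem_used_percent, swap_used_percent):
    """Evaluate the memory thresholds above. The arguments mirror
    consul.runtime.sys_bytes, mem.total, mem.used_percent and swap.used_percent."""
    alerts = []
    if sys_bytes > 0.9 * mem_total:
        alerts.append("Consul is using more than 90% of physical memory")
    if mem_used_percent > 90:
        alerts.append("host memory usage is above 90%")
    if swap_used_percent > 0:
        alerts.append("swap space is in use")
    return alerts

# Hypothetical sample: an 8 GB host where the Consul process has grown to 7.5 GB.
print(memory_alerts(sys_bytes=7.5e9, mem_total=8e9,
                    mem_used_percent=94, swap_used_percent=0))
```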

Garbage collection

| Metric Name | Description |
| --- | --- |
| consul.runtime.total_gc_pause_ns | Number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Consul started. |

Why it's important: As mentioned above, GC pause is a "stop-the-world" event, meaning that all runtime threads are blocked until GC completes. Normally these pauses are very brief. But if memory usage is high, the Go runtime may GC so frequently that it starts to slow down Consul.

What to look for: Warning if total_gc_pause_ns exceeds 2 seconds/minute, critical if it exceeds 5 seconds/minute.

NOTE: total_gc_pause_ns is a cumulative counter, so in order to calculate rates (such as GC/minute), you will need to apply a function such as non_negative_difference.
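
If your tooling does not ship a non_negative_difference function, the transformation is straightforward to reproduce. The sketch below differences cumulative total_gc_pause_ns samples taken one minute apart (dropping negative deltas caused by restarts) and applies the warning/critical thresholds above; the sample values are made up.

```python
def non_negative_difference(samples):
    """Difference consecutive samples of a cumulative counter, dropping
    negative deltas (which appear when the counter resets, e.g. after a
    Consul restart)."""
    diffs = []
    for prev, cur in zip(samples, samples[1:]):
        delta = cur - prev
        if delta >= 0:
            diffs.append(delta)
    return diffs

# Hypothetical total_gc_pause_ns sampled once per minute (cumulative nanoseconds).
gc_pause_ns = [1.2e9, 1.9e9, 4.5e9, 0.3e9]  # the last sample follows a restart
for pause_ns in non_negative_difference(gc_pause_ns):
    seconds = pause_ns / 1e9
    if seconds > 5:
        print(f"CRITICAL: {seconds:.1f}s of GC pause in the last minute")
    elif seconds > 2:
        print(f"WARNING: {seconds:.1f}s of GC pause in the last minute")
```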

File descriptors

| Metric Name | Description |
| --- | --- |
| linux_sysctl_fs.file-nr | Number of file handles being used across all processes on the host. |
| linux_sysctl_fs.file-max | Total number of available file handles. |

Why it's important: Practically anything Consul does -- receiving a connection from another host, sending data between servers, writing snapshots to disk -- requires a file descriptor handle. If Consul runs out of handles, it will stop accepting connections. See the Consul FAQ for more details.

By default, process and kernel limits are fairly conservative. You will want to increase these beyond the defaults.

What to look for: If file-nr exceeds 80% of file-max.
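
These metrics typically come from Telegraf's linux_sysctl_fs input, but the underlying numbers are just kernel counters. A minimal standalone check (Linux only) might read them directly:

```python
def file_handle_usage():
    """Read the kernel's file-handle counters directly.
    /proc/sys/fs/file-nr holds three numbers: allocated, free, and maximum;
    /proc/sys/fs/file-max holds the maximum on its own."""
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _free, _maximum = (int(x) for x in f.read().split())
    with open("/proc/sys/fs/file-max") as f:
        file_max = int(f.read())
    return allocated, file_max

used, limit = file_handle_usage()
if used > 0.8 * limit:
    print(f"ALERT: {used} of {limit} file handles in use (over 80%)")
```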

CPU usage

| Metric Name | Description |
| --- | --- |
| cpu.user_cpu | Percentage of CPU being used by user processes (such as Vault or Consul). |
| cpu.iowait_cpu | Percentage of CPU time spent waiting for I/O tasks to complete. |

Why they're important: Consul is not particularly demanding of CPU time, but a spike in CPU usage might indicate too many operations taking place at once. High iowait_cpu is especially important: it means Consul is waiting for data to be written to disk, a sign that Raft might be writing snapshots to disk too often.

What to look for: cpu.iowait_cpu greater than 10%.

Network activity

| Metric Name | Description |
| --- | --- |
| net.bytes_recv | Bytes received on each network interface. |
| net.bytes_sent | Bytes transmitted on each network interface. |

Why they're important: A sudden spike in network traffic to Consul might be the result of a misconfigured Vault client causing too many requests.

What to look for: Sudden large changes to the net metrics (greater than 50% deviation from baseline).

NOTE: The net metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as non_negative_difference.

Disk activity

| Metric Name | Description |
| --- | --- |
| diskio.read_bytes | Bytes read from each block device. |
| diskio.write_bytes | Bytes written to each block device. |

Why they're important: Since Consul keeps everything in memory, there normally isn't much disk activity. If the Consul host is writing a lot of data to disk, it probably means that Consul is under heavy write load, and consequently is checkpointing Raft snapshots to disk frequently. It could also mean that debug/trace logging has accidentally been enabled in production, which can impact performance. Too much disk I/O can cause the rest of the system to slow down or become unavailable, as the kernel spends all its time waiting for I/O to complete.

What to look for: Sudden large changes to the diskio metrics (greater than 50% deviation from baseline, or more than 3 standard deviations from baseline).

NOTE: The diskio metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as non_negative_difference.