# Vault server metrics


## Request processing

| Metric Name | Description |
| ----------- | ----------- |
| `vault.core.handle_request` | Duration of requests handled by the Vault core. |

**Why it's important:** This is the key measure of Vault's response time.

**What to look for:** Changes to the `count` or `mean` fields that exceed 50% of baseline values, or more than 3 standard deviations above baseline.
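
As a rough sketch, assuming these statsd-style metrics land in InfluxDB via Telegraf (the stack implied by the field names above), the baseline comparison can be built from two InfluxQL queries; the 50% / 3-standard-deviation test itself is then applied by your alerting tool:

```sql
-- Current short-window average request duration
SELECT mean("mean") AS "avg_request"
FROM "vault.core.handle_request"
WHERE time > now() - 5m;

-- 24-hour baseline and spread for the same field
SELECT mean("mean") AS "baseline", stddev("mean") AS "spread"
FROM "vault.core.handle_request"
WHERE time > now() - 24h
```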

## Consul response time

| Metric Name | Description |
| ----------- | ----------- |
| `vault.consul.get` | Count and duration of GET operations against the Consul storage backend. |
| `vault.consul.put` | Count and duration of PUT operations against the Consul storage backend. |
| `vault.consul.list` | Count and duration of LIST operations against the Consul storage backend. |
| `vault.consul.delete` | Count and duration of DELETE operations against the Consul storage backend. |

**Why they're important:** These metrics indicate how long it takes Consul to handle requests from Vault.

**What to look for:** Large deltas in the `count`, `upper`, or `90_percentile` fields.
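
For example, a per-minute view of GET latency for baseline comparison might look like the sketch below, assuming the same Telegraf/InfluxDB setup; the other three operations follow the same pattern:

```sql
-- 90th-percentile GET latency and operation count per minute
SELECT max("90_percentile") AS "p90", sum("count") AS "ops"
FROM "vault.consul.get"
WHERE time > now() - 1h
GROUP BY time(1m)
```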

## Write-ahead log processing

| Metric Name | Description |
| ----------- | ----------- |
| `vault.wal.persistWALs` | Time taken to persist the Vault write-ahead logs (WALs) to the Consul backend. |
| `vault.wal.flushReady` | Time taken to flush the Vault write-ahead logs (WALs) to the persist queue. |

**Why they're important:** The Vault write-ahead logs (WALs) are used to replicate Vault between clusters. Perhaps surprisingly, the WALs are kept even if replication is not currently enabled. A garbage collector purges the WAL every few seconds, but if Vault is under heavy load, the WAL may start to grow, putting pressure on Consul.

**What to look for:** `flushReady` above 500 ms, or `persistWALs` above 1000 ms.
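
A minimal sketch of these threshold checks in InfluxQL, assuming the statsd timing fields arrive unchanged (`upper` is the per-interval maximum):

```sql
-- Worst-case WAL timings over the last 5 minutes;
-- alert if flushReady > 500 ms or persistWALs > 1000 ms
SELECT max("upper") AS "max_flush_ready" FROM "vault.wal.flushReady" WHERE time > now() - 5m;
SELECT max("upper") AS "max_persist_wals" FROM "vault.wal.persistWALs" WHERE time > now() - 5m
```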

## Leadership changes

| Metric Name | Description |
| ----------- | ----------- |
| `vault.core.leadership_lost` | Total duration of cluster leadership losses in a highly available cluster. |

**Why it's important:** There should not be a leadership change unless the leader crashes or becomes otherwise unavailable. While the remaining servers elect a new leader, Vault is unable to process any requests.

**What to look for:** Any value greater than 0 should raise an alert.
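
As a sketch, assuming the metric is stored under the same name, any sample at all in the window indicates a leadership change:

```sql
-- Number of leadership-loss samples in the last 5 minutes; alert if > 0
SELECT count("count") AS "leadership_losses"
FROM "vault.core.leadership_lost"
WHERE time > now() - 5m
```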

## Seal status

| Metric Name | Description |
| ----------- | ----------- |
| `consul_health_checks[check_name="Vault Sealed Status"].passing` | A value of 1 indicates Vault is unsealed; 0 means sealed. |

**Why it's important:** By default, Vault is sealed on startup, so if this value changes to 0 during the day, Vault has restarted for some reason, and until it is unsealed, it won't answer requests from clients.

**What to look for:** A value of 0 reported by any host.

NOTE: This metric is reported by Telegraf's Consul input plugin rather than by Vault itself.
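
A sketch of the corresponding check, assuming Telegraf's Consul input writes to InfluxDB with the `check_name` tag shown above:

```sql
-- Latest seal status per host; alert when any value is 0 (sealed)
SELECT last("passing") AS "unsealed"
FROM "consul_health_checks"
WHERE "check_name" = 'Vault Sealed Status'
GROUP BY "host"
```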

## Memory usage

| Metric Name | Description |
| ----------- | ----------- |
| `vault.runtime.alloc_bytes` | Number of bytes allocated by the Vault process. |
| `vault.runtime.sys_bytes` | Total number of bytes of memory obtained from the OS. |
| `mem.total_bytes` | Total amount of physical memory (RAM) available on the server. |
| `mem.used_percent` | Percentage of physical memory in use. |
| `swap.used_percent` | Percentage of swap space in use. |

**Why they're important:** Vault doesn't need as much memory as Consul, but if it runs out, it too will crash. You should also monitor total available RAM to make sure some is left for other processes, and swap usage should remain at 0% for best performance.

**What to look for:** `sys_bytes` exceeding 90% of `total_bytes`, `mem.used_percent` over 90%, or `swap.used_percent` greater than 0.
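
A sketch of the host-level checks, assuming Telegraf's standard `mem` and `swap` measurements; the `sys_bytes`-versus-`total_bytes` comparison spans two measurements, so it is easiest to compute in the alerting layer:

```sql
-- Latest memory and swap usage per host;
-- alert at used_percent > 90 or any nonzero swap usage
SELECT last("used_percent") AS "mem_pct" FROM "mem" GROUP BY "host";
SELECT last("used_percent") AS "swap_pct" FROM "swap" GROUP BY "host"
```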

## Garbage collection

| Metric Name | Description |
| ----------- | ----------- |
| `vault.runtime.total_gc_pause_ns` | Number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Vault started. |

**Why it's important:** As mentioned above, a GC pause is a "stop-the-world" event: all runtime threads are blocked until it completes. Normally these pauses last only a fraction of a millisecond, but if memory usage is high, the Go runtime may GC so frequently that it starts to slow Vault down.

**What to look for:** Warning if `total_gc_pause_ns` exceeds 2 seconds/minute, critical if it exceeds 5 seconds/minute.

**Additional notes:** `total_gc_pause_ns` is a cumulative counter, so in order to calculate rates (such as GC pause time per minute), you will need to apply a function such as `non_negative_difference`.
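
For instance, a sketch of that rate calculation using `non_negative_difference`, assuming the gauge is sampled at least once per interval (2 seconds/minute is 2,000,000,000 ns):

```sql
-- GC pause time accumulated per minute;
-- warn above 2000000000 ns, critical above 5000000000 ns
SELECT non_negative_difference(max("value")) AS "gc_pause_ns_per_min"
FROM "vault.runtime.total_gc_pause_ns"
WHERE time > now() - 1h
GROUP BY time(1m)
```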

## File descriptors

| Metric Name | Description |
| ----------- | ----------- |
| `linux_sysctl_fs.file-nr` | Number of file handles in use across all processes on the host. |
| `linux_sysctl_fs.file-max` | Total number of available file handles. |

**Why it's important:** Practically everything Vault does (receiving a connection from another host, sending data between servers, writing snapshots to disk) requires a file descriptor. If Vault runs out of descriptors, it will stop accepting connections.

By default, process and kernel limits are fairly conservative, so you will want to increase them beyond the defaults.

**What to look for:** `file-nr` exceeding 80% of `file-max`.
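
Since both fields live in the same measurement, the ratio can be computed directly in InfluxQL; a sketch assuming Telegraf's `linux_sysctl_fs` input:

```sql
-- Fraction of file handles in use per host; alert above 0.80
SELECT last("file-nr") / last("file-max") AS "fd_used_ratio"
FROM "linux_sysctl_fs"
GROUP BY "host"
```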

## CPU usage

| Metric Name | Description |
| ----------- | ----------- |
| `cpu.user_cpu` | Percentage of CPU being used by user processes (such as Vault or Consul). |
| `cpu.iowait_cpu` | Percentage of CPU time spent waiting for I/O tasks to complete. |

**Why they're important:** Encryption can place a heavy demand on the CPU. If the CPU is too busy, Vault may have trouble keeping up with the incoming request load. You may also want to monitor each CPU individually to make sure requests are evenly balanced across all CPUs.

**What to look for:** `cpu.iowait_cpu` greater than 10%.
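
A sketch of the iowait check, assuming the field names used in the table above (stock Telegraf names this field `usage_iowait`, so adjust to match your configuration):

```sql
-- Average iowait over the last 5 minutes, per host; alert above 10
SELECT mean("iowait_cpu") AS "avg_iowait"
FROM "cpu"
WHERE time > now() - 5m
GROUP BY "host"
```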

## Network activity

| Metric Name | Description |
| ----------- | ----------- |
| `net.bytes_recv` | Bytes received on each network interface. |
| `net.bytes_sent` | Bytes transmitted on each network interface. |

**Why they're important:** A sudden spike in network traffic to Vault might be the result of a misconfigured client causing too many requests, or additional load you didn't plan for.

**What to look for:** Sudden large changes to the `net` metrics (greater than 50% deviation from baseline).

NOTE: The `net` metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as `non_negative_difference`.
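
A sketch of that rate calculation, assuming Telegraf's `net` input tags each series with `interface`:

```sql
-- Bytes received per minute, per interface (counter converted to a rate)
SELECT non_negative_difference(max("bytes_recv")) AS "rx_bytes_per_min"
FROM "net"
WHERE time > now() - 1h
GROUP BY time(1m), "interface"
```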

## Disk activity

| Metric Name | Description |
| ----------- | ----------- |
| `diskio.read_bytes` | Bytes read from each block device. |
| `diskio.write_bytes` | Bytes written to each block device. |

**Why they're important:** Vault generally doesn't require much disk I/O, so a sudden change in disk activity could mean that debug/trace logging has accidentally been enabled in production, which can hurt performance. Too much disk I/O can also cause the rest of the system to slow down or become unavailable as the kernel spends all its time waiting for I/O to complete.

**What to look for:** Sudden large changes to the `diskio` metrics (greater than 50% deviation from baseline, or more than 3 standard deviations from baseline).

NOTE: The `diskio` metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as `non_negative_difference`.
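
And the same pattern for disk throughput, assuming Telegraf's `diskio` input, which tags each series with the device `name`:

```sql
-- Bytes written per minute, per block device (counter converted to a rate)
SELECT non_negative_difference(max("write_bytes")) AS "wr_bytes_per_min"
FROM "diskio"
WHERE time > now() - 1h
GROUP BY time(1m), "name"
```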