Vault server metrics
Metric Name | Description |
---|---|
`vault.core.handle_request` | Duration of requests handled by the Vault core. |
Why it's important: This is the key measure of Vault's response time.
What to look for: Changes to the `count` or `mean` fields that deviate more than 50% from baseline, or more than 3 standard deviations above baseline.
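As a rough illustration of that rule, the sketch below flags a sample that deviates more than 50% from a baseline window or sits more than 3 standard deviations above it. This is a hypothetical standalone check, not part of the Telegraf pipeline; the sample durations and window are made up for illustration.

```python
from statistics import mean, stdev

def is_anomalous(baseline_samples, latest, pct=0.5, sigmas=3.0):
    """Flag `latest` if it deviates more than `pct` (50%) from the
    baseline mean, or sits more than `sigmas` standard deviations above it."""
    base = mean(baseline_samples)
    spread = stdev(baseline_samples)
    if abs(latest - base) > pct * base:
        return True
    return spread > 0 and (latest - base) > sigmas * spread

# Hypothetical vault.core.handle_request mean durations (ms) from recent windows:
history = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
print(is_anomalous(history, 19.5))  # True: roughly 60% above baseline
```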
Metric Name | Description |
---|---|
`vault.consul.get` | Count and duration of GET operations against the Consul storage backend. |
`vault.consul.put` | Count and duration of PUT operations against the Consul storage backend. |
`vault.consul.list` | Count and duration of LIST operations against the Consul storage backend. |
`vault.consul.delete` | Count and duration of DELETE operations against the Consul storage backend. |
Why they're important: These metrics indicate how long it takes for Consul to handle requests from Vault.
What to look for: Large deltas in the `count`, `upper`, or `90_percentile` fields.
Metric Name | Description |
---|---|
`vault.wal.persistWALs` | Amount of time required to persist the Vault write-ahead logs (WAL) to the Consul backend. |
`vault.wal.flushReady` | Amount of time required to flush the Vault write-ahead logs (WAL) to the persist queue. |
Why they're important: The Vault write-ahead logs (WALs) are used to replicate Vault between clusters. Surprisingly, the WALs are kept even if replication is not currently enabled. A garbage collector purges the WALs every few seconds, but if Vault is under heavy load, the WALs may start to grow, putting pressure on Consul.
What to look for: If `flushReady` is over 500ms, or if `persistWALs` is over 1000ms.
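Expressed as a check, assuming you have already pulled the latest mean durations for both metrics (in milliseconds) from your metrics store, a minimal sketch might look like this; the variable values are placeholders:

```python
# Placeholder values standing in for the latest reported means (ms).
flush_ready_ms = 620.0    # vault.wal.flushReady
persist_wals_ms = 1150.0  # vault.wal.persistWALs

if flush_ready_ms > 500:
    print("WARN: vault.wal.flushReady over 500ms -- WAL flushes are falling behind")
if persist_wals_ms > 1000:
    print("WARN: vault.wal.persistWALs over 1000ms -- Consul backend under pressure")
```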
Metric Name | Description |
---|---|
`vault.core.leadership_lost` | Total duration of cluster leadership losses in a highly available cluster. |
Why it's important: There should not be a leadership change unless the leader crashes or becomes otherwise unavailable. While the other servers elect a leader, Vault is unable to process any requests.
What to look for: Any value greater than 0 should trigger an alert.
Metric Name | Description |
---|---|
`consul_health_checks[check_name="Vault Sealed Status"].passing` | A value of 1 indicates Vault is unsealed; 0 means it is sealed. |
Why it's important: By default, Vault is sealed on startup, so if this value changes to 0 during the day, Vault has restarted for some reason, and until it is unsealed it will not answer requests from clients.
What to look for: A value of 0 being reported by any host.
NOTE: This metric is actually reported to Telegraf by the Consul plugin.
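Seal status can also be cross-checked directly against Vault's `/v1/sys/health` endpoint, independent of Telegraf. A minimal sketch using only the Python standard library (the address is a placeholder; note that `sys/health` deliberately answers with a non-200 status code such as 503 when sealed, so the error path still carries the JSON body):

```python
import json
import urllib.request
import urllib.error

VAULT_ADDR = "http://127.0.0.1:8200"  # placeholder address

def vault_is_sealed():
    """Return the `sealed` flag from Vault's health endpoint.
    sys/health uses non-200 status codes (e.g. 503 = sealed),
    so the HTTPError body must be parsed as well."""
    try:
        with urllib.request.urlopen(f"{VAULT_ADDR}/v1/sys/health") as resp:
            return json.load(resp)["sealed"]
    except urllib.error.HTTPError as err:
        return json.load(err)["sealed"]

if vault_is_sealed():
    print("ALERT: Vault is sealed and will not answer client requests")
```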
Metric Name | Description |
---|---|
`vault.runtime.alloc_bytes` | Number of bytes allocated by the Vault process. |
`vault.runtime.sys_bytes` | Total number of bytes of memory obtained from the OS. |
`mem.total_bytes` | Total amount of physical memory (RAM) available on the server. |
`mem.used_percent` | Percentage of physical memory in use. |
`swap.used_percent` | Percentage of swap space in use. |
Why they're important: Vault doesn't need as much memory as Consul, but if it does run out, it too will crash. You should also monitor total available RAM to make sure some RAM is available for other processes, and swap usage should remain at 0% for best performance.
What to look for: If `sys_bytes` exceeds 90% of `mem.total_bytes`, if `mem.used_percent` is over 90%, or if `swap.used_percent` is greater than 0.
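For the host-level side of these checks, here is a sketch using the third-party `psutil` package (an assumption; any system-stats library would do). The Vault-specific `sys_bytes` comparison would come from your metrics store instead and is omitted here:

```python
import psutil  # third-party: pip install psutil

mem = psutil.virtual_memory()
swap = psutil.swap_memory()

if mem.percent > 90:
    print(f"WARN: physical memory at {mem.percent:.1f}% of {mem.total} bytes")
if swap.percent > 0:
    print(f"WARN: swap in use ({swap.percent:.1f}%) -- expect degraded performance")
```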
Metric Name | Description |
---|---|
`vault.runtime.total_gc_pause_ns` | Number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Vault started. |
Why it's important: As mentioned above, GC pause is a "stop-the-world" event, meaning that all runtime threads are blocked until GC completes. Normally these pauses last only a few nanoseconds. But if memory usage is high, the Go runtime may GC so frequently that it starts to slow down Vault.
What to look for: Warning if `total_gc_pause_ns` exceeds 2 seconds/minute; critical if it exceeds 5 seconds/minute.
Additional notes: `total_gc_pause_ns` is a cumulative counter, so in order to calculate rates (such as GC pause time per minute), you will need to apply a function such as `non_negative_difference`, as sketched below.
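To make that note concrete, here is a Python sketch of what a function like `non_negative_difference` does: it turns the raw cumulative samples into per-interval deltas, discarding negative jumps (which appear when Vault restarts and the counter resets). The sample values are hypothetical:

```python
def non_negative_difference(samples):
    """Per-interval deltas of a cumulative counter, dropping negative
    jumps caused by counter resets (e.g. a Vault restart)."""
    return [max(curr - prev, 0) for prev, curr in zip(samples, samples[1:])]

# Hypothetical total_gc_pause_ns samples, one per minute
# (the counter resets after the third sample):
gc_pause_ns = [1.2e9, 2.5e9, 4.1e9, 0.3e9, 1.9e9]
gc_seconds_per_minute = [d / 1e9 for d in non_negative_difference(gc_pause_ns)]
print(gc_seconds_per_minute)  # [1.3, 1.6, 0.0, 1.6] -- under the 2 s/min warning
```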
Metric Name | Description |
---|---|
`linux_sysctl_fs.file-nr` | Number of file handles being used across all processes on the host. |
`linux_sysctl_fs.file-max` | Total number of available file handles. |
Why it's important: Practically anything Vault does -- receiving a connection from another host, sending data between servers, writing snapshots to disk -- requires a file descriptor handle. If Vault runs out of handles, it will stop accepting connections.
By default, process and kernel limits are fairly conservative. You will want to increase these beyond the defaults.
What to look for: If `file-nr` exceeds 80% of `file-max`.
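On Linux these numbers come straight from `/proc`, so the 80% rule can be sanity-checked locally with a few lines (Linux-only sketch; the threshold mirrors the rule above):

```python
def file_handle_usage():
    """Read allocated and maximum file handles from /proc (Linux-only).
    /proc/sys/fs/file-nr holds three fields: allocated, unused, maximum."""
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _unused, _maximum = (int(x) for x in f.read().split())
    with open("/proc/sys/fs/file-max") as f:
        file_max = int(f.read())
    return allocated, file_max

used, limit = file_handle_usage()
if used > 0.8 * limit:
    print(f"WARN: {used} of {limit} file handles in use (over 80%)")
```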
Metric Name | Description |
---|---|
`cpu.user_cpu` | Percentage of CPU being used by user processes (such as Vault or Consul). |
`cpu.iowait_cpu` | Percentage of CPU time spent waiting for I/O tasks to complete. |
Why they're important: Encryption can place a heavy demand on CPU. If the CPU is too busy, Vault may have trouble keeping up with the incoming request load. You may also want to monitor each CPU individually to make sure requests are evenly balanced across all CPUs.
What to look for: If `cpu.iowait_cpu` is greater than 10%.
Metric Name | Description |
---|---|
`net.bytes_recv` | Bytes received on each network interface. |
`net.bytes_sent` | Bytes transmitted on each network interface. |
Why they're important: A sudden spike in network traffic to Vault might be the result of a misconfigured client causing too many requests, or additional load you didn't plan for.
What to look for: Sudden large changes to the `net` metrics (greater than 50% deviation from baseline).
NOTE: The `net` metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as `non_negative_difference`.
Metric Name | Description |
---|---|
`diskio.read_bytes` | Bytes read from each block device. |
`diskio.write_bytes` | Bytes written to each block device. |
Why they're important: Vault generally doesn't require too much disk I/O, so a sudden change in disk activity could mean that debug/trace logging has accidentally been enabled in production, which can impact performance. Too much disk I/O can cause the rest of the system to slow down or become unavailable, as the kernel spends all its time waiting for I/O to complete.
What to look for: Sudden large changes to the `diskio` metrics (greater than 50% deviation from baseline, or more than 3 standard deviations from baseline).
NOTE: The `diskio` metrics are counters, so in order to calculate rates (such as bytes/second), you will need to apply a function such as `non_negative_difference`.
This page is part of the Vault and Consul monitoring guide.