Todd Radel edited this page Apr 20, 2018 · 24 revisions

Monitoring Options

Almost all of these options start with a local agent on each host.

Grafana/Graphite

Graphite is an open-source tool for storing and graphing time-series data. It doesn't support dashboards or alerts on its own, but Grafana can be layered on top of it to provide both.

TICK Stack

TICK is an acronym for Telegraf, InfluxDB, Chronograf, and Kapacitor. Together they provide a full solution for storing, displaying, and alerting on time-series data. The stack is available in both open-source and commercial versions from InfluxData.

  • Telegraf provides a statsd-compatible host agent.
  • InfluxDB is the time-series database.
  • Chronograf is a dashboarding engine roughly similar to Grafana.
  • Kapacitor provides alerting.

Amazon CloudWatch

CloudWatch is Amazon's solution for monitoring AWS cloud resources. It handles both time-series data and log files. If you are running Vault and Consul in AWS, it can be an easy choice to make.

One limitation of CloudWatch is that time-series data is available only at 1-minute granularity, and only for 15 days. After that, the data is rolled up into 5-minute and 1-hour buckets. For more details, see here.
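To make the roll-up concrete, here is a small sketch that maps the age of a data point to the finest granularity still available. The 15-day figure comes from the text above. The 63-day and 455-day limits for the 5-minute and 1-hour tiers are assumptions based on the AWS documentation as we recall it, so check them against the current CloudWatch docs before relying on them. The names RETENTION_TIERS and finest_granularity_seconds are made up for this example.

```python
# (max age in days, period in seconds) for each retention tier.
# Only the first tier is stated in this page; the other two limits are
# assumptions recalled from the AWS documentation and should be verified.
RETENTION_TIERS = [
    (15, 60),      # 1-minute data points
    (63, 300),     # 5-minute roll-ups (assumed limit)
    (455, 3600),   # 1-hour roll-ups (assumed limit)
]


def finest_granularity_seconds(age_days):
    """Return the smallest period still available for data of this age.

    Returns None once the data is older than every tier.
    """
    for max_age_days, period_seconds in RETENTION_TIERS:
        if age_days <= max_age_days:
            return period_seconds
    return None
```

For example, a dashboard that looks back 30 days could not plot anything finer than 5-minute points under these assumptions.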

Prometheus

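Prometheus is a modern alternative to statsd-compatible daemons. Rather than receiving metrics pushed by an agent, it scrapes them from HTTP endpoints at regular intervals. It is increasingly popular in containerized environments.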

DataDog

DataDog is a commercial SaaS solution. It provides a customized statsd agent, DogStatsd, that includes several vendor-specific extensions, chiefly tagging and service check results.

Other Options

There are many other commercial and open-source choices. Configuring those is beyond the scope of this document.

Setting up monitoring on your cluster

Setting up an InfluxDB and Grafana Server

Installing Telegraf Agents

Configuring Consul Agents

Asking Consul to send telemetry to Telegraf is as simple as adding a telemetry section to your agent configuration:

{
  "telemetry": {
    "dogstatsd_addr": "localhost:8125",
    "disable_hostname": true
  }
}

As you can see, we only need to specify two options. The first, dogstatsd_addr, specifies the hostname and port of the statsd daemon. Note that we specify DogStatsd format instead of plain statsd, which tells Consul to send tags with each metric. Telegraf is compatible with the DogStatsd format and allows us to add our own tags too, as you'll see below.

The second option, disable_hostname, tells Consul not to insert the hostname in the names of the metrics it sends to statsd, since the hostname will be attached as a tag instead.
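To show what "sending tags with each metric" looks like on the wire, here is a minimal sketch that builds a DogStatsd-style datagram and sends it over UDP. The metric name consul.example.gauge and the helper names are hypothetical; real Consul metric names differ. The format shown (name:value|type|#tag:value,...) is the basic DogStatsd layout without the optional sample rate.

```python
import socket


def format_dogstatsd(name, value, metric_type, tags=None):
    """Build one DogStatsd datagram: name:value|type|#tag:value,..."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Sort tags so the output is stable from run to run.
        tag_text = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        line += f"|#{tag_text}"
    return line


def send_metric(line, host="localhost", port=8125):
    """Send one datagram over UDP. statsd is fire-and-forget: no reply comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


# A hypothetical gauge tagged with its datacenter.
example = format_dogstatsd(
    "consul.example.gauge", 42, "g", {"datacenter": "us-east-1"}
)
# To push it to a local agent: send_metric(example)
```

Because the hostname travels as a tag rather than as part of the metric name, the same metric name can be aggregated across every host in the cluster.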

If you are using a different agent (e.g. Circonus, Statsite, or plain statsd), you can find the configuration reference here.

Configuring Vault

As with Consul, configuring Vault to send telemetry is painless. Just add one stanza to your Vault config:

telemetry {
  dogstatsd_addr   = "localhost:8125"
  disable_hostname = true
}

The options are the same as they were for Consul. The full reference can be found at https://www.vaultproject.io/docs/configuration/telemetry.html.
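If you want to confirm that Vault or Consul is actually emitting metrics before Telegraf is in place, a throwaway UDP listener is enough. The sketch below is an assumption about how you might do that, not part of either product. The function names are hypothetical, and the listener has to bind the same port as dogstatsd_addr, so it cannot run alongside a Telegraf statsd listener on that port.

```python
import socket


def open_listener(host="127.0.0.1", port=8125):
    """Bind a UDP socket where the agent has been told to send metrics."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_datagrams(sock, count, timeout=5.0):
    """Read up to `count` datagrams and return the metric lines they contain.

    Stops early if nothing arrives within `timeout` seconds.
    """
    sock.settimeout(timeout)
    lines = []
    for _ in range(count):
        try:
            data, _addr = sock.recvfrom(65535)
        except socket.timeout:
            break
        # One datagram may carry several newline-separated metrics.
        lines.extend(data.decode("utf-8", errors="replace").splitlines())
    return lines


# Typical use (left as a comment so importing this file has no side effects):
#   listener = open_listener()
#   for line in receive_datagrams(listener, count=10):
#       print(line)
#   listener.close()
```

If nothing prints within the timeout, check that the telemetry stanza was loaded and that the service was restarted after the change.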

Capturing System Metrics With Telegraf

Besides acting as a statsd agent, Telegraf can collect additional metrics of its own. Telegraf itself ships with a wide range of input plugins. We're going to enable some of the most common ones to monitor CPU, memory, disk I/O, networking, and process status.

The telegraf.conf file starts with global options:

[agent]
  interval = "10s"
  flush_interval = "10s"
  omit_hostname = false

We set the default collection interval to 10 seconds and ask Telegraf to include a host tag in each metric.

As mentioned above, Telegraf also allows you to set additional tags on the metrics that pass through it. In this case, we are adding tags for the server role and datacenter. We can then use these tags in Grafana to filter queries (for example, to create a dashboard showing only servers with the consul-server role, or only servers in the us-east-1 datacenter).

[global_tags]
  role = "consul-server"
  datacenter = "us-east-1"
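The effect of global tags can be sketched in a few lines. This is an illustration only: the function names and sample hosts are hypothetical, and the rule that a tag already present on a metric wins over a global tag of the same name is our assumption about Telegraf's behavior, so verify it if you depend on it.

```python
GLOBAL_TAGS = {"role": "consul-server", "datacenter": "us-east-1"}


def apply_global_tags(metric_tags, global_tags):
    """Return the metric's tags plus any global tags it does not already carry."""
    merged = dict(global_tags)
    merged.update(metric_tags)  # assumption: the metric's own tags take precedence
    return merged


def matches(tags, **wanted):
    """True when every wanted tag is present with the wanted value."""
    return all(tags.get(key) == value for key, value in wanted.items())


# Two hypothetical hosts reporting through agents with different global tags.
points = [
    apply_global_tags({"host": "consul-1"}, GLOBAL_TAGS),
    apply_global_tags(
        {"host": "vault-1"}, {"role": "vault-server", "datacenter": "us-east-1"}
    ),
]

# Equivalent to a dashboard filter on role = consul-server.
consul_servers = [p["host"] for p in points if matches(p, role="consul-server")]
```

A Grafana query that filters on the role or datacenter tag does the same selection on the database side.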

Next, we set up a statsd listener.

[[inputs.statsd]]
  protocol = "udp"
  service_address = ":8125"
  delete_gauges = true
  delete_counters = true
  delete_sets = true
  delete_timings = true
  percentiles = [90]
  metric_separator = "_"
  parse_data_dog_tags = true
  allowed_pending_messages = 10000
  percentile_limit = 1000

You can see complete Telegraf configuration examples for Consul and Vault hosts.

While you're at it, you may as well set up monitoring on the InfluxDB/Grafana server too. Here's an example of how you could do that.

What to Monitor

Alerting