Paul Norman edited this page Nov 10, 2022 · 2 revisions

Runbooks for responding to an outage or degradation of service

Standard Tile Layer

Tile CDN node health check failures


Health check failures can be caused by Fastly issues, routing issues, or backend server issues. Routing issues are the most common cause.

Routing issue runbook

Preconditions: Have the shell variable FASTLY_API_TOKEN set to a Fastly API key.

  1. Identify the server with health check failures
  2. Using the rendering dashboard and filtering it to the host with failures, verify that the server still has some traffic. If not, this is likely a backend failure.
  3. Using a Prometheus query like fastly_healthcheck_status{host="", backend=~"nidhogg"}, identify the datacenter code for the Fastly POP with problems.
  4. Identify whether the routing problem is impacting production. Sometimes that POP would not use the render server failing the health check anyway, because closer ones are available. We want to fix the problem in either case, but knowing the impact helps prioritize it.
  5. Find the POP IP by checking the healthcheck response: run curl -s -H "Fastly-Key: ${FASTLY_API_TOKEN}" | jq . and search the output for the datacenter code (e.g. FRA). Find the x-cacheip header and copy its IP.
  6. SSH to the render server and run mtr -w -z -c 100 <ip>. This will take a couple of minutes.
  7. Identify whether the packet loss is coming from the first hops. If so, contact the NOC of the internet provider for the server. If it starts later in the path, check Fastly Status for any errors related to the POP.
  8. If it is not a known issue, open a ticket with Fastly support. Open the ticket as "Contact Support" with a category of "performance", state that there is packet loss between a Fastly POP and the render server, and ask them to forward the information to NetOps. Include:
     - The MTR results. If there are multiple nodes, include MTRs for all of them.
     - When the problems started, as established by Prometheus.
     - Whether it is intermittent.
     - Whether it is currently impacting production, or whether another server is handling the load for that POP.
     Use priority "Normal" for non-impacting problems and priority "High" for impacting ones.
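
Step 7 asks where along the path the packet loss begins. As a rough aid, the sketch below parses the text report printed by mtr -w and returns the first hop from which loss continues all the way to the destination. The function names and the 1% threshold are assumptions made for illustration, not part of any existing tooling. Ignoring loss that does not persist to the final hop is a common rule of thumb (routers often deprioritize ICMP replies), so treat the result as a hint, not a diagnosis.

```python
import re

# Matches the hop number that starts each data row of an mtr report,
# e.g. " 13. AS3491 ..." (with -z) or " 13.|-- host ..." (without -z).
HOP_ROW = re.compile(r"^\s*(\d+)\.")


def parse_mtr_loss(report: str) -> list[tuple[int, float]]:
    """Return (hop number, loss percent) for each hop row in an mtr report."""
    hops = []
    for line in report.splitlines():
        match = HOP_ROW.match(line)
        fields = line.split()
        if not match or len(fields) < 8:
            continue  # header, shell prompt, or blank line
        # The last seven columns are Loss%, Snt, Last, Avg, Best, Wrst, StDev.
        loss = float(fields[-7].rstrip("%"))
        hops.append((int(match.group(1)), loss))
    return hops


def first_sustained_loss_hop(hops, threshold=1.0):
    """Return the first hop of the trailing run of lossy hops, or None.

    Only loss that persists through to the final hop is counted.
    The threshold (in percent) is an arbitrary, hypothetical default.
    """
    if not hops or hops[-1][1] < threshold:
        return None
    first = hops[-1][0]
    for hop, loss in reversed(hops):
        if loss < threshold:
            break
        first = hop
    return first
```

Applied to the sample MTR on this page, first_sustained_loss_hop(parse_mtr_loss(report)) returns 12: hops 1 to 11 are clean and every hop from 12 onward shows loss, which points away from the first hops and toward the later part of the path.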

Sample message

We are having packet loss between the KUL datacenter and our origin server. This is not immediately impacting our service, as traffic from that datacenter is being routed to other origin servers by default, but it does indicate a network problem. Can you please forward this information to NetOps?

An MTR from nidhogg is below

pnorman@nidhogg:~$ mtr -w -z -c 100
Start: 2022-11-10T07:42:07+0000
HOST:                Loss%   Snt   Last   Avg  Best  Wrst StDev
  1. AS15980          0.0%   100    1.5   1.4   0.2  12.7   2.4
  2. AS1653           0.0%   100    1.0   0.8   0.3  18.3   2.1
  3. AS1653           0.0%   100    5.5   3.8   3.4  16.3   1.5
  4. AS1653           0.0%   100    6.9   7.0   6.0  20.8   2.7
  5. AS1653           0.0%   100    8.6   8.1   7.3  36.9   3.5
  6. AS1653           0.0%   100   25.9  11.5   8.5  42.9   6.9
  7. AS2603           0.0%   100   10.3   9.3   8.6  22.1   1.7
  8. AS2603           0.0%   100   18.8  20.1  18.7  37.5   3.5
  9. AS2603           0.0%   100   23.2  24.4  23.0  60.8   4.8
 10. AS2603           0.0%   100   47.2  31.2  29.1  47.2   3.9
 11. AS2603           0.0%   100   34.4  35.1  33.8  65.1   3.7
 12. AS???    ???    100.0   100    0.0   0.0   0.0   0.0   0.0
 13. AS3491          18.0%   100  208.3 208.5 207.7 210.8   0.6
 14. AS3491          27.0%   100  210.2 208.6 208.0 210.7   0.6
 15. AS54113         21.0%   100  210.7 208.5 207.7 210.7   0.6

Handle any follow-up communication through the Fastly website and add as a CC.
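
The ticket contents listed in step 8 can be assembled mechanically once the facts are known. The sketch below is one possible way to do that; the class and field names are hypothetical, and the wording only loosely follows the sample message above. It does not talk to Fastly: it only builds the text and picks the priority, and the ticket itself is still opened by hand.

```python
from dataclasses import dataclass, field


@dataclass
class PacketLossTicket:
    """Facts to include in the support ticket (all names are hypothetical)."""
    pop_code: str      # datacenter code of the POP, e.g. "KUL"
    started: str       # when the problem started, as established by Prometheus
    intermittent: bool
    impacting: bool    # True if production is currently affected
    mtr_reports: dict = field(default_factory=dict)  # render server name -> mtr output

    def priority(self) -> str:
        # "High" for production-impacting problems, "Normal" otherwise.
        return "High" if self.impacting else "Normal"

    def body(self) -> str:
        pattern = "intermittent" if self.intermittent else "continuous"
        if self.impacting:
            impact = "This is currently impacting production."
        else:
            impact = ("This is not immediately impacting our service, as another "
                      "origin server is handling the load for that datacenter.")
        lines = [
            f"We are having {pattern} packet loss between the {self.pop_code} "
            f"datacenter and our origin server, starting around {self.started}.",
            impact,
            "Can you please forward this information to NetOps?",
        ]
        # Include one MTR per affected render server.
        for server, report in self.mtr_reports.items():
            lines += ["", f"An MTR from {server} is below", "", report.rstrip()]
        return "\n".join(lines)
```

Keeping the priority as a method derived from the impacting flag means the text and the priority cannot disagree about whether production is affected.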