Add performance monitoring tools to understand resilience issues #1427

alanbchristie · 2024-04-25T10:49:47Z

Main purpose: to understand resilience issues.

Work to add monitoring (Prometheus, Grafana, Sentry) to the deployed stacks. Even minimal work should give us access to Kubernetes metrics and, by installing a package in the Fragalysis Django app, basic API performance metrics.
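
For illustration, the Django side could look roughly like the settings fragment below. The package is not named in this issue, so this is a minimal sketch that assumes `django-prometheus`. The surrounding app and middleware entries are placeholders, not the actual Fragalysis settings.

```python
# settings.py fragment (sketch only). Assumption: the package installed in the
# Django app is django-prometheus; this issue does not name it.

INSTALLED_APPS = [
    "django.contrib.contenttypes",
    "django.contrib.auth",
    # ... the application's existing apps (placeholder) ...
    "django_prometheus",
]

MIDDLEWARE = [
    # The "before" middleware goes first and the "after" middleware goes last,
    # so request timing wraps everything in between.
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    "django.middleware.common.CommonMiddleware",  # placeholder for existing middleware
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]

# In urls.py the metrics endpoint would then be exposed with something like:
#   path("", include("django_prometheus.urls"))
```

With a fragment like this in place the app would serve its metrics at a `/metrics` path for Prometheus to scrape.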

See the google-doc that describes options: -

https://docs.google.com/document/d/1V3D0-dAMQvscYN3bUhcWvHpEWN7k7ZjkwlDAkDeKBqs/edit?usp=sharing

mwinokan · 2024-05-02T10:17:14Z

@tdudgeon has implemented Prometheus on a dev cluster, but there are some issues that need resolving. To be discussed at the next meeting.

tdudgeon · 2024-05-08T06:43:37Z

Some metrics targets are broken when you do a vanilla install of Prometheus.
I raised this issue to get some assistance with this: rancher/rancher#45363 (comment)

tdudgeon · 2024-05-08T13:43:11Z

One of the STFC users (James Adams) suggested using this to monitor network connectivity within the cluster nodes:
https://oss.oetiker.ch/smokeping/doc/smokeping_master_slave.en.html

mwinokan · 2024-05-09T10:23:33Z

@tdudgeon thinks that SmokePing will aid in monitoring network connectivity, but needs to discuss with @alanbchristie whether we need this or if Prometheus covers this node-level connectivity monitoring. Likely 1-2 days work to implement SmokePing if it's needed.

(Alan back on Monday)

phraenquex · 2024-05-14T11:22:27Z

@tdudgeon says it's firing false alerts - misconfigured out of the box.

Hoping for others to fix the bugs - decide at next meeting what the actions are.

phraenquex · 2024-05-16T10:32:16Z

@tdudgeon says the false alerts are an artefact of being on such an old version of everything: kubernetes, rancher, longhorn, etc.

Please scope out the work.

Immediate action: fire up a test cluster with all the upgraded components and see if the monitoring still has issues.

alanbchristie · 2024-05-20T14:55:41Z

With prometheus and grafana installed we can deploy the stack (using deployment playbooks from the fragalysis-stack-kubernetes repository tagged 2024.12 or later). This deploys a ServiceMonitor definition (and an adjusted Service definition) to export the metrics to prometheus.

Once done, you can add the generic Django dashboard to Grafana by navigating to Dashboards -> Import, choosing Import via grafana.com, and entering the dashboard ID 17658. Change the Name and folder if you wish, select the Prometheus instance, and click Import. The dashboard should then be displayed.

alanbchristie · 2024-05-21T08:53:10Z

Here are some overnight "out of the box" metrics from the latest staging stack: -

Interestingly, this shows that there are a lot of very long response times and some very large response payloads. There are some clear endpoint culprits (response times). These all appear to take longer than 10 seconds, for example: -

And these appear to take more than 25 seconds: -

mwinokan · 2024-05-21T11:41:02Z

@tdudgeon says:

There is a lot of good output from the monitoring as implemented, e.g. as @alanbchristie has shown above. However, filesystem/volume mounting is very unstable: r/w volumes are frequently (but inconsistently) re-mounted as read-only, potentially due to network connectivity issues.
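
One way to spot the read-only re-mounts on a node is to scan the mount table. The following is a minimal sketch, not something deployed as part of this issue. It assumes pod volumes appear under `/var/lib/kubelet` and that the input has the `/proc/mounts` layout (device, mount point, filesystem type, options). The sample device and pod paths are hypothetical.

```python
def read_only_mounts(mounts_text, prefixes=("/var/lib/kubelet",)):
    """Return (mount_point, device) pairs mounted read-only under any prefix."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        if not mount_point.startswith(tuple(prefixes)):
            continue
        # Options are comma-separated; "ro" marks a read-only mount.
        if "ro" in options.split(","):
            found.append((mount_point, device))
    return found


# Hypothetical sample in /proc/mounts format.
SAMPLE = """\
/dev/longhorn/pvc-aaa /var/lib/kubelet/pods/p1/volumes/vol-a ext4 ro,relatime 0 0
/dev/longhorn/pvc-bbb /var/lib/kubelet/pods/p2/volumes/vol-b ext4 rw,relatime 0 0
tmpfs /run tmpfs ro,nosuid 0 0
"""

if __name__ == "__main__":
    for mount_point, device in read_only_mounts(SAMPLE):
        print(f"read-only: {mount_point} ({device})")
```

Some volumes are read-only by design, so the result is a list of candidates to check rather than a list of faults. On a real node the input would be the contents of `/proc/mounts`.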

alanbchristie · 2024-05-24T11:27:31Z

We have now deployed the SmokePing utility (mentioned by STFC). This now runs on each node in the DEV cluster, generating ping performance figures for all the other nodes in the cluster (excluding etcd).

The playbooks and documentation for the utility can be found in our new Ansible repo that is used to deploy the container image and related material: -

https://github.com/InformaticsMatters/smokeping-prober-ansible

This allows us to see metrics generated by each node: -

The expectation is that if networking is a cause of our resilience issues we might see something in the metrics being generated.
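
The "each node pings every other node" arrangement can be sketched as below. This is an illustration only and is not taken from the smokeping-prober-ansible playbooks. The node names and roles are hypothetical, and it assumes etcd nodes are left out both as probes and as targets.

```python
def ping_targets(nodes):
    """Map each non-etcd node to the other non-etcd nodes it should ping."""
    probes = [name for name, roles in nodes.items() if "etcd" not in roles]
    return {src: [dst for dst in probes if dst != src] for src in probes}


# Hypothetical cluster inventory: node name -> set of roles.
NODES = {
    "node-a": {"worker"},
    "node-b": {"worker"},
    "node-c": {"controlplane"},
    "etcd-1": {"etcd"},
}

if __name__ == "__main__":
    for source, targets in ping_targets(NODES).items():
        print(source, "->", ", ".join(targets))
```

With N probing nodes this gives N x (N - 1) ping series, which is what makes it possible to tell a single bad node from a general network problem.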

mwinokan · 2024-05-28T11:13:04Z

@alanbchristie says that SmokePing has already been useful to help diagnose issues this morning.

mwinokan · 2024-06-04T11:24:38Z

@tdudgeon says we now have performance monitoring, so this ticket is done.

alanbchristie assigned tdudgeon and alanbchristie Apr 25, 2024

phraenquex added 2024-04-25 pink Stack maintenance/monitoring stack labels May 2, 2024

mwinokan changed the title ~~Add performance monitoring~~ Add performance monitoring to the cluster May 14, 2024

phraenquex changed the title ~~Add performance monitoring to the cluster~~ Add performance monitoring of cluster (quick win) May 14, 2024

phraenquex changed the title ~~Add performance monitoring of cluster (quick win)~~ Add performance monitoring of cluster (understand resilience) May 14, 2024

phraenquex changed the title ~~Add performance monitoring of cluster (understand resilience)~~ Add performance monitoring to cluster (understand resilience) May 14, 2024

mwinokan mentioned this issue May 14, 2024

Network connectivity monitoring (SmokePing) #1433

Open

phraenquex changed the title ~~Add performance monitoring to cluster (understand resilience)~~ Add performance monitoring tools to understand resilience issues May 14, 2024

mwinokan mentioned this issue May 14, 2024

F/E gracefully handle stack issues (and B/E monitoring) #1434

Open

mwinokan mentioned this issue May 21, 2024

Enumerate frontend fixes needed to reduce request processing time #1438

Open

mwinokan added this to Fragalysis May 29, 2024

mwinokan moved this to In Progress (DEV) in Fragalysis May 29, 2024

mwinokan moved this from In Progress (DEV) to In production (Done) in Fragalysis Jun 4, 2024
