Add performance monitoring tools to understand resilience issues #1427

Open
alanbchristie opened this issue Apr 25, 2024 · 12 comments
Labels
2024-04-25 pink Stack maintenance/monitoring stack

Comments

@alanbchristie
Collaborator

alanbchristie commented Apr 25, 2024

Main purpose: we want to understand resilience issues.

Work to add monitoring (Prometheus, Grafana, Sentry) to the deployed stacks. Even minimal work should give us access to Kubernetes metrics and, by installing a package in the Fragalysis Django app, to basic API performance metrics.

See the google-doc that describes options: -

https://docs.google.com/document/d/1V3D0-dAMQvscYN3bUhcWvHpEWN7k7ZjkwlDAkDeKBqs/edit?usp=sharing
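The comment above does not name the Django package. As a rough sketch only, assuming it is something like `django-prometheus` (an assumption, not confirmed in this thread), enabling it mostly means adding an app and wrapping the existing middleware list. The helper name `with_prometheus` is hypothetical.

```python
# Sketch only: assumes the django-prometheus package; the thread does not name the package used.
BEFORE = "django_prometheus.middleware.PrometheusBeforeMiddleware"
AFTER = "django_prometheus.middleware.PrometheusAfterMiddleware"


def with_prometheus(installed_apps, middleware):
    """Return (apps, middleware) with the django-prometheus entries added once."""
    apps = list(installed_apps)
    if "django_prometheus" not in apps:
        apps.append("django_prometheus")
    # The timing middleware has to wrap everything else: first and last in the list.
    inner = [m for m in middleware if m not in (BEFORE, AFTER)]
    return apps, [BEFORE, *inner, AFTER]
```

In `settings.py` this would be used as `INSTALLED_APPS, MIDDLEWARE = with_prometheus(INSTALLED_APPS, MIDDLEWARE)`. With that package, adding `path("", include("django_prometheus.urls"))` to `urls.py` exposes a `/metrics` endpoint for Prometheus to scrape.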

@mwinokan
Collaborator

mwinokan commented May 2, 2024

@tdudgeon has implemented Prometheus on a dev cluster, but there are some issues that need resolving. To be discussed at the next meeting.

@tdudgeon
Collaborator

tdudgeon commented May 8, 2024

Some metrics targets are broken when you do a vanilla install of Prometheus.
I raised this issue to get some assistance with this: rancher/rancher#45363 (comment)

@tdudgeon
Collaborator

tdudgeon commented May 8, 2024

One of the STFC users (James Adams) suggested using this to monitor network connectivity within the cluster nodes:
https://oss.oetiker.ch/smokeping/doc/smokeping_master_slave.en.html

@mwinokan
Collaborator

mwinokan commented May 9, 2024

@tdudgeon thinks that SmokePing will aid in monitoring network connectivity, but needs to discuss with @alanbchristie whether we need this or if Prometheus covers this node-level connectivity monitoring. Likely 1-2 days work to implement SmokePing if it's needed.

(Alan back on Monday)

@mwinokan changed the title from "Add performance monitoring" to "Add performance monitoring to the cluster" on May 14, 2024
@phraenquex changed the title from "Add performance monitoring to the cluster" to "Add performance monitoring of cluster (quick win)" on May 14, 2024
@phraenquex changed the title from "Add performance monitoring of cluster (quick win)" to "Add performance monitoring of cluster (understand resilience)" on May 14, 2024
@phraenquex changed the title from "Add performance monitoring of cluster (understand resilience)" to "Add performance monitoring to cluster (understand resilience)" on May 14, 2024
@phraenquex
Collaborator

phraenquex commented May 14, 2024

@tdudgeon says it's firing false alerts - misconfigured out of the box.

Hoping for others to fix the bugs - decide at next meeting what the actions are.

@phraenquex changed the title from "Add performance monitoring to cluster (understand resilience)" to "Add performance monitoring tools to understand resilience issues" on May 14, 2024
@phraenquex
Collaborator

phraenquex commented May 16, 2024

@tdudgeon says the false alerts are an artefact of being on such an old version of everything: kubernetes, rancher, longhorn, etc.

Please scope out the work.

Immediate action: fire up a test cluster with all the upgraded things, and see if the monitoring still has issues.

@alanbchristie
Collaborator Author

With prometheus and grafana installed we can deploy the stack (using deployment playbooks from the fragalysis-stack-kubernetes repository tagged 2024.12 or later). This deploys a ServiceMonitor definition (and an adjusted Service definition) to export the metrics to prometheus.

Once done, you can add the generic Django dashboard to Grafana by navigating to Dashboards -> Import, choosing Import via grafana.com, and entering the dashboard ID 17658. Change the Name and folder if you wish, select the Prometheus instance, and click Import. The dashboard should then be displayed.
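For readers unfamiliar with the ServiceMonitor mentioned above, here is a minimal sketch of what such a definition looks like when built as a plain Python dict. It is not the definition the playbooks deploy: the function name, the `app` label key, and every name passed in are placeholders chosen for illustration.

```python
import json


def service_monitor(name, namespace, app_label, port_name,
                    path="/metrics", interval="30s"):
    """Build a Prometheus Operator ServiceMonitor manifest as a plain dict."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Selects Services (not Pods) carrying this label.
            "selector": {"matchLabels": {"app": app_label}},
            # 'port' is the *name* of a port on the selected Service.
            "endpoints": [
                {"port": port_name, "path": path, "interval": interval}
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder names only, not the values used by the playbooks.
    manifest = service_monitor("stack-metrics", "stack-ns", "stack", "http")
    print(json.dumps(manifest, indent=2))
```

Because the endpoint refers to a Service port by name, the Service needs a named port, which may be why an adjusted Service definition is deployed alongside it. The printed JSON can be passed to `kubectl apply -f -`.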

@alanbchristie
Collaborator Author

Here are some overnight "out of the box" metrics from the latest staging stack: -

image

Interestingly, this shows a lot of very long response times and some very large response payloads. There are some clear endpoint culprits for response time. These, for example, all appear to take longer than 10 seconds: -

image

And these appear to take more than 25 seconds: -

image
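The same kind of triage (which endpoints exceed 10 seconds, which exceed 25) could be repeated outside the dashboard on exported latency samples. The sketch below assumes the samples are already available as a mapping from endpoint to a list of response times in seconds. The function names and the choice of a nearest-rank 95th percentile are illustrative, not something the dashboard itself uses.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slow_endpoints(latencies_by_endpoint, threshold_s, pct=95):
    """Return {endpoint: percentile latency} for endpoints at or above the threshold."""
    result = {}
    for endpoint, samples in latencies_by_endpoint.items():
        if not samples:
            continue  # no traffic recorded for this endpoint
        value = percentile(samples, pct)
        if value >= threshold_s:
            result[endpoint] = value
    return result
```

Calling `slow_endpoints(samples, 10)` and `slow_endpoints(samples, 25)` would give the two groups described above. A percentile is used rather than the mean so that a few fast cached responses do not hide a consistently slow endpoint.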

@mwinokan
Collaborator

mwinokan commented May 21, 2024

@tdudgeon says:

There is a lot of good output from the monitoring as implemented, e.g. as @alanbchristie has shown above. However, filesystem/volume mounting is very unstable: read/write volumes are frequently (but inconsistently) re-mounted as read-only, potentially due to network connectivity issues.
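One way to spot the read-only re-mounts described above from inside a node or container is to scan the mount table. This is a minimal sketch that parses text in the `/proc/mounts` layout. The function name and the filesystem-type filter are illustrative, and it is not part of the monitoring deployed in this ticket.

```python
def read_only_mounts(mounts_text, fs_types=None):
    """Return mount points whose options include 'ro'.

    mounts_text uses the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_types is not None and fs_type not in fs_types:
            continue
        # Options are comma-separated; match 'ro' exactly, not e.g. 'errors=remount-ro'.
        if "ro" in options.split(","):
            found.append(mount_point)
    return found
```

On Linux, passing the contents of `/proc/mounts` with something like `fs_types={"ext4", "xfs"}` narrows the result to data volumes, since many system mounts are legitimately read-only.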

@alanbchristie
Collaborator Author

We have now deployed the SmokePing utility (mentioned by STFC). This now runs on each node in the DEV cluster, generating ping performance figures for all the other nodes in the cluster (excluding etcd).

The playbooks and documentation for the utility can be found in our new Ansible repo that is used to deploy the container image and related material: -

This allows us to see metrics generated by each node: -

image

The expectation is that if networking is a cause of our resilience issues we might see something in the metrics being generated.
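To illustrate the node-to-node layout described above, the sketch below builds the list of ping targets for each node. It assumes that etcd nodes are left out both as probes and as targets, which is one reading of "excluding etcd". The function name, node names, and role labels are hypothetical and are not taken from the playbooks.

```python
def ping_targets(nodes):
    """Map each non-etcd node to the other non-etcd nodes it should ping.

    nodes: dict of node name -> set of role names, e.g. {"worker"} or {"etcd"}.
    """
    probes = sorted(name for name, roles in nodes.items() if "etcd" not in roles)
    return {src: [dst for dst in probes if dst != src] for src in probes}
```

Each key is a node running a SmokePing instance and each value is the list of peers it measures, so a networking fault between two specific nodes shows up in the figures of both.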

@mwinokan
Collaborator

@alanbchristie says that SmokePing has already been useful to help diagnose issues this morning.

@mwinokan moved this to In Progress (DEV) in Fragalysis on May 29, 2024
@mwinokan
Collaborator

mwinokan commented Jun 4, 2024

@tdudgeon says we now have performance monitoring, so this ticket is done.

@mwinokan moved this from In Progress (DEV) to In production (Done) in Fragalysis on Jun 4, 2024
Projects
Status: In production (Done)
4 participants