Add performance monitoring tools to understand resilience issues #1427
Comments
@tdudgeon has implemented Prometheus on a dev cluster, but there are some issues that need resolving. To be discussed at the next meeting.
Some metrics targets are broken when you do a vanilla install of Prometheus.
One of the STFC users (James Adams) suggested using this to monitor network connectivity within the cluster nodes:
@tdudgeon thinks that SmokePing will aid in monitoring network connectivity, but needs to discuss with @alanbchristie whether we need it or whether Prometheus already covers this node-level connectivity monitoring. Likely 1-2 days' work to implement SmokePing if it's needed. (Alan is back on Monday.)
@tdudgeon says it's firing false alerts: it is misconfigured out of the box. Hoping for others to fix the bugs; decide at the next meeting what the actions are.
@tdudgeon says the false alerts are an artefact of being on such old versions of everything: Kubernetes, Rancher, Longhorn, etc. Please scope out the work. Immediate action: fire up a test cluster with everything upgraded and see if the monitoring still has issues.
With Prometheus and Grafana installed we can deploy the stack (using deployment playbooks from the fragalysis-stack-kubernetes repository tagged Once done you can then add the generic Django dashboard to Grafana by navigating to its
@tdudgeon says: There is a lot of good output from the monitoring as implemented, e.g. as @alanbchristie has shown above. However, filesystem/volume mounting is very unstable: frequently (but inconsistently) r/w volumes are incorrectly re-mounted as read-only, potentially due to network connectivity issues.
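One way to spot the read-only re-mounts described above is to scan a node's mount table. The following is only a minimal sketch, not what was deployed: the function name `read_only_mounts` and the default filesystem types are assumptions made for illustration, and it relies only on the standard `/proc/mounts` layout (device, mount point, filesystem type, comma-separated options).

```python
def read_only_mounts(mounts_text, fstypes=("ext4", "xfs")):
    """Return mount points whose options include 'ro'.

    mounts_text: contents of /proc/mounts, one mount per line.
    fstypes: filesystem types to inspect (an assumed default; adjust
             to whatever the cluster's volumes actually use).
    """
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        # Options are comma-separated; 'ro' marks a read-only mount.
        if fstype in fstypes and "ro" in options.split(","):
            read_only.append(mountpoint)
    return read_only


if __name__ == "__main__":
    # On a Linux node, read the live mount table.
    with open("/proc/mounts") as handle:
        for mountpoint in read_only_mounts(handle.read()):
            print("read-only:", mountpoint)
```

Run periodically on each node, this would at least timestamp when a volume flips to read-only, which could then be compared against the network connectivity data.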
@alanbchristie says that SmokePing has already been useful to help diagnose issues this morning.
@tdudgeon says we now have performance monitoring, so this ticket is done.
Main purpose: we want to understand resilience issues.
Work to add monitoring (Prometheus, Grafana, Sentry) to the deployed stacks. Even minimal work should give us access to Kubernetes metrics and, by installing a package in the Fragalysis Django app, basic API performance metrics.
See the Google doc that describes the options:
https://docs.google.com/document/d/1V3D0-dAMQvscYN3bUhcWvHpEWN7k7ZjkwlDAkDeKBqs/edit?usp=sharing
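The issue does not name the Django package. As an illustration only, assuming the `django-prometheus` package were chosen, the settings change might look like the sketch below. The helper `with_prometheus` is hypothetical; the app label and middleware paths are the ones documented by `django-prometheus`.

```python
# Middleware classes provided by django-prometheus (an assumed package choice).
PROMETHEUS_BEFORE = "django_prometheus.middleware.PrometheusBeforeMiddleware"
PROMETHEUS_AFTER = "django_prometheus.middleware.PrometheusAfterMiddleware"


def with_prometheus(installed_apps, middleware):
    """Return (apps, middleware) with django-prometheus enabled.

    Hypothetical helper: adds the app once and wraps the existing
    middleware so request timing covers the whole stack.
    """
    apps = list(installed_apps)
    if "django_prometheus" not in apps:
        apps.append("django_prometheus")
    # Drop any existing entries so repeated calls stay idempotent.
    inner = [m for m in middleware
             if m not in (PROMETHEUS_BEFORE, PROMETHEUS_AFTER)]
    return apps, [PROMETHEUS_BEFORE, *inner, PROMETHEUS_AFTER]


# In settings.py (sketch):
#   INSTALLED_APPS, MIDDLEWARE = with_prometheus(INSTALLED_APPS, MIDDLEWARE)
# In urls.py, the package's URLs expose a /metrics endpoint for scraping:
#   path("", include("django_prometheus.urls"))
```

Prometheus would then need a scrape target pointing at the stack's `/metrics` endpoint; how that is wired up depends on the cluster's Prometheus installation.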