GIL prometheus metrics are misleading #8557
cc @ntabris, not to do anything here, but because I think you've also found this annoying in the past.
I suspect the way this is aggregated here: distributed/system_monitor.py, lines 199 to 203 at b1597b6.
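For context, a paraphrase of what I believe those lines do, reconstructed from gilknocker's `KnockKnock` API rather than copied verbatim:

```python
# Paraphrase of the aggregation in SystemMonitor.update() -- not the actual
# code. gilknocker's KnockKnock exposes contention_metric and
# reset_contention_metric(), which I assume are used roughly like this:
from gilknocker import KnockKnock

knocker = KnockKnock(polling_interval_micros=1000)
knocker.start()

def update() -> dict:
    # Called periodically on the worker's event loop.
    result = {}
    # contention_metric is the fraction of time, since the last reset,
    # that the knocker thread spent waiting to acquire the GIL.
    result["gil_contention"] = knocker.contention_metric
    # Resetting here means each sample only covers the window since the
    # previous update() call; Prometheus then scrapes whichever sample
    # happens to be the latest, discarding everything in between.
    knocker.reset_contention_metric()
    return result
```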
I suspect that if the interpreter stands still for a minute, this will report a much lower contention than what actually happened. However, I'm just guessing, since I don't know exactly what gilknocker's contention metric measures internally.
The raw XY plot from the system_monitor looks much better. I think there's an additional problem, though: those points at zero during the second and third phase don't seem right to me. The iteration is to hold the GIL for 15s -> wait ~0.02s for the next task -> hold the GIL for 15s again, so I would expect everything to be close to 1. I suspect the GIL was acquired by my hog function between the sampling of time() and the measure of gilknocker, or maybe between fetching the measure and the reset call.
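To pinpoint where I mean, here is a sketch of the sampling step with the two suspected gaps marked (paraphrased names, not the actual distributed code):

```python
# The two suspected gaps in the sampling step, sketched; knocker setup is
# repeated from the paraphrase above and is an assumption, not verbatim code.
import time
from gilknocker import KnockKnock

knocker = KnockKnock(polling_interval_micros=1000)
knocker.start()

t = time.time()                         # 1. sample the timestamp
# gap 1: hog() may acquire the GIL here, between time() and the read below
contention = knocker.contention_metric  # 2. fetch the accumulated measure
# gap 2: or here, between fetching the measure and the reset
knocker.reset_contention_metric()       # 3. start a new measurement window
```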
I've run a benchmark on a 2-worker cluster that simulates these use cases:
In all cases, the Bokeh dashboard displays the GIL metric more or less constantly at 50%. This could be improved, as it hides the disparity between the workers (it would be better IMHO to have something like min+median+max), but otherwise it is accurate.
The same metric on Prometheus (which, on Coiled, scrapes every 5 seconds), however, is misleading.
In all cases, I would expect the metric for the affected worker to read 100%. Instead it reads:
In the first three cases, this gives a misleading sense that the GIL is not a problem.
The last case makes no sense to me: how can GIL contention be more than 100%? (Again, the Bokeh dashboard shows a 50% cluster average, or 100% on the affected worker.)
This uses Coiled, but a plain LocalCluster will yield the same results (assuming you have Prometheus scraping it).
Reproducer
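The original Coiled-based script is not reproduced here; below is a minimal sketch of the pattern described above (hold the GIL for ~15s, pause ~0.02s, repeat), where the `hog` body, cluster sizing, and scrape setup are my assumptions:

```python
# Minimal sketch of the benchmark described above, assuming a LocalCluster
# and a pure-Python GIL hog; the original (Coiled-based) script differed.
import time

from distributed import Client, LocalCluster

def hog(seconds: float = 15.0) -> None:
    """Hold the GIL in a tight pure-Python loop for ~`seconds`."""
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass  # busy-wait; keeps the GIL contended for the whole duration

if __name__ == "__main__":
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
        with Client(cluster) as client:
            # Point Prometheus at the workers' /metrics endpoints and
            # scrape every 5s to mimic the Coiled setup described above.
            for _ in range(10):
                client.submit(hog, 15.0, pure=False).result()
                time.sleep(0.02)  # the ~0.02s gap between tasks
```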