Pytorch-lightning AimLogger is finalized after fit, breaking sessions with fit and test routines #3097
Comments
Hi, how did you solve this? Did you have to change the teardown function within Lightning, i.e. disable the penultimate line?
The patch notes seem to say this is fixed (https://aimstack.readthedocs.io/en/latest/generated/CHANGELOG.html#feb-7-2024-fixes), but my code hangs when it reaches the test loop with Aim. No error is raised: the run is marked as finished as soon as the training loop ends, and the code then hangs when it tries to log anything from the test loop.
Hi, yes, exactly. That solution is not great, though: it forces me to tear down the logger manually once the entire process is complete.
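The manual-teardown workaround can be expressed without patching Lightning itself, by overriding the logger's `finalize()` so that the calls made when a stage ends are only recorded, and finalizing for real once at the very end. The sketch below is a minimal illustration under assumptions: `DeferFinalizeMixin`, its `finish()` method, and `RecordingLogger` (a stand-in for a real logger) are hypothetical names, not part of Aim or Lightning.

```python
from typing import Optional


class DeferFinalizeMixin:
    """Hypothetical mixin: ignore trainer-issued finalize() calls until finish().

    Place it before the concrete logger in the base-class list so that its
    finalize() wins in the MRO while finish() can still reach the real one.
    """

    _pending_status: Optional[str] = None

    def finalize(self, status: str = "success") -> None:
        # The trainer calls this when a stage ends (e.g. after fit()).
        # Record the status but leave the underlying run open.
        self._pending_status = status

    def finish(self, status: Optional[str] = None) -> None:
        # Hypothetical explicit teardown: forward to the next class in the MRO.
        final_status = status or self._pending_status or "success"
        super().finalize(final_status)


class RecordingLogger:
    """Hypothetical stand-in for a real logger such as AimLogger."""

    def __init__(self) -> None:
        self.finalized_with = []

    def finalize(self, status: str = "success") -> None:
        self.finalized_with.append(status)


class DeferredRecordingLogger(DeferFinalizeMixin, RecordingLogger):
    pass


logger = DeferredRecordingLogger()
logger.finalize("success")    # what the trainer would do after fit()
print(logger.finalized_with)  # [] : the run is still open for test()
logger.finish()               # call once, after every stage has finished
print(logger.finalized_with)  # ['success']
```

With the real logger, the same pattern would be `class DeferredAimLogger(DeferFinalizeMixin, AimLogger): pass` (with `AimLogger` imported from `aim.pytorch_lightning`), passed to the `Trainer`, with `finish()` called after `trainer.test(...)`. Whether skipping Lightning's finalize has other side effects has not been verified here.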
It looks like #3134 will fix this.
❓Question
I am setting up a PyTorch Lightning experiment and using the `AimLogger` object to log training and validation losses, as well as test results.
However, while tracking during `trainer.fit` works perfectly, it breaks when `trainer.test` tries to load the model. The final exception thrown is `TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'`.

My workaround was to disable the `logger.finalize()` call in the fit loop teardown routine, but that can't be the right solution. Is this behavior the result of some change in how PyTorch Lightning handles teardown that AimStack has not tracked? Or is there anything I should set to prevent this from happening?
I am using a remote repository, but I have confirmed that there is no issue with a local one, which makes me think some timing-related settings may be at play here.
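One plausible, unconfirmed reading of the `TypeError`: it is raised while Python tries to rebuild another exception, a class named `Timeout` whose constructor requires `lock_file` (the shape appears to match `Timeout` in the `filelock` package). If such an exception calls `super().__init__()` without its argument, `self.args` stays empty and pickling it, for instance to send it across a process or network boundary, produces exactly this message on unpickling, which would also fit the remote-only failure. The stand-in classes below reproduce only that Python mechanism; they are not Aim or filelock code, and whether Aim's remote tracking path serializes exceptions this way is an assumption.

```python
import pickle


class Timeout(TimeoutError):
    """Stand-in with the same shape as a lock-timeout exception.

    __init__ requires lock_file but calls super().__init__() without it,
    so self.args stays () and pickle cannot rebuild the object.
    """

    def __init__(self, lock_file: str) -> None:
        super().__init__()
        self.lock_file = lock_file


class PicklableTimeout(TimeoutError):
    """Same exception, but the required argument is kept in self.args."""

    def __init__(self, lock_file: str) -> None:
        super().__init__(lock_file)
        self.lock_file = lock_file


# Round-trip through pickle, as can happen when an exception is sent
# across a process or RPC boundary.
try:
    pickle.loads(pickle.dumps(Timeout("run.lock")))
except TypeError as exc:
    print(exc)  # ... missing 1 required positional argument: 'lock_file'

print(pickle.loads(pickle.dumps(PicklableTimeout("run.lock"))).lock_file)  # run.lock
```

If this is what happens, the `TypeError` masks the original lock timeout, so the real question becomes why the run's lock is still held when the test stage reopens it.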
Here is a snippet representing my setup:
I am also adding the traceback I got: