Remove unintended session initialization on TF backend #8377
Conversation
These look like nice changes.
@fchollet the tests still time out. Let me know what you think.
Looks good to me; one question
def _normalize_device_name(name):
-    name = name.lower().replace('device:', '')
+    name = '/' + name.lower().split('device:')[1]
Why would this line change at all?
Note that tests for this util are not run on Travis, so please run them manually on a multi-GPU machine.
The output of the two functions is different for some reason, so I need to treat it differently to get the same names out.
Why is it different, and what does it look like in both cases?
I have no idea why it looks different; this is on the TensorFlow side. Here is how they look in each case:
>>> from tensorflow.python.client import device_lib
>>> l1=[d.name for d in device_lib.list_local_devices()]
>>> l1
[u'/cpu:0']
>>> from keras import backend as K
>>> l2=[d.name for d in K.get_session().list_devices()]
>>> l2
['/job:localhost/replica:0/task:0/device:CPU:0']
That's why I split on "device:", lowercase the name, and prepend a slash, so that both look the same, which is what the GPU utils code requires.
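For reference, a minimal check of that normalization, mirroring the changed line in the diff above (the explicit return is only added here so the result can be printed):

>>> def _normalize_device_name(name):
...     name = '/' + name.lower().split('device:')[1]
...     return name
...
>>> _normalize_device_name('/job:localhost/replica:0/task:0/device:CPU:0')
'/cpu:0'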
BTW this device_lib.list_local_devices method is extremely problematic. In earlier versions of TF it was actually registering devices that were not whitelisted in CUDA_VISIBLE_DEVICES. The K.get_session().list_devices() method does the trick just fine; I've been using it in large GPU clusters (Keras + Spark) for distributed prediction and never had an issue.
@fchollet I would still be more comfortable if the tests actually passed on Travis. I'll check again on a different machine, but it might be worth rerunning them on Travis tomorrow in case Travis is having an issue today.
I'll rerun and see what happens. There is indeed a chance that this PR is causing the tests on Travis to stall.
Regardless of Travis status, please run the tests for
The tests on Travis are reliably stalling, but it's impossible at this point to tell whether it's a Travis issue or an issue with the PR. We'll know soon.
Let's agree not to merge until the tests on Travis are fixed. It's way better to close the PR than to merge something that "looks correct" but renders the tests on Travis unusable. I just got home; I'll test the codebase with a profiler on a different machine. I'm more than happy to close the PR, send a new one, or let someone else have a go. At this point, consider that this PR sketches a potential solution, but not necessarily the best one. :)
The Travis tests got to the point of giving meaningful info. For both Python 2 and Python 3, only the TF backend stalls in
@ozabluda Thanks for looking into this! Can you provide some more info on that? I wrote that test in the previous release and there is nothing backend-specific about it. Also, in all my runs the test seems to pass fine.
Updates: it's the test_inceptionresnetv2_notop test.
Running it on its own works fine. In a small VM with 1 CPU it takes 218 seconds, but it does finish. So it seems the test_inceptionresnetv2_notop test is stalling. Reverting to the previous get_session() method
My guess is that the 3 reported issues are connected. By removing the GPU count from the backend, the session is no longer initialized on import but rather the first time we call
I patched the damn thing... Apparently requesting the available devices is VERY costly. This is why @TimZaman, in his original PR, uses a cache variable which is populated only once. In hindsight, I should have understood this earlier... This is the reason the unit tests were stalling. Anyway, the only thing that needed to change in the original PR is where we actually populate the variable and how. The initialization happens the first time the
@fchollet I also ran multi_gpu_model on AWS on a p2.8xlarge and it works fine. Here is proof: https://pastebin.com/7EE3Psqa
The Travis tests pass and the PR is complete from my side, so if there are no objections please merge. Also, if possible, I would suggest creating a hotfix release for this. It would be particularly useful for those who do distributed prediction using Keras + Spark.
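For context, a minimal sketch of the cached, lazy device lookup described above; the names _LOCAL_DEVICES and _get_available_devices are illustrative, not necessarily the exact ones used in the PR:

from keras import backend as K

_LOCAL_DEVICES = None  # module-level cache, empty until first use

def _get_available_devices():
    global _LOCAL_DEVICES
    if _LOCAL_DEVICES is None:
        # The expensive device query happens only once, the first time it is
        # actually needed, instead of at import time.
        _LOCAL_DEVICES = [d.name for d in K.get_session().list_devices()]
    return _LOCAL_DEVICES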
Reintroducing the lazy device loading, and swapping out
Not sure how the other code you added regarding
I just tested this fix with Horovod and it works well. Looking forward to merging and a new release.
LGTM, thanks!
In all likelihood we will release 2.1.0 within a week. It turned out there were several bugs in 2.0.9.
* Fix "list_devices() function is not available" issue. When using TensorFlow as the backend and the TensorFlow version is below r1.3, the list_devices() function is not available on the Session instance. This can cause issues such as the keras.utils.multi_gpu_model(model, gpus) function not working with TensorFlow versions below r1.3. This change is a hack to fix that issue.
* Update tensorflow_backend.py: fix PEP 8 check failure.
* Update tensorflow_backend.py: follow fchollet's review comment; use `if not`.
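A hedged sketch of the kind of version-compatibility fallback that commit message describes; the helper name _list_devices is illustrative, and the exact guard in the real code may differ:

from tensorflow.python.client import device_lib

def _list_devices(session):
    if not hasattr(session, 'list_devices'):
        # TF < r1.3: Session has no list_devices(); fall back to device_lib.
        return device_lib.list_local_devices()
    return session.list_devices()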
This PR tries to address issues #8353, #8311 and #8374. Unfortunately, they affect each other.

A recent PR, #8021, added a check at the beginning of tensorflow_backend.py to determine the number of available GPUs. Unfortunately, the tensorflow.python.client.device_lib.list_local_devices() method seems to create a new TF Session and register all the available GPUs on the system. To resolve this, I replace all the device_lib.list_local_devices() calls with K.get_session().list_devices().

Moreover, on commit 9166733 line 174, if candidate_vars is empty we call session.run() with an empty list, which causes a RuntimeException. This is addressed by checking whether the variable is empty.

NOTE: The unit tests passed on my machine, but I've been having issues with timeouts on Travis today. Please check the code and don't merge directly. Happy to update the PR if necessary based on feedback.
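For illustration, a minimal sketch of the empty-list guard described above, assuming a variable-initialization helper along the lines of the one at that commit; only candidate_vars and the early return reflect the description, the surrounding code is illustrative:

import tensorflow as tf

def _initialize_variables(session):
    variables = tf.global_variables()
    candidate_vars = [v for v in variables
                      if not getattr(v, '_keras_initialized', False)]
    if not candidate_vars:
        # Guard: calling session.run() with an empty list triggers the
        # RuntimeException mentioned above, so return early instead.
        return
    is_initialized = session.run(
        [tf.is_variable_initialized(v) for v in candidate_vars])
    uninitialized_vars = []
    for flag, v in zip(is_initialized, candidate_vars):
        if not flag:
            uninitialized_vars.append(v)
        v._keras_initialized = True
    if uninitialized_vars:
        session.run(tf.variables_initializer(uninitialized_vars))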