Remove unintended session initialization on TF backend #8377

Merged: 3 commits merged into keras-team:master on Nov 5, 2017

Conversation

@datumbox (Contributor) commented Nov 3, 2017

This PR tries to address issues #8353, #8311 and #8374. Unfortunately, they affect each other.

A recent PR #8021 added a check at the beginning of tensorflow_backend.py to determine the number of available GPUs. Unfortunately, the tensorflow.python.client.device_lib.list_local_devices() method seems to create a new TF Session and register all the available GPUs on the system. To resolve this, I replace all the device_lib.list_local_devices() calls with K.get_session().list_devices().
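
For context, a minimal sketch of the swap described above (assuming the TF 1.x APIs; this is an illustration, not the exact PR diff):

# Before: listing devices via device_lib implicitly creates a TF session
# and registers every visible GPU at import time.
from tensorflow.python.client import device_lib
gpu_names = [d.name for d in device_lib.list_local_devices()
             if d.device_type == 'GPU']

# After: reuse the session Keras already manages, so no extra session is created.
from keras import backend as K
gpu_names = [d.name for d in K.get_session().list_devices()
             if d.device_type == 'GPU']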

Moreover, in commit 9166733 (line 174), if candidate_vars is empty we call session.run() with an empty list, which causes a RuntimeException. This is addressed by checking whether the list is empty first.
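
A minimal sketch of that guard, assuming the TF 1.x API; the names candidate_vars and session follow the PR description, and the surrounding variable-initialization logic is simplified:

import tensorflow as tf

def _initialize_uninitialized(session, candidate_vars):
    # Guard against session.run() on an empty list, which the PR reports as raising an error.
    if not candidate_vars:
        return
    is_initialized = session.run(
        [tf.is_variable_initialized(v) for v in candidate_vars])
    uninitialized = [v for v, flag in zip(candidate_vars, is_initialized) if not flag]
    if uninitialized:
        session.run(tf.variables_initializer(uninitialized))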

NOTE: The unit tests passed on my machine, but I've been having timeout issues on Travis today. Please check the code; don't merge directly. Happy to update the PR based on feedback if necessary.

@ahundt (Contributor) commented Nov 3, 2017

These look like nice changes.

@datumbox (Contributor, Author) commented Nov 3, 2017

@fchollet the tests still timeout. Let me know what you think.

@fchollet (Collaborator) left a review comment:

Looks good to me; one question



 def _normalize_device_name(name):
-    name = name.lower().replace('device:', '')
+    name = '/' + name.lower().split('device:')[1]
@fchollet (Collaborator):

Why would this line change at all?

Note that tests for this util are not run on Travis, so please run them manually on a multi-GPU machine.

@datumbox (Contributor, Author):

The output of the two functions is different for some reason, so I need to treat it differently to get the same names out.

@fchollet (Collaborator):

Why is it different, and what does it look like in both cases?

@datumbox (Contributor, Author) commented Nov 4, 2017:

I have no idea why it looks different; this is on the TensorFlow side. Here is how they look in each case:

>>> from tensorflow.python.client import device_lib
>>> l1=[d.name for d in device_lib.list_local_devices()]
>>> l1
[u'/cpu:0']
>>> from keras import backend as K
>>> l2=[d.name for d in K.get_session().list_devices()]
>>> l2
['/job:localhost/replica:0/task:0/device:CPU:0']

That's why I split on "device:", lowercase the name, and prepend a slash, so the two look the same, which the GPU utils code requires.
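
Illustratively, the normalization boils down to something like this (a toy check, not the exact Keras code):

def _normalize_device_name(name):
    # '/job:localhost/replica:0/task:0/device:CPU:0' -> '/cpu:0'
    return '/' + name.lower().split('device:')[1]

assert _normalize_device_name('/job:localhost/replica:0/task:0/device:CPU:0') == '/cpu:0'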

@datumbox (Contributor, Author):

BTW, this device_lib.list_local_devices() method is extremely problematic. In earlier versions of TF it actually registered devices that were not whitelisted in CUDA_VISIBLE_DEVICES. The K.get_session().list_devices() method does the trick just fine; I've been using it on large GPU clusters (Keras + Spark) for distributed prediction and never had an issue.

@datumbox (Contributor, Author) commented Nov 3, 2017

@fchollet I would still be more comfortable if the tests actually pass on Travis. I'll check again on a different machine but it might be worth rerunning them tomorrow on Travis in case they have an issue today.

@fchollet (Collaborator) commented Nov 3, 2017

I'll rerun and see what happens. There is indeed a chance that this PR is causing the tests to stall on Travis.

@fchollet (Collaborator) commented Nov 3, 2017

Regardless of Travis status, please run the tests for multi_gpu_model manually on a multi-GPU machine since these aren't run on Travis at all.

@fchollet (Collaborator) commented Nov 3, 2017

The tests on Travis are reliably stalling, but it's impossible at this point to tell whether it's a Travis issue or an issue with the PR. We'll know soon.

@datumbox (Contributor, Author) commented Nov 3, 2017

Let's agree not to merge until the tests on Travis are fixed. It's way better to close the PR than to merge something that "looks correct" but renders the tests on Travis unusable.

I just got home; I'll test the codebase with a profiler on a different machine. I'm more than happy to close the PR, send a new one, or let someone else have a go. At this point, consider this PR a sketch of a potential solution, not necessarily the best one. :)

@ozabluda (Contributor) commented Nov 4, 2017

The Travis tests got to the point of giving meaningful info. For both Python 2 and Python 3, only the TF backend stalls, in
tests/keras/utils/data_utils_test.py::test_finite_generator_enqueuer_processes.

@datumbox (Contributor, Author) commented Nov 4, 2017

@ozabluda Thanks for looking into this! Can you provide some more info on that? I wrote that test in the previous release and there is nothing backend-specific in it. Also, in all my runs the test seems to pass fine.

@datumbox (Contributor, Author) commented Nov 4, 2017

Updates:

It's the test_inceptionresnetv2_notop test
Let's look at the log of https://travis-ci.org/datumbox/keras/jobs/297166691

  • Line 1288: This is the last time gw0 reports a PASS.
  • Line 1289: gw0 seems to start test_inceptionresnetv2_notop but it never finishes.
  • Line 1290+: only gw1 runs the tests. gw0 is stuck, and gw1 eventually runs out of tasks. Travis times out 10 minutes after gw1 reports the last pass.

Running it on its own works fine
I tried running the code of test_inceptionresnetv2_notop in a Python shell:
https://pastebin.com/4UqmQJs4

In a small VM with 1 CPU it takes 218 seconds, but it does finish. So it seems test_inceptionresnetv2_notop is the test that stalls on Travis.

Reverting to the previous get_session() method
I started a temp branch on top of the PR where I revert commit 9166733:
datumbox@6af3b2c

Running the test in the Python shell finishes in 105 seconds; nevertheless, the tests on Travis are still stalling. Edit: this was a single measurement.

My guess is that the 3 reported issues are connected. By removing the GPU count from the backend, the session is no longer initialized on import but rather the first time we call get_session(). This might have all sorts of side effects. I'm running out of ideas here...

@datumbox (Contributor, Author) commented Nov 4, 2017

I patched the damn thing...

Apparently requesting the available devices is VERY costly. This is why @TimZaman used a cache variable, populated only once, in his original PR. In hindsight, I should have understood this earlier... This is the reason the unit tests were stalling. Anyway, the only thing that needed to change from the original PR is where we actually populate the variable and how.

The initialization happens the first time _get_available_gpus() is called, and we use get_session().list_devices() to populate it rather than device_lib.list_local_devices(). I considered putting the initialization in get_session(), but there is one test that calls _get_available_gpus() directly without starting a session.
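
A minimal sketch of the lazy caching described above; it uses the public K.get_session() accessor for the sake of being standalone (inside tensorflow_backend.py it would be the module's own get_session()), and the real implementation's details are simplified:

_LOCAL_DEVICES = None  # module-level cache, populated at most once

def _get_available_gpus():
    """Return available GPU device names, querying TF only on the first call."""
    global _LOCAL_DEVICES
    if _LOCAL_DEVICES is None:
        from keras import backend as K
        # Listing devices through the existing session avoids spawning a second one.
        _LOCAL_DEVICES = K.get_session().list_devices()
    return [x.name for x in _LOCAL_DEVICES if x.device_type == 'GPU']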

@fchollet I also ran multi_gpu_model on AWS on a p2.8xlarge and it works fine. Here is the proof: https://pastebin.com/7EE3Psqa

The Travis tests pass and the PR is complete from my side, so if there are no objections please merge. Also, if possible, I would suggest creating a hotfix release for this. It would be particularly useful for those who do distributed prediction using Keras + Spark.

@TimZaman (Contributor) commented Nov 5, 2017

Reintroducing the lazy device loading, and swapping out device_lib.list_local_devices() for get_session().list_devices() LGTM!

Not sure how the other code you added regarding candidate_vars relates to the above; if it doesn't, I'd suggest keeping the PR atomic so it merges fast.

@alsrgv (Contributor) commented Nov 5, 2017

I just tested this fix with Horovod and it works well. Looking forward to merging and a new release.

@fchollet (Collaborator) left a review comment:

LGTM, thanks!

@fchollet merged commit fabb40b into keras-team:master on Nov 5, 2017
@fchollet (Collaborator) commented Nov 5, 2017

In all likelihood we will release 2.1.0 within a week. It turned out there were several bugs in 2.0.9.

datumbox referenced this pull request on Nov 24, 2017:
* fix list_devices() function is not available issue

When using TensorFlow as the backend and the TensorFlow version is below r1.3, the list_devices() function is not available on the Session instance. This can cause issues such as keras.utils.multi_gpu_model(model, gpus) not working under TensorFlow r1.3. This change is a hack to fix that issue.

* Update tensorflow_backend.py

fix pep8 check failure

* Update tensorflow_backend.py

follow fchollet review comment. use `if not`.
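
One possible shape of such a compatibility shim, sketched here as a hypothetical helper (the actual commit may wire it differently), for TF versions below r1.3 where Session.list_devices() does not exist:

def _list_devices(session):
    # Session.list_devices() is only available from TF r1.3 onward; older
    # versions fall back to device_lib, even though it spins up its own session.
    if not hasattr(session, 'list_devices'):
        from tensorflow.python.client import device_lib
        return device_lib.list_local_devices()
    return session.list_devices()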