Remove unintended session initialization on TF backend #8377

Merged: 3 commits merged into keras-team:master on Nov 5, 2017

Conversation

@datumbox (Contributor) commented Nov 3, 2017

This PR tries to address issues #8353, #8311 and #8374. Unfortunately, they affect each other.

A recent PR #8021 added a check at the beginning of tensorflow_backend.py to determine the number of available GPUs. Unfortunately, the tensorflow.python.client.device_lib.list_local_devices() method seems to create a new TF Session and register all the available GPUs on the system. To resolve this, I replace all the device_lib.list_local_devices() calls with K.get_session().list_devices().
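
For context, a minimal sketch of the swap described above (assuming the TF 1.x APIs; this is an illustration, not the exact PR diff):

# Before: listing devices via device_lib implicitly creates a TF session
# and registers every visible GPU at import time.
from tensorflow.python.client import device_lib
gpu_names = [d.name for d in device_lib.list_local_devices()
             if d.device_type == 'GPU']

# After: reuse the session Keras already manages, so no extra session is created.
from keras import backend as K
gpu_names = [d.name for d in K.get_session().list_devices()
             if d.device_type == 'GPU']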

Moreover, in commit 9166733 (line 174), if candidate_vars is empty we call session.run() with an empty list, which causes a RuntimeException. This is addressed by checking whether the list is empty first.
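
A minimal sketch of that guard, assuming the TF 1.x API; the names candidate_vars and session follow the PR description, and the surrounding variable-initialization logic is simplified:

import tensorflow as tf

def _initialize_uninitialized(session, candidate_vars):
    # Guard against session.run() on an empty list, which the PR reports as raising an error.
    if not candidate_vars:
        return
    is_initialized = session.run(
        [tf.is_variable_initialized(v) for v in candidate_vars])
    uninitialized = [v for v, flag in zip(candidate_vars, is_initialized) if not flag]
    if uninitialized:
        session.run(tf.variables_initializer(uninitialized))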

NOTE: The unit tests passed on my machine, but I've been having timeout issues on Travis today. Please check the code; don't merge directly. Happy to update the PR based on feedback if necessary.

@ahundt (Contributor) commented Nov 3, 2017

These look like nice changes.

@datumbox (Contributor, Author) commented Nov 3, 2017

@fchollet the tests still timeout. Let me know what you think.

@fchollet (Collaborator) left a review comment:

Looks good to me; one question



 def _normalize_device_name(name):
-    name = name.lower().replace('device:', '')
+    name = '/' + name.lower().split('device:')[1]
@fchollet (Collaborator):

Why would this line change at all?

Note that tests for this util are not run on Travis, so please run them manually on a multi-GPU machine.

@datumbox (Contributor, Author):

The output of the two functions is different for some reason, so I need to treat it differently to get the same names out.

@fchollet (Collaborator):

Why is it different, and what does it look like in both cases?

@datumbox (Contributor, Author) commented Nov 4, 2017:

I have no idea why it looks different; this is on the TensorFlow side. Here is how they look in each case:

>>> from tensorflow.python.client import device_lib
>>> l1=[d.name for d in device_lib.list_local_devices()]
>>> l1
[u'/cpu:0']
>>> from keras import backend as K
>>> l2=[d.name for d in K.get_session().list_devices()]
>>> l2
['/job:localhost/replica:0/task:0/device:CPU:0']

That's why I split on "device:", lowercase the name, and prepend a slash, so the two look the same, which the GPU utils code requires.
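
Illustratively, the normalization boils down to something like this (a toy check, not the exact Keras code):

def _normalize_device_name(name):
    # '/job:localhost/replica:0/task:0/device:CPU:0' -> '/cpu:0'
    return '/' + name.lower().split('device:')[1]

assert _normalize_device_name('/job:localhost/replica:0/task:0/device:CPU:0') == '/cpu:0'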

@datumbox (Contributor, Author):

BTW, this device_lib.list_local_devices() method is extremely problematic. In earlier versions of TF it actually registered devices that were not whitelisted in CUDA_VISIBLE_DEVICES. The K.get_session().list_devices() method does the trick just fine; I've been using it on large GPU clusters (Keras + Spark) for distributed prediction and never had an issue.

@datumbox (Contributor, Author) commented Nov 3, 2017

@fchollet I would still be more comfortable if the tests actually pass on Travis. I'll check again on a different machine but it might be worth rerunning them tomorrow on Travis in case they have an issue today.

@fchollet (Collaborator) commented Nov 3, 2017

I'll rerun and see what happens. There is indeed a chance that this PR is causing the tests to stall on Travis.

@fchollet (Collaborator) commented Nov 3, 2017

Regardless of Travis status, please run the tests for multi_gpu_model manually on a multi-GPU machine since these aren't run on Travis at all.

@fchollet (Collaborator) commented Nov 3, 2017

The tests on Travis are reliably stalling, but it's impossible at this point to tell whether it's a Travis issue or an issue with the PR. We'll know soon.

@datumbox (Contributor, Author) commented Nov 3, 2017

Let's agree not to merge until the tests on Travis are fixed. It's way better to close the PR than to merge something that "looks correct" but renders the tests on Travis unusable.

I just got home; I'll test the codebase with a profiler on a different machine. I'm more than happy to close the PR, send a new one, or let someone else have a go. At this point, consider this PR a sketch of a potential solution, not necessarily the best one. :)

@ozabluda (Contributor) commented Nov 4, 2017

The Travis tests got to the point of giving meaningful info. For both Python 2 and Python 3, only the TF backend stalls, in
tests/keras/utils/data_utils_test.py::test_finite_generator_enqueuer_processes.

@datumbox (Contributor, Author) commented Nov 4, 2017

@ozabluda Thanks for looking into this! Can you provide some more info on that? I wrote that test in the previous release and there is nothing backend-specific in it. Also, in all my runs the test seems to pass fine.

@datumbox (Contributor, Author) commented Nov 4, 2017

Updates:

It's the test_inceptionresnetv2_notop test
Let's look at the log of https://travis-ci.org/datumbox/keras/jobs/297166691

  • Line 1288: This is the last time gw0 reports a PASS.
  • Line 1289: gw0 seems to start test_inceptionresnetv2_notop but it never finishes.
  • Line 1290+: only gw1 runs the tests. gw0 is stuck, and gw1 eventually runs out of tasks. Travis times out 10 minutes after gw1 reports the last pass.

Running it on its own works fine
I tried running the code of test_inceptionresnetv2_notop in a Python shell:
https://pastebin.com/4UqmQJs4

In a small VM with 1 CPU it takes 218 seconds, but it does finish. So it seems test_inceptionresnetv2_notop is the test that stalls on Travis.

Reverting to the previous get_session() method
I started a temp branch on top of the PR where I revert commit 9166733:
datumbox@6af3b2c

Running the test in the Python shell finishes in 105 seconds; nevertheless, the tests on Travis are still stalling. Edit: this was a single measurement.

My guess is that the 3 reported issues are connected. By removing the GPU count from the backend, the session is no longer initialized on import but rather the first time we call get_session(). This might have all sorts of side effects. I'm running out of ideas here...

@datumbox (Contributor, Author) commented Nov 4, 2017

I patched the damn thing...

Apparently requesting the available devices is VERY costly. This is why @TimZaman used a cache variable, populated only once, in his original PR. In hindsight, I should have understood this earlier... This is the reason the unit tests were stalling. Anyway, the only thing that needed to change from the original PR is where we actually populate the variable and how.

The initialization happens the first time _get_available_gpus() is called, and we use get_session().list_devices() to populate it rather than device_lib.list_local_devices(). I considered putting the initialization in get_session(), but there is one test that calls _get_available_gpus() directly without starting a session.
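
A minimal sketch of the lazy caching described above; it uses the public K.get_session() accessor for the sake of being standalone (inside tensorflow_backend.py it would be the module's own get_session()), and the real implementation's details are simplified:

_LOCAL_DEVICES = None  # module-level cache, populated at most once

def _get_available_gpus():
    """Return available GPU device names, querying TF only on the first call."""
    global _LOCAL_DEVICES
    if _LOCAL_DEVICES is None:
        from keras import backend as K
        # Listing devices through the existing session avoids spawning a second one.
        _LOCAL_DEVICES = K.get_session().list_devices()
    return [x.name for x in _LOCAL_DEVICES if x.device_type == 'GPU']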

@fchollet I also ran multi_gpu_model on AWS on a p2.8xlarge and it works fine. Here is the proof: https://pastebin.com/7EE3Psqa

The Travis tests pass and the PR is complete from my side, so if there are no objections please merge. Also, if possible, I would suggest creating a hotfix release for this. It would be particularly useful for those who do distributed prediction using Keras + Spark.

@TimZaman (Contributor) commented Nov 5, 2017

Reintroducing the lazy device loading, and swapping out device_lib.list_local_devices() for get_session().list_devices() LGTM!

Not sure how the other code you added regarding candidate_vars relates to the above; if it doesn't, I'd suggest keeping the PR atomic so it merges fast.

@alsrgv (Contributor) commented Nov 5, 2017

I just tested this fix with Horovod and it works well. Looking forward to merging and a new release.

@fchollet (Collaborator) left a review comment:

LGTM, thanks!

@fchollet merged commit fabb40b into keras-team:master on Nov 5, 2017
@fchollet (Collaborator) commented Nov 5, 2017

In all likelihood we will release 2.1.0 within a week. It turned out there were several bugs in 2.0.9.

datumbox referenced this pull request on Nov 24, 2017:
* fix list_devices() function is not available issue

When using TensorFlow as the backend and the TensorFlow version is below r1.3, the list_devices() function is not available on the Session instance. This can cause issues such as keras.utils.multi_gpu_model(model, gpus) not working under TensorFlow r1.3. This change is a hack to fix that issue.

* Update tensorflow_backend.py

fix pep8 check failure

* Update tensorflow_backend.py

follow fchollet review comment. use `if not`.
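
One possible shape of such a compatibility shim, sketched here as a hypothetical helper (the actual commit may wire it differently), for TF versions below r1.3 where Session.list_devices() does not exist:

def _list_devices(session):
    # Session.list_devices() is only available from TF r1.3 onward; older
    # versions fall back to device_lib, even though it spins up its own session.
    if not hasattr(session, 'list_devices'):
        from tensorflow.python.client import device_lib
        return device_lib.list_local_devices()
    return session.list_devices()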