Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCloud tf-gpu Image and tfa compatability? #676

Closed
mwalmsley opened this issue Nov 6, 2019 · 9 comments
Closed

GCloud tf-gpu Image and tfa compatability? #676

mwalmsley opened this issue Nov 6, 2019 · 9 comments
Labels
bug Something isn't working build

Comments

@mwalmsley
Copy link

mwalmsley commented Nov 6, 2019

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04 within Docker on Mac
  • TensorFlow version and how it was installed (source or binary): 2.0-gpu via GCloud image (see below)
  • TensorFlow-Addons version and how it was installed (source or binary): 0.6.0 via pip
  • Python version: 3.5
  • Is GPU used? (yes/no): no

Describe the bug

I'm trying to use tfa on the GCloud tensorflow-gpu image via Docker and I'm having trouble getting it to import. The import fails with the error tensorflow.python.framework.errors_impl.NotFoundError: /root/miniconda3/lib/python3.5/site-packages/tensorflow_addons/custom_ops/activations/_activation_ops.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs.

Looks like some gnarly incompatibility between TF and TFA installs/versions, but the error is way above my head.

Code to reproduce the issue

Dockerfile:
FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-0
...
RUN pip install --upgrade pip
RUN pip install -r requirements.txt # including tensorflow-addons==0.6.0

Python:
import tensorflow_addons as tfa

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

2019-11-06 15:29:43.940868: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/root/miniconda3/lib/python3.5/site-packages/tensorflow_addons/init.py", line 21, in <module>
from tensorflow_addons import activations
File "/root/miniconda3/lib/python3.5/site-packages/tensorflow_addons/activations/init.py", line 21, in <module>
from tensorflow_addons.activations.gelu import gelu
File "/root/miniconda3/lib/python3.5/site-packages/tensorflow_addons/activations/gelu.py", line 25, in <module>
get_path_to_datafile("custom_ops/activations/_activation_ops.so"))
File "/root/miniconda3/lib/python3.5/site-packages/tensorflow_core/python/framework/load_library.py", line 61, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /root/miniconda3/lib/python3.5/site-packages/tensorflow_addons/custom_ops/activations/_activation_ops.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs
@mwalmsley
Copy link
Author

To investigate if this error was somehow related to the host machine, I re-ran the same container on a GCloud Compute instance with Ubuntu 18.04, P100 GPU and CUDA 10.0 installed. I get the same import error on that machine as my non-CUDA Mac desktop.

@seanpmorgan seanpmorgan added bug Something isn't working build labels Nov 6, 2019
@martin-gorner
Copy link

I saw the same error on Kaggle when running "pip install tensorflow-addons" in a non-GPU Kaggle notebook. If you enable the GPU, the installation succeeds, probably because the underlying image is different.

@seanpmorgan
Copy link
Member

Thanks for the info! Yes this is generally because the TensorFlow has a an ABI incompatibility with the way we package our pip wheel. This issue will one day be resolved by this RFC which is already underway:
tensorflow/community#133

In the mean time though we can look at that docker image and see how it was compiling TF. Does anyone know where we can access the Dockerfiles for these?

@mwalmsley
Copy link
Author

An update - I've changed my setup to use the Tensorflow/CUDA "Deep Learning VM" boot disk from GCP directly (i.e. without docker) and I get the same error. I assume the VM is created the same way as the Dockerfile.

(this was motivated by another op issue in TensorFlow itself, tensorflow/tensorflow#34112 - I'm not having much luck!)

@diesendruck
Copy link

Got the same error from a SageMaker instance with TF2.0.0.

@seanpmorgan
Copy link
Member

So I just wanted to comment on this so it's known that this is still being looked at:

At a high level the issue is that we can only upload a single package to pip, so we make that whl compatible with the pip version of tensorflow. Many docker images will choose to compile TF from source (or install from conda) so that TF will be most performant, but that means ABI incompatibilities with the whl we publish.

For example I checked gcr.io/deeplearning-platform-release/tf2-gpu.2-0 image and the default installed tensorflow uses CXX11_ABI_FLAG=1, but the tensorflow pip package / tfa pip package use CXX11_ABI_FLAG=0

docker run --rm gcr.io/deeplearning-platform-release/tf2-gpu.2-0 python -c "import tensorflow as tf; print(tf.sysconfig.CXX11_ABI_FLAG)"
>> 1

The best solution we can provide atm is that for custom TF installs (as found in these docker images) you'll want to compile TFA from source so that it matches the installed version of TF. You can see in our configure script that we compile TFA based on the installed TF compile flags. There are some other build issues we have (look for a refactor issue to be posted today/tomorrow), so this might still run into issues for the GPU version built from source.

Once we get ABI stable custom-ops (see RFC mentioned previously) this might be a non-issue, but at a minimum we would like it so compiling from source is always a workable solution.

@martin-gorner
Copy link

is the RFC #133 planned to be included in TF 2.1 ?

@seanpmorgan
Copy link
Member

is the RFC #133 planned to be included in TF 2.1 ?

Unfortunately no. It's being actively worked on and I believe a stable API has been built for CPU Only kernels but they're working on CUDA kernels.

@seanpmorgan
Copy link
Member

Consolidating in #987

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working build
Projects
None yet
Development

No branches or pull requests

4 participants