Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request]: Address corner cases of when venv cannot be created #26792

Closed
1 of 15 tasks
tvalentyn opened this issue May 19, 2023 · 11 comments
Closed
1 of 15 tasks

[Feature Request]: Address corner cases of when venv cannot be created #26792

tvalentyn opened this issue May 19, 2023 · 11 comments
Assignees
Labels
done & done Issue has been reviewed after it was closed for verification, followups, etc. new feature P2 python

Comments

@tvalentyn
Copy link
Contributor

What would you like to happen?

#16658 made a change to Python SDK harness container boot sequence to launch SDK processes in separately created virtual environments.

It appears that the venv dependency is sometimes not available on non-beam Python container images. Users who supply custom containers may run into errors when python3-venv is not installed, and need to install it separately, which is inconvenient.

Creating a venv is not strictly required on some runners, therefore #26753 changed the behavior to use global environment where venv was not available.

There is a concern that falling back to global environment may have adverse effects on the group of users which benefitted from the separate venv, see: #26778 (comment) .

Possible failure modes:

  • there is a hypothetical one-time flake in creating a venv, and global environment becomes polluted with pipeline dependencies, potentially having side-effects on pipelines running on the same cluster, and using packages from the global environment.
  • less hypothetical scenario: Flink users decide to use a custom container, that doesn't include venv, and now they silently start running into issues like Multiple jobs running on Flink session cluster reuse the persistent Python environment. #21123.

Possible avenues to address :

  • make it explicit whether venv should be used (either opt-in or opt-out).
  • make venv a requirement for Beam containers.
  • Determine at runtime whether venv is definitely necessary, and pipeline should fail (some Flink usecses), or definitely unncecessary (Dataflow usecases) and execution can continue in default environment.
  • Give a clear error when venv not available: "venv is required but not installed in this execution environment. If you wish to disable venv, set RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1 environment variable. If you use a custom container, set ENV RUN_ ...=1

cc: @phoerious (who was working on #21123).

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@phoerious
Copy link
Contributor

phoerious commented May 19, 2023

I don't see why it shouldn't be a dependency. I would assume that in most cases where it's not installed it's simply because it's not needed otherwise.

An explicit config flag would be fine, too. Such a flag should cause Beam to install venv if it's not there.

@tvalentyn
Copy link
Contributor Author

tvalentyn commented May 19, 2023

As I understand, an explicit config flag would mean that Flink users who need separate environments would need to explicitly enable it. Note that it would be a change in behavior. @phoerious Do you have an idea how frequently this is necessary? If not very common, then it's ok to make it opt in.

@tvalentyn
Copy link
Contributor Author

I would assume that in most cases where it's not installed it's simply because it's not needed otherwise

I think when it's not installed, it's because some distros try to save on the size of Python installation and try to reduce which modules are installed. It doesn't make sense to me. venv seems like a very basic and necessary functionality nowadays.

@tvalentyn
Copy link
Contributor Author

As I understand, an explicit config flag would mean that Flink users who need separate environments would need to explicitly enable it. Note that it would be a change in behavior. @phoerious Do you have an idea how frequently this is necessary? If not very common, then it's ok to make it opt in.

Specifically, in your use case, you'd have to set some environment variable in the Flink environment, such as:

ENV RUN_PYTHON_SDK_IN_VIRTUAL_ENVIRONMENT=1

@tvalentyn
Copy link
Contributor Author

just trying to think through what reasonable defaults should be.
Another idea, we can set the RUN_PYTHON_SDK_IN_VIRTUAL_ENVIRONMENT=1 in default Beam image.

But people who use a custom image with added boot.go on top of it, wouldn't inherit this setting.

@tvalentyn
Copy link
Contributor Author

@phoerious in your case, does the SDK harness run in Beam container or as a standalone process? In the latter case, setting the environment variable in Beam container wouldn't suffice.

@tvalentyn
Copy link
Contributor Author

cc: @robertwb

@tvalentyn tvalentyn self-assigned this May 19, 2023
@phoerious
Copy link
Contributor

Its a separate Beam container.

If you make it an option, I think it should be opt-out and on by default. Also I don't think adding venv by default adds much to the overall size.

@tvalentyn tvalentyn added this to the 2.48.0 Release milestone May 19, 2023
@tvalentyn
Copy link
Contributor Author

I spot checked several container images. Whenever python was provided, venv was also available.

I did find that canonical ubuntu os images (https://ubuntu.com/server/docs/cloud-images/google-cloud-engine) included python but not venv, but didn't find any container image that had similar issue.

This said and given the feedback so far, I'll keep the venv creation enabled by default, with opt out capability, and preserve the prior behavior where pipelines fail when venv not available, but with additional logging that shows how to disable separate venvs.

@riteshghorse
Copy link
Contributor

Moving this to 2.49 if there is any more work we want to do for this feature
Let me know if that is okay

@tvalentyn tvalentyn added this to the 2.48.0 Release milestone May 30, 2023
@tvalentyn
Copy link
Contributor Author

all work is completed with 2.48.0

@tvalentyn tvalentyn added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label Jun 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
done & done Issue has been reviewed after it was closed for verification, followups, etc. new feature P2 python
Projects
None yet
Development

No branches or pull requests

3 participants