Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jobs] two-hop file_mounts upload #4708

Merged
merged 4 commits into from
Feb 20, 2025
Merged

Conversation

cg505
Copy link
Collaborator

@cg505 cg505 commented Feb 13, 2025

Avoids the need for a cloud bucket for simple file mounts in managed jobs, by syncing files over two hops:

  1. First hop is local -> jobs controller
  2. Second hop is jobs controller -> managed job cluster

See translate_local_file_mounts_to_two_hop():

This strategy will upload the local files to the controller first, using
a normal rsync as part of sky.launch(). Then, when the controller launches
the task, it will also use local file_mounts from the destination path of
the first hop.

Local machine/API server        Controller              Job cluster
------------------------  -----------------------  --------------------
|      local path -----|--|-> controller path --|--|-> job dst path   |
------------------------  -----------------------  --------------------

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cg505 cg505 requested a review from Michaelvll February 13, 2025 02:40
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for quickly adding this @cg505! Would it be possible for us to add a smoke test for this?

Comment on lines 510 to 513
if os.path.isdir(path):
shutil.rmtree(path)
else:
os.remove(path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we catch any potential file not exist issue to avoid going into FAILED_CONTROLLER?

for task_ in dag.tasks:
controller_utils.maybe_translate_local_file_mounts_and_sync_up(
task_, task_type='jobs')
if storage_lib.get_cached_enabled_storage_clouds_or_refresh():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call can be moved outside the loop, i.e. only need to be called once.

'storage is available. Please specify local '
'file_mounts only.')

first_hop_file_mounts = (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable name is a bit confusing. How about user_file_mounts_to_controller?

@@ -119,6 +144,7 @@ def launch(
vars_to_fill = {
'remote_user_yaml_path': remote_user_yaml_path,
'user_yaml_path': f.name,
'two_hop_file_mounts': first_hop_file_mounts,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'two_hop_file_mounts': first_hop_file_mounts,
'user_file_mounts_to_controller': user_file_mounts_to_controller,

Comment on lines 678 to 681
Local machine/API server Controller Job cluster
------------------------ ----------------------- --------------------
| local path -----|--|-> controller path --|--|-> job dst path |
------------------------ ----------------------- --------------------
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor

Suggested change
Local machine/API server Controller Job cluster
------------------------ ----------------------- --------------------
| local path -----|--|-> controller path --|--|-> job dst path |
------------------------ ----------------------- --------------------
Local machine/API server Controller Job cluster
------------------------ ----------------------- --------------------
| local path --|--|-> controller path --|--|-> job dst path |
------------------------ ----------------------- --------------------

Comment on lines 673 to 676
This strategy will upload the local files to the controller first, using
a normal rsync as part of sky.launch(). Then, when the controller launches
the task, it will also use local file_mounts from the destination path of
the first hop.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This strategy will upload the local files to the controller first, using
a normal rsync as part of sky.launch(). Then, when the controller launches
the task, it will also use local file_mounts from the destination path of
the first hop.
This strategy will upload the local files to the controller first, using
a normal rsync as part of sky.launch() for the controller. Then, when the
controller launches the task, it will also use local file_mounts from the
destination path of the first hop.

@Michaelvll
Copy link
Collaborator

Michaelvll commented Feb 13, 2025

We can probably hold off this PR from being merged until the client-server branch is merged. Just to make a clean cut for the different functionality and easier for maintanence later. : )

@Michaelvll Michaelvll deleted the branch skypilot-org:master February 16, 2025 22:03
@Michaelvll Michaelvll closed this Feb 16, 2025
@cg505 cg505 reopened this Feb 18, 2025
@cg505 cg505 changed the base branch from client-server to master February 18, 2025 19:27
cg505 and others added 2 commits February 19, 2025 14:46
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
@cg505
Copy link
Collaborator Author

cg505 commented Feb 19, 2025

/quicktest-core

@cg505
Copy link
Collaborator Author

cg505 commented Feb 19, 2025

/smoke-tests --managed-jobs

@cg505 cg505 requested a review from Michaelvll February 20, 2025 01:03
@cg505
Copy link
Collaborator Author

cg505 commented Feb 20, 2025

Would it be possible for us to add a smoke test for this?

It's very difficult because we need the jobs controller to disable all the clouds except k8s, but it probably already exists.

@cg505 cg505 merged commit 65ac50e into skypilot-org:master Feb 20, 2025
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants