Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[k8s] Update the instance type format to support number only accelerator type #4756

Merged
merged 7 commits into from
Feb 20, 2025

Conversation

DanielZhangQD
Copy link
Contributor

@DanielZhangQD DanielZhangQD commented Feb 19, 2025

Update the instance type format for Kubernetes to support number-only accelerator type, which is reported in #4608.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Number-only accelerator type test
✗ python -m sky.cli launch -y -c mygpucluster --cloud k8s --gpus 4090:1 --cpus 1 --memory 6 --dryrun -- "nvidia-smi"

Command to run: nvidia-smi
Running on cluster: mygpucluster
D 02-20 10:15:11 optimizer.py:301] #### Task<name=sky-cmd>(run='nvidia-smi')
D 02-20 10:15:11 optimizer.py:301]   resources: Kubernetes(cpus=1, mem=6, {'4090': 1}) ####
D 02-20 10:15:11 optimizer.py:316] Defaulting the task's estimated time to 1 hour.
D 02-20 10:15:11 optimizer.py:338] resources: Kubernetes(1CPU--6GB--4090:1, cpus=1, mem=6, {'4090': 1})
D 02-20 10:15:11 optimizer.py:357]   estimated_runtime: 3600 s (1.0 hr)
D 02-20 10:15:11 optimizer.py:361]   estimated_cost (not incl. egress): $0.0
I 02-20 10:15:11 optimizer.py:884] Considered resources (1 node):
I 02-20 10:15:11 optimizer.py:952] -------------------------------------------------------------------------------------------------------
I 02-20 10:15:11 optimizer.py:952]  CLOUD        INSTANCE            vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
I 02-20 10:15:11 optimizer.py:952] -------------------------------------------------------------------------------------------------------
I 02-20 10:15:11 optimizer.py:952]  Kubernetes   1CPU--6GB--4090:1   1       6         4090:1         kind-skypilot   0.00          ✔
I 02-20 10:15:11 optimizer.py:952] -------------------------------------------------------------------------------------------------------
D 02-20 10:15:11 cloud_vm_ray_backend.py:4596] cluster_ever_up: False
D 02-20 10:15:11 cloud_vm_ray_backend.py:4597] record: None
D 02-20 10:15:11 utils.py:54] more than 1 credential file found
D 02-20 10:15:11 backend_utils.py:667] Using ssh_proxy_command: None
I 02-20 10:15:11 execution.py:323] Dryrun finished.
  • Letter prefixed accelerator type (for backward compatibility)
✗ python -m sky.cli launch -y -c mygpucluster --cloud k8s --gpus K80:1 --cpus 1 --memory 6 --dryrun -- "nvidia-smi"

Command to run: nvidia-smi
Running on cluster: mygpucluster
D 02-20 10:14:38 optimizer.py:301] #### Task<name=sky-cmd>(run='nvidia-smi')
D 02-20 10:14:38 optimizer.py:301]   resources: Kubernetes(cpus=1, mem=6, {'K80': 1}) ####
D 02-20 10:14:38 optimizer.py:316] Defaulting the task's estimated time to 1 hour.
D 02-20 10:14:38 optimizer.py:338] resources: Kubernetes(1CPU--6GB--K80:1, cpus=1, mem=6, {'K80': 1})
D 02-20 10:14:38 optimizer.py:357]   estimated_runtime: 3600 s (1.0 hr)
D 02-20 10:14:38 optimizer.py:361]   estimated_cost (not incl. egress): $0.0
I 02-20 10:14:38 optimizer.py:884] Considered resources (1 node):
I 02-20 10:14:38 optimizer.py:952] ------------------------------------------------------------------------------------------------------
I 02-20 10:14:38 optimizer.py:952]  CLOUD        INSTANCE           vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
I 02-20 10:14:38 optimizer.py:952] ------------------------------------------------------------------------------------------------------
I 02-20 10:14:38 optimizer.py:952]  Kubernetes   1CPU--6GB--K80:1   1       6         K80:1          kind-skypilot   0.00          ✔
I 02-20 10:14:38 optimizer.py:952] ------------------------------------------------------------------------------------------------------
D 02-20 10:14:38 cloud_vm_ray_backend.py:4596] cluster_ever_up: False
D 02-20 10:14:38 cloud_vm_ray_backend.py:4597] record: None
D 02-20 10:14:38 utils.py:54] more than 1 credential file found
D 02-20 10:14:38 backend_utils.py:667] Using ssh_proxy_command: None
I 02-20 10:14:38 execution.py:323] Dryrun finished.
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@DanielZhangQD
Copy link
Contributor Author

/smoke-test --kubernetes

@DanielZhangQD DanielZhangQD marked this pull request as ready for review February 20, 2025 02:32
@DanielZhangQD
Copy link
Contributor Author

Hi @Michaelvll, please help review this PR, thanks!

@Michaelvll
Copy link
Collaborator

/smoke-test --kubernetes

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this @DanielZhangQD ! Could we add a unit test for this newly added code as well?

@DanielZhangQD
Copy link
Contributor Author

Hi @Michaelvll, thanks for the reminder!
Sorry that I wrote UT cases for the regex patterns in a private file but did not migrate to the repo, and I have added UT cases with the following result:

✗ pytest -v --kubernetes tests/unit_tests/kubernetes/test_instance_type.py
========================================================================================== test session starts ==========================================================================================
platform darwin -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: /Users/matrixorigin/dan/go/src/github.com/skypilot-org/skypilot
configfile: pyproject.toml
plugins: env-1.1.5, time-machine-2.16.0, anyio-4.7.0, xdist-3.6.1
16 workers [1 item]
.
=========================================================================================== 1 passed in 5.32s ===========================================================================================

PTAL again, thanks!

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @DanielZhangQD! This is awesome! LGTM.

@Michaelvll Michaelvll merged commit 9515fc3 into skypilot-org:master Feb 20, 2025
18 checks passed
@DanielZhangQD DanielZhangQD deleted the fix/4608 branch February 21, 2025 01:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants