HPU support #3378
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks a bunch! From the accelerate side this looks fine; do you want to do big model inference while you're at it? (If not, all good.) I assume accelerate test etc. went well? 🤗
@muellerzr I forgot to mark it as a draft 😅.
Nice, thanks for the PR! +1 for big model inference too, possibly in a follow-up PR.
No worries @IlyasMoutawwakil! Just let us know when you're all set to go 🫡
Makefile
@@ -28,7 +28,7 @@ test_big_modeling:

 test_core:
 	python -m pytest -s -v ./tests/ --ignore=./tests/test_examples.py --ignore=./tests/deepspeed --ignore=./tests/test_big_modeling.py \
-		--ignore=./tests/fsdp --ignore=./tests/test_cli.py $(if $(IS_GITHUB_CI),--report-log "$(PYTORCH_VERSION)_core.log",)
+		--ignore=./tests/fsdp --ignore=./tests/tp --ignore=./tests/test_cli.py $(if $(IS_GITHUB_CI),--report-log "$(PYTORCH_VERSION)_core.log",)
Not sure TP should be part of test_core; tell me if you want me to revert this.
yeah i don't think we want that cc @muellerzr
LGTM!
tests/fsdp/test_fsdp.py (outdated)
@@ -295,7 +298,7 @@ class FSDPIntegrationTest(TempDirTestCase):

     def setUp(self):
         super().setUp()
-        self.performance_lower_bound = 0.82
+        self.performance_lower_bound = 0.82 if torch.cuda.is_available() else 0.70
I guess 0.70 is specific to HPU? Maybe this should rather be 0.70 if is_hpu_available() else 0.82?
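A minimal sketch of that suggestion, assuming is_hpu_available is exposed from accelerate.utils by this PR (the import path is my assumption):

```python
from accelerate.utils import is_hpu_available  # assumed to be added by this PR

def setUp(self):
    super().setUp()
    # Gate on HPU explicitly rather than on the absence of CUDA, so other
    # non-CUDA backends keep the stricter 0.82 bound.
    self.performance_lower_bound = 0.70 if is_hpu_available() else 0.82
```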
Thanks for this nice integration. Left a couple of comments.
src/accelerate/accelerator.py (outdated)
    else:
        device_ids, output_device = [self.device.index], self.device.index
why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, this should only be done in the case of HPU. With HPU, every process sees its own device as hpu:0; device allocation across processes is managed at a lower level. @regisss
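For context, a hedged sketch of the guard being discussed (a paraphrase, not the final PR code; using self.local_process_index for the CUDA path is my assumption about how Accelerator.prepare_model indexes devices):

```python
import torch

# On HPU every process only sees its own device, exposed as hpu:0, so
# self.device.index is the right value to pass. On CUDA, the local process
# index selects the right GPU among those visible to the process.
if self.device.type == "hpu":
    device_ids, output_device = [self.device.index], self.device.index
else:
    device_ids, output_device = [self.local_process_index], self.local_process_index

model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=device_ids, output_device=output_device
)
```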
if self.device.type == "hpu":
    # This env variable is initialized here to make sure it is set to "true"
    # It should be done by the launcher but it does not work for multi-node runs
    os.environ["DEEPSPEED_USE_HPU"] = "true"

    # This should be verified in the config validation and not here
    if (
        self.deepspeed_config["zero_optimization"].get("offload_optimizer", {}).get("device", "none")
        != "none"
        and os.environ.get("PT_HPU_LAZY_MODE", "1") == "1"
    ):
        raise ValueError(
            "You can't use an Offload Optimizer with HPU in Lazy Mode. "
            "Please set the environment variable `PT_HPU_LAZY_MODE` to `0`."
        )
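As an aside, the user-side fix the error message asks for would look something like this (a sketch; that the Habana bridge reads the flag at import time is my assumption, so the variable is set before any Habana import):

```python
import os

# Switch HPU from lazy to eager mode so DeepSpeed optimizer offload works.
os.environ["PT_HPU_LAZY_MODE"] = "0"

# Import the Habana torch bridge only after setting the flag (assumption:
# the flag is read when the module is first imported).
import habana_frameworks.torch.core  # noqa: E402
```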
cc @muellerzr
src/accelerate/test_utils/testing.py (outdated)
@@ -453,6 +471,13 @@ def require_torchdata_stateful_dataloader(test_case):
     )(test_case)


+def launches_subprocesses(test_case):
I think it would be better to rename it run_first and add an explanation in the docstring of the cases in which this might be useful.
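A minimal sketch of the renamed decorator, assuming the pytest-order plugin is used to enforce ordering (one possible implementation, not necessarily what the PR ends up with):

```python
import pytest

def run_first(test_case):
    """
    Decorator marking a test that should run before the other tests in its
    module, e.g. because it launches subprocesses whose side effects (spawned
    workers, occupied devices) could interfere with later tests.

    Relies on the pytest-order plugin being installed.
    """
    return pytest.mark.order("first")(test_case)
```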
@require_huggingface_suite
@require_multi_device
@slow
def test_train_multiple_models(self):
    self.test_file_path = self.test_scripts_folder / "test_ds_multiple_model.py"
    args = ["--num_processes=2", "--num_machines=1", "--main_process_port=0", str(self.test_file_path)]
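The diff view truncates the test body; in accelerate's test suite such args are typically handed to a subprocess launcher. A hedged sketch of the continuation (not the exact PR code):

```python
from accelerate.test_utils import execute_subprocess_async

# Launch the script via `accelerate launch` in separate processes;
# --main_process_port=0 lets the launcher pick any free port.
cmd = ["accelerate", "launch"] + args
execute_subprocess_async(cmd)
```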
cc @muellerzr
tests/fsdp/test_fsdp.py (outdated): same setUp hunk as above.
+1 for Regis' suggestion
if is_hpu_available():
    device_map = {"linear1": "cpu", "linear2": "disk", "batchnorm": "cpu", "linear3": 0, "linear4": 0}
else:
    device_map = {"linear1": "cpu", "linear2": "disk", "batchnorm": "cpu", "linear3": 0, "linear4": 1}
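For context, a device_map like this is what Accelerate's big model inference utilities consume. A self-contained sketch, with a hypothetical toy module whose submodule names match the keys above:

```python
import torch.nn as nn
from accelerate import dispatch_model

# Hypothetical toy module matching the device_map keys above.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(4, 4)
        self.linear2 = nn.Linear(4, 4)
        self.batchnorm = nn.BatchNorm1d(4)
        self.linear3 = nn.Linear(4, 4)
        self.linear4 = nn.Linear(4, 5)

# device_map as in the snippet above (non-HPU variant shown here).
device_map = {"linear1": "cpu", "linear2": "disk", "batchnorm": "cpu", "linear3": 0, "linear4": 1}

# Place each submodule on its assigned device; "disk" entries are offloaded
# to offload_dir and loaded back lazily at forward time.
model = dispatch_model(ToyModel(), device_map=device_map, offload_dir="offload")
```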
Is it because you don't have access to multiple HPUs?
One process can only acquire one HPU device, and in the context of that process the id of that device is 0. If you put something on 'hpu:1' you just get a warning and the tensor is put on 'hpu:0'. But it seems like 'hpu:x' is going to be supported at some point (from a comment in the optimum-habana repo), @regisss?
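A small illustration of the single-device behavior described above (only runnable on a Gaudi machine with habana_frameworks installed; the import path is my assumption):

```python
import torch
import habana_frameworks.torch.core  # registers the "hpu" device type (assumption)

t = torch.ones(2, 2, device="hpu:1")  # on a single-device process this only warns
print(t.device)  # expected to print hpu:0, per the comment above
```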
Hmm yeah, I guess XD. But there is no ETA AFAIK, so I wouldn't expect to be able to use hpu:x in the near future.
What does this PR do?
Fixes # (issue)

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.

Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.