
HPU support #3378

Open · IlyasMoutawwakil wants to merge 116 commits into base: main
Conversation

@IlyasMoutawwakil (Member):

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guidelines, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@muellerzr (Collaborator) left a comment:

Thanks a bunch! From the accelerate side this looks fine; do you want to do big model inference while you're at it? (If not, all good.)

I assume `accelerate test` etc. went well? 🤗

@muellerzr muellerzr requested a review from SunMarc February 5, 2025 21:38
@IlyasMoutawwakil IlyasMoutawwakil marked this pull request as draft February 6, 2025 07:50
@IlyasMoutawwakil (Member, Author):

@muellerzr I forgot to mark it as a draft 😅.
I'm still debugging some issues on both the accelerate and optimum-habana sides; it should be ready by next week.

@SunMarc (Member) left a comment:

Nice, thanks for the PR! +1 for big model inference as well, possibly in a follow-up PR.

@muellerzr (Collaborator):

No worries @IlyasMoutawwakil! Just let us know when you're all set to go 🫡

@@ -28,7 +28,7 @@ test_big_modeling:

test_core:
python -m pytest -s -v ./tests/ --ignore=./tests/test_examples.py --ignore=./tests/deepspeed --ignore=./tests/test_big_modeling.py \
-    --ignore=./tests/fsdp --ignore=./tests/test_cli.py $(if $(IS_GITHUB_CI),--report-log "$(PYTORCH_VERSION)_core.log",)
+    --ignore=./tests/fsdp --ignore=./tests/tp --ignore=./tests/test_cli.py $(if $(IS_GITHUB_CI),--report-log "$(PYTORCH_VERSION)_core.log",)
@IlyasMoutawwakil (Member, Author):

Not sure TP should be part of test_core; tell me if you want me to revert this.

@SunMarc (Member):

Yeah, I don't think we want that. cc @muellerzr

@regisss (Contributor) left a comment:

LGTM!

@@ -295,7 +298,7 @@ class FSDPIntegrationTest(TempDirTestCase):

     def setUp(self):
         super().setUp()
-        self.performance_lower_bound = 0.82
+        self.performance_lower_bound = 0.82 if torch.cuda.is_available() else 0.70
@regisss (Contributor):

I guess 0.70 is specific to HPU? Maybe this should rather be `0.70 if is_hpu_available() else 0.82`?
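
A minimal sketch of what that suggestion could look like in the test setup (the import paths are assumptions, not taken from the diff):

```python
from accelerate.test_utils.testing import TempDirTestCase  # assumed import path
from accelerate.utils import is_hpu_available  # assumed helper location


class FSDPIntegrationTest(TempDirTestCase):
    def setUp(self):
        super().setUp()
        # Key the looser bound off HPU availability rather than the absence of CUDA.
        self.performance_lower_bound = 0.70 if is_hpu_available() else 0.82
```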

@IlyasMoutawwakil IlyasMoutawwakil marked this pull request as ready for review February 25, 2025 09:46
@SunMarc (Member) left a comment:

Thanks for this nice integration. Left a couple of comments


Comment on lines 1524 to 1525
else:
    device_ids, output_device = [self.device.index], self.device.index
@SunMarc (Member):

Why?

@IlyasMoutawwakil (Member, Author):

Ah yes, this should only be done in the case of HPU. With HPU, each process accesses its respective device as hpu:0; device allocation for processes is managed at a lower level. @regisss
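
A minimal sketch of the behavior being described, as a hypothetical helper (not the PR's actual code; the name is illustrative):

```python
import torch


def ddp_device_kwargs(device: torch.device) -> dict:
    # Each HPU process acquires exactly one card and always sees it as index 0,
    # since the card-to-process assignment happens at a lower level; other
    # backends address their card by its local index.
    index = 0 if device.type == "hpu" else device.index
    return {"device_ids": [index], "output_device": index}
```

These kwargs would then be forwarded to `torch.nn.parallel.DistributedDataParallel`.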

Comment on lines +1900 to +1914
if self.device.type == "hpu":
    # This env variable is initialized here to make sure it is set to "true"
    # It should be done by the launcher but it does not work for multi-node runs
    os.environ["DEEPSPEED_USE_HPU"] = "true"

    # This should be verified in the config validation and not here
    if (
        self.deepspeed_config["zero_optimization"].get("offload_optimizer", {}).get("device", "none")
        != "none"
        and os.environ.get("PT_HPU_LAZY_MODE", "1") == "1"
    ):
        raise ValueError(
            "You can't use an Offload Optimizer with HPU in Lazy Mode. "
            "Please set the environment variable `PT_HPU_LAZY_MODE` to `0`."
        )
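
As the inline comment in the diff itself notes, this check could live in config validation instead; a hypothetical standalone version might look like this (the function name and placement are assumptions):

```python
import os


def validate_hpu_deepspeed_config(deepspeed_config: dict) -> None:
    # Optimizer offload on HPU requires eager mode, i.e. PT_HPU_LAZY_MODE=0.
    offload_device = (
        deepspeed_config.get("zero_optimization", {})
        .get("offload_optimizer", {})
        .get("device", "none")
    )
    if offload_device != "none" and os.environ.get("PT_HPU_LAZY_MODE", "1") == "1":
        raise ValueError(
            "You can't use an Offload Optimizer with HPU in Lazy Mode. "
            "Please set the environment variable `PT_HPU_LAZY_MODE` to `0`."
        )
```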

@@ -453,6 +471,13 @@ def require_torchdata_stateful_dataloader(test_case):
)(test_case)


def launches_subprocesses(test_case):
@SunMarc (Member):

I think it would be better to rename it `run_first` and explain in the docstring in which cases this might be useful.
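
One possible shape for that rename, with the reasoning spelled out in the docstring (this sketch assumes the pytest-order plugin; the PR may implement it differently):

```python
import pytest


def run_first(test_case):
    """Run this test before the others collected in the session.

    Useful for tests that launch subprocesses (e.g. distributed runs), which can
    be disturbed by state left behind by earlier tests, such as busy ports,
    environment variables, or device handles.
    """
    return pytest.mark.order("first")(test_case)
```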

@require_huggingface_suite
@require_multi_device
@slow
def test_train_multiple_models(self):
    self.test_file_path = self.test_scripts_folder / "test_ds_multiple_model.py"
    args = ["--num_processes=2", "--num_machines=1", "--main_process_port=0", str(self.test_file_path)]
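
For context on the `--main_process_port=0` argument above: port 0 conventionally asks the OS for any free port, which helps concurrently launched distributed tests avoid port collisions. A small standalone illustration of that OS behavior (not code from the PR):

```python
import socket


def get_free_port() -> int:
    # Binding to port 0 lets the OS pick an unused port; reading the socket name
    # back shows which port was assigned.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```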

On the same `performance_lower_bound` change quoted earlier:
@SunMarc (Member):

+1 for regis suggestion

Comment on lines +742 to +746

if is_hpu_available():
    device_map = {"linear1": "cpu", "linear2": "disk", "batchnorm": "cpu", "linear3": 0, "linear4": 0}
else:
    device_map = {"linear1": "cpu", "linear2": "disk", "batchnorm": "cpu", "linear3": 0, "linear4": 1}
@SunMarc (Member):

Is it because you don't have access to multiple HPUs?

@IlyasMoutawwakil (Member, Author) commented on Feb 25, 2025:

One process can only acquire one HPU device, and within the context of that process, the id of that device is 0. If you put something on 'hpu:1', you just get a warning and the tensor is placed on 'hpu:0'.
But it seems like 'hpu:x' will be supported at some point (based on a comment in the optimum-habana repo), @regisss?

@regisss (Contributor):

Hmm yeah, I guess XD
But there is no ETA AFAIK, so I wouldn't expect to be able to use hpu:x in the near future.
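
A tiny illustration of the behavior Ilyas describes (this assumes a Gaudi machine with `habana_frameworks` installed; the exact warning text is not guaranteed):

```python
import torch
import habana_frameworks.torch.core  # noqa: F401  # registers the "hpu" device

# Requesting a non-zero HPU index from a process that has acquired a single card
# is expected to warn and fall back to hpu:0.
t = torch.ones(2, 2, device="hpu:1")
print(t.device)  # expected: hpu:0
```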
