
Variable batch size and LR scheduler (moved to #7104) #7020

Closed
wants to merge 3,319 commits into from

Conversation

bm-synth
Contributor

@bm-synth bm-synth commented Feb 8, 2025

NOTE: PR closed and moved to #7104.

Background and rationale

In many use cases, particularly LLMs, one is faced with inputs (sentences) of variable lengths. A common practice is to pack batches by token count (not by a fixed batch size), i.e. by putting together sentences whose given metric (e.g. sequence length) adds up to a user-provided value. As an example, in Attention is all you need, section 5.1:

Sentence pairs were batched together by approximate sequence length. Each training
batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000
target tokens.

Dynamic batch sizes have been requested in DeepSpeed issue 1051, DeepSpeed issue 3455, PyTorch Lightning issue 16914 and Hugging Face issue 2647, and are already available in many libraries, e.g. NVIDIA Triton and Meta FairSeq (implementation here).

The immediate use case for this is when one needs to maximize GPU utilization. Moreover, this is particularly relevant for curriculum learning, where a BxTxE (Batch x Time x Embedding) shaped input should ideally have a high B and low T at the early curriculum steps (many short sentences packed together as a batch), and a low B and high T at the late steps (few long sentences in the batch). A dynamic size T is already supported by DeepSpeed, e.g. in the documentation for pipeline parallelism's reset_activation_shape():

For curriculum learning that changes the seqlen of each sample, we need to call this whenever the seqlen is going to change.

However, a dynamic B is not supported. A dynamic B would require an adequate increase/decrease of the learning rate. This technique has been applied previously, and the two most common LR scaling algorithms have been described as follows (a small sketch in code follows the list):

  1. Linear Scaling Rule: "When the minibatch size is multiplied by k, multiply the learning rate by k", as in Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al.
  2. Square Root scaling: "when multiplying the batch size by k, multiply the learning rate by √k, to keep the variance in the gradient expectation constant", as in One weird trick for parallelizing convolutional neural networks, A. Krizhevsky.
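
As a minimal illustration of these two rules (the helper name and signature below are illustrative, not the exact `scale_lr` implemented in this PR):

```python
import math

def scale_lr(base_lr: float, batch_size: int, base_batch_size: int,
             method: str = "linear") -> float:
    """Scale base_lr (defined for base_batch_size) to the current batch_size.
    Illustrative sketch only; the PR's own helper may differ in signature."""
    k = batch_size / base_batch_size
    if method == "linear":   # Goyal et al.
        return base_lr * k
    if method == "sqrt":     # Krizhevsky
        return base_lr * math.sqrt(k)
    return base_lr           # scaling disabled

# Matches the illustration further below: reference lr=1e-3 at train_batch_size=2.
assert math.isclose(scale_lr(1e-3, batch_size=10, base_batch_size=2), 5e-3)
assert math.isclose(scale_lr(1e-3, batch_size=4, base_batch_size=2), 2e-3)
```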

In practice, the user picks the total token count per batch as the metric that drives batching, instead of batching by sentence count. During runtime, the variable batch size is computed and the LR is adjusted accordingly, based on the reference LR and batch size provided in the config.

Illustration of dynamic batch size, sequence length and LR

Imagine we picked a limit of 30 tokens per batch, and have set a reference lr=1e-3 for a train_batch_size=2 (in the DeepSpeed config). The batching algorithm for curriculum learning may pack the data into batches of short sentences (left) at the early stages, and batches of long sentences (right) at later stages, e.g.:

dynamic_batch_size_and_lr

Above, we collected samples until we filled up the batch with at most 30 tokens. The batch sizes (number of samples) then became 10 and 4 in the left and right examples, respectively. Using the linear scaling rule, the LR for those batches becomes 5e-3 and 2e-3, respectively.

Pipeline parallelism

Pipeline parallelism requires the same batch size and the same sequence length across all micro-batches in a batch, as the activation sizes must be fixed between gradient accumulation steps. Between batches these may change, as long as engine.reset_activation_shape() is called so that the new shapes are communicated on the first gradient accumulation step of the batch. Enforcing a similar BxTxE shape between batches may lead to smaller micro-batches. As an example, below is an illustration of a 2-node, 2-gradient-accumulation-step (i.e. 4 micro-batches) batching of the same dataset, when preparing data for regular DDP (left) and for the pipeline parallelism use case (right):
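
As a schematic sketch (not the exact training loop in this PR), the call to reset_activation_shape() sits between batches whose BxTxE shape differs; `engine`, `data_iter`, `batch_shapes` and `num_steps` are placeholders:

```python
# Schematic only: engine is a DeepSpeed pipeline engine, data_iter yields the
# micro-batches, and batch_shapes is a hypothetical per-batch BxTxE lookup.
prev_shape = None
for step in range(num_steps):
    shape = batch_shapes[step]           # BxTxE of the upcoming batch
    if prev_shape is not None and shape != prev_shape:
        # New activation shapes are re-communicated on the first
        # gradient accumulation step of this batch.
        engine.reset_activation_shape()
    loss = engine.train_batch(data_iter=data_iter)
    prev_shape = shape
```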

dynamic_batch_size_and_lr_microbatching

We can see that the pipeline use case (right) has the same BxTxE shape across all 4 micro-batches in the same batch; in order to respect that, it packs fewer samples into each batch than the standard use case (left-hand side).

Attention Head

For an input of size BxTxE, the attention has a mask of shape TxT when the mask is of fixed size across samples of the same size, or BxTxT for a different mask per sample (when samples have different sizes, as in the dataset above). This 3D attention matrix can be illustrated for DDP micro-batch 1 (picture above, top-left, 4 sentences) as:

dynamic_batch_size_and_lr_attn_matrix

Note the memory savings: the attention head has a size of BxTxT, i.e. a linear memory dependency on the batch size B and a quadratic memory dependency on the largest sequence length T in the (micro-)batch. Thus, supporting a dynamic size T allows for an increase of B.
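
For concreteness, a BxTxT padding mask like the one illustrated above could be built as follows (the sequence lengths are made up for the example):

```python
import torch

seqlens = torch.tensor([5, 2, 4, 3])      # illustrative lengths of 4 variable-length sentences
B, T = len(seqlens), int(seqlens.max())
valid = torch.arange(T)[None, :] < seqlens[:, None]   # BxT: True on real (non-padded) tokens
attn_mask = valid[:, None, :] & valid[:, :, None]     # BxTxT: one TxT mask per sample
print(attn_mask.shape)                    # torch.Size([4, 5, 5])
```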

PR overview

This PR implements dynamic batching and LR scaling. The necessary dataloader and LR scheduler can be retrieved by calling get_dataloader_and_lr_scheduler_for_variable_batch_size. A short explanation of that function follows (a usage sketch comes after the list):

  • The logic behind the algorithms for LR scaling is in scale_lr;
  • The partitioning of samples into batches is done by batch_by_seqlen.
  • For pipeline parallelism, all micro-batches in a pipeline pass are required to have the same activation shapes. This is enforced by setting the following parameters to True:
    • required_microbatches_of_same_sizes, which forces the B dimension to be the same across all gradient accumulation steps of all dataloaders in a batch;
    • required_microbatches_of_same_lengths, which forces the T dimension to be the same across all gradient accumulation steps. It works by calling the user-provided sample_padding_fn(sentence, len) that pads a given sentence to the argument length;
    • batch_by_seqlen returns microbatch_sample_ids (the list of sample ids per micro-batch), batch_sizes (the effective batch size of each batch), and batch_max_seqlens (the longest sequence across all micro-batches in a batch);
  • dataloader_for_variable_batch_size relies on microbatch_sample_ids and will iterate/collate/pad samples for every batch and return a dataloader that iterates the final (variable-size) batches;
  • lr_scheduler_for_variable_batch_size relies on batch_sizes to compute the learning rate for each effective batch, taking into account the reference batch size and LR in the config file, the size of each effective batch, and the scaling rule mentioned above (linear, square root, etc.).
    • Note that the returned lr_scheduler accepts either:
      1. a user-provided Optimizer, in which case it scales the learning rates (in the param groups) at every batch, or
      2. a user-defined LRScheduler, in which case it first gets the learning rate from that scheduler and then scales it accordingly.
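
A hedged usage sketch follows, mirroring the call/return pattern shown in the Future work section below; `...` stands for the dataset-specific arguments, which are not spelled out here:

```python
# Sketch only: the exact keyword arguments of the helper are not shown here.
engine, _, _, _ = deepspeed.initialize(config=config, model=model)
dataloader, lr_scheduler, _ = get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed(...)
engine.lr_scheduler = lr_scheduler

for inputs, labels in dataloader:   # batches have a variable size B
    loss = engine(inputs, labels)   # schematic: the model's forward returns a loss
    engine.backward(loss)
    engine.step()                   # the scheduler scales the LR for this batch's size
```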

Example

An example for the use case with and without pipelining is provided in deepspeed/runtime/data_pipeline/data_sampling/variable_batch_size_and_lr_example.py. The example shows an attention head with a variable-sized BxTxT attention per batch, followed by a fixed-size feed-forward network; these are the main blocks in a Large Language Model. The feed-forward (or linear) layer that follows the attention head requires a constant input size, equivalent to the largest sentence in the whole dataset, so the output of the attention must be padded (see feedforward: needs to convert BxTxE to BxMxE by padding extra tokens in the code).
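
That padding step could look roughly like the following (a sketch, not the example's exact code; the helper name and the value of M are made up):

```python
import torch
import torch.nn.functional as F

def pad_to_max_seqlen(x: torch.Tensor, max_seqlen: int) -> torch.Tensor:
    """Illustrative: zero-pad a BxTxE attention output to BxMxE (M = max_seqlen)."""
    B, T, E = x.shape
    # F.pad pads from the last dim backwards: (E_left, E_right, T_left, T_right)
    return F.pad(x, (0, 0, 0, max_seqlen - T))

# e.g. a micro-batch of 4 sentences of length 5, dataset-wide max length M = 20
out = pad_to_max_seqlen(torch.randn(4, 5, 64), max_seqlen=20)
print(out.shape)  # torch.Size([4, 20, 64])
```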

The output of the variable_batch_size_and_lr_example is the following:

Config

The example file also documents the relevant DeepSpeed config with comments:

config = {
  "train_batch_size": 16,
  # `train_micro_batch_size_per_gpu` tells how many sequence packs of `max_tokens` each will be collated together.
  #  I.e. the number of tokens per micro-batch (i.e. per gpu iteration) is `train_micro_batch_size_per_gpu` * `max_tokens`.
  "train_micro_batch_size_per_gpu": 2,
  "data_efficiency": {
    "enabled": True,
    # seed to be applied to all data efficiency modules, including dynamic batching
    "seed": 42,
    "data_sampling": {
      "num_workers": 0, # dataloader num_workers argument
      "pin_memory": False,  # dataloader pin_memory argument
      "dynamic_batching": {
        # enables or disables dynamic batching
        "enabled": True,
        # how many tokens we need to fill a pack of sequences (that will be collated together as a sample)
        "max_tokens": 100,
        # Path used to read the length of every sequence from, or to write it to.
        # Sequence lengths will be loaded from: {metrics_path}/seqlen/seqlen_sample_to_metric.bin and *.idx
        # If the files don't exist, they'll be computed and saved on the first run, and loaded on subsequent runs.
        "metrics_path": "./curriculum_output/",
        # As the batch size increases/decreases, which method to use to scale the LR accordingly?
        # Options: linear, sqrt (square root), or None to disable
        "lr_scaling_method": "linear",
        # how to pick sentences to be packed into samples:
        # - dataloader: by same order as they come in with the dataloader
        # - seqlen: by sequence length (shortest to longest)
        # - random: random order using the seed in config['data_efficiency']['seed']
        "sentence_picking_order": "dataloader",  # "random" / "seqlen" / "dataloader"
        # minimum number of sequences required to reach `max_tokens`. If a sentence pack is smaller, it's discarded.
        "min_batch_size": 1,
        # maximum number of sequences allowed to reach `max_tokens`. If a sentence pack is larger, it's discarded.
        "max_batch_size": 10,
        # enable the output of microbatching information about sentence packing
        "verbose": True,
      },
    },
  },
}

Future work

A follow-up PR will enable dynamic batching when calling deepspeed.initialize. I.e. instead of this:

engine, _, _, _ = deepspeed.initialize(config=config, model=model)
dataloader, lr_scheduler, _ = get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed(...)
engine.lr_scheduler = lr_scheduler

we'd ideally have this:

engine, _, dataloader, lr_scheduler = deepspeed.initialize(config=config, model=model)

where initialize will internally call get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed.

YizhouZ and others added 30 commits February 8, 2025 23:03
We use a simple model + DeepSpeed ZeRO-3 + torch.compile and count graph break numbers to demonstrate the current status of combining DeepSpeed + torch.compile.

---------

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
…ch workflow triggers (deepspeedai#6584)

Changes from deepspeedai#6472 caused the no-torch workflow that is an example of
how we build the DeepSpeed release package to fail (so we caught this
before a release, see more in deepspeedai#6402). These changes also copy the style
used to include torch in other accelerator op_builder implementations,
such as npu
[here](https://github.com/microsoft/DeepSpeed/blob/master/op_builder/npu/fused_adam.py#L8)
and hpu
[here](https://github.com/microsoft/DeepSpeed/blob/828ddfbbda2482412fffc89f5fcd3b0d0eba9a62/op_builder/hpu/fused_adam.py#L15).

This also updates the no-torch workflow to run on all changes to the
op_builder directory. The test runs quickly and shouldn't add any
additional testing burden there.

Resolves: deepspeedai#6576
Fixes deepspeedai#6585
Use shell=True for subprocess.check_output() in case of ROCm commands.
Do not use shlex.split() since command string has wildcard expansion.

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
SD workflow needed updates when we moved to pydantic 2 support that was
never added before.

Passing nv-sd workflow
[here](https://github.com/microsoft/DeepSpeed/actions/runs/11239699283)
Llama3.2-11b and llama3.2-90b include a vision model and a text model; these two models have different num_kv_heads, so we need to set num_kv_heads dynamically.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Disable `steps_per_print` by default.
This PR addresses deepspeedai#5818.
Instead of contiguous numbers based on the device count, this PR uses
device indices in `--include`.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
cc @tohtana

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
This PR mainly handles all places where InferenceBuilder is used to access any op or a specific implementation for an op. Instead, an op is defined and its proper implementation is picked internally, so the usage is transparent to the user.
What was done in the PR:
1) Added missing ops (added a py file with a fallback mechanism)
2) Added missing fallback implementations for existing ops
3) Removed all usages of builder.load and replaced them with ops instead
4) Added a workspace op and inferenceContext, which contains all workspace-related functions; inferenceContext is the python fallback of inferenceContext in CUDA
5) A small change to the softmax_context signature to fit the fallback signature.

---------

Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
…eedai#6611)

HF accelerate fixes implemented in
huggingface/accelerate#3145 mean that we no
longer need to pin the Accelerate version!

nv-lightning tests now run on Ubuntu 20.04+, so we support node >16 and can remove the explicit permissions for that in the env config.
Modified _replace_module in auto_tp.py: the modification keeps the layers 'shared_expert_gate' and 'gate' in qwen2-moe as the original type torch.nn.Linear instead of changing them into LinearLayer. In this way, their weights will not be split across multiple HPU/GPU cards, and qwen2-moe can run on multiple HPU/GPU cards. Since the weights of 'gate' are not split across multiple HPU/GPU cards, all-gather operations are not needed, which may improve performance.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.2
Author           - @jomayeri

Co-authored-by: jomayeri <jomayeri@users.noreply.github.com>
Parameters prefetched by ZeRO3 are sometimes not used. This occurs when
the actual sub-module execution differs from previous tracing. As a
result, the state of the allgather handle for such a parameter remains
`INFLIGHT`, causing functions like `empty_partition_cache` to detect it
and throw an error.
This PR resolves the issue by ensuring that communication finishes and
the parameters are freed.

As this issue was mentioned in deepspeedai#6011, this includes the change of the
branch. We need to merge deepspeedai#6011 first.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Restores the functionality of the CPU locked tensor in the AIO library. Makes the async_io operator available for the CPU accelerator, i.e., a CPU-only environment.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
…eepspeedai#6541)

Setting global variables during training creates graph breaks when using torch.compile (reading global variables doesn't). This commit attempts to reduce the setting of global variables in the checkpointing flows.
There are 2 main uses of setting global variables:
1. Share data between functions
2. Establish that this is the first call to the code

For most of the cases, the data in the global variables can be computed on demand or set once in an initial state in a configure function.
For the "check that this is the first run" use case, the code was moved to the configure function.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds an API `deepspeed.runtime.zero.offload_states.get_state_devices`, which gets devices of offload states as suggested in this [comment](deepspeedai#6011 (comment)).

We could lift this up to `deepspeed.utils` but would need to resolve a
circular import: User code -> `deepspeed.utils` ->
`deepspeed.utils.offload_states` -> `deepspeed.runtime.zero` ->
`deepspeed.runtime.zero.partition_parameters` -> `deepspeed.utils`

This will require a significant refactoring as long as we have
`OffloadStateTypeEnum` in `deepspeed.runtime.zero`.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Tests with `reuse_dist_env = True` often cause memory leaks. This PR
ignores `reuse_dist_env` and forcibly sets it to `False`. This change
might slow down the tests, but I think it is better to manually restart
runners and relaunch tests.

Memory usages (See deepspeedai#6578):
- `reuse_dist_env == True`:
https://github.com/microsoft/DeepSpeed/actions/runs/11302940871/job/31439471512
- `reuse_dist_env == False`:
https://github.com/microsoft/DeepSpeed/actions/runs/11303250613/job/31440137894
This PR extends deepspeedai#6570 by
showing a breakdown of graph breaks. So we can see how graph breaks are
distributed among different reasons. An example of graph break output
can be seen from the following workflow run
https://github.com/microsoft/DeepSpeed/actions/runs/11199157962

This patch fixes issue deepspeedai#4460.
When `btl_tcp_if_include` option is provided through `--launcher_args`,
we use the provided option instead of the hardcoded `--mca
btl_tcp_if_include eth0`. Otherwise we use `--mca btl_tcp_if_include
eth0` as the default for compatibility.

Fixes deepspeedai#4460

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Some (not all) of the LR schedulers in runtime were missing the
initialization of the optimizer group lr.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
## Feature

This commit implements the following features:

- [x] support saving checkpoint as safetensors (more commonly used
format)
- [x] support sharding checkpoints (which is important for very large
models)

Most of the code is borrowed from
https://github.com/huggingface/transformers/blob/v4.45.1/src/transformers/modeling_utils.py#L2490

## Usage

For `pytorch_model.bin` export
```
python zero_to_fp32.py . output_dir/
```

For  `model.safetensors` export
```
python zero_to_fp32.py . output_dir/ --safe_serialization
```

---------

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
…eepspeedai#6496)

Adds an option to disable calls to the logger while compiling, to avoid graph breaks. Here I used an environment variable to determine whether to activate this option, but it can also be determined using the json config file or any other way you see fit.

---------

Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
The error in the following log suggests that the cache file for HF model
list can be broken:

https://github.com/microsoft/DeepSpeed/actions/runs/11343665365/job/31546708118?pr=6614

The actual cause of the above error is unclear, but `_hf_model_list`
potentially breaks the cache file when it is concurrently called from
multiple processes. This PR locks the cache file to ensure
`_hf_model_list` safely reads and writes the file.
loadams and others added 7 commits February 24, 2025 06:01
Signed-off-by: Logan Adams <loadams@microsoft.com>
Following changes in the PyTorch trace rules, my previous PR to avoid graph breaks caused by the logger is no longer relevant. So instead I've added this functionality to torch dynamo -
pytorch/pytorch@16ea0dd
This commit allows the user to configure torch to ignore logger methods and avoid the associated graph breaks.

To ignore logger methods:
os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1"
To ignore logger methods except for specific methods (for example, info and isEnabledFor):
os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1"
and os.environ["LOGGER_METHODS_TO_EXCLUDE_FROM_DISABLE"] = "info, isEnabledFor"

Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il>
Co-authored-by: snahir <snahir@habana.ai>
The partition tensor doesn't need to move to the current device when
meta load is used.

Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
…t` (deepspeedai#7069)

With future changes coming to pip/python/etc., we need to modify our build to no longer call `python setup.py ...` and replace it instead:
https://packaging.python.org/en/latest/guides/modernize-setup-py-project/#should-setup-py-be-deleted


![image](https://github.com/user-attachments/assets/ea39ef7b-3cbe-4916-86f0-bc46a5fce96d)

This means we need to install the build package which is added here as
well.

Additionally, we pass the `--sdist` flag to only build the sdist rather
than the wheel as well here.

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
Add deepseekv3 autotp.

Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
@loadams
Collaborator

loadams commented Feb 26, 2025

Thanks @bm-synth - can you take a look at the formatting and DCO errors?

Hi @bm-synth - following up on the formatting fixes?

loadams and others added 7 commits February 26, 2025 15:56
Fixes: deepspeedai#7082

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
The latest transformers release causes failures in the cpu-torch-latest test, so we pin it for now to unblock other PRs.

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
…/runner (deepspeedai#7086)

Signed-off-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
These jobs haven't been run in a long time and were originally used when
compatibility with torch <2 was more important.

Signed-off-by: Logan Adams <loadams@microsoft.com>
@bm-synth bm-synth requested a review from hwchen2017 as a code owner March 2, 2025 19:24
@bm-synth
Contributor Author

bm-synth commented Mar 2, 2025

@loadams done. Fixed formatting workflow:

  • issues related to formatting fixed in ed1f78938bdca90ab11f8abb0a24c42c8745c6c5;
  • flake8 error W604 backticks are deprecated, use 'repr()' in f07c3c42caa6be9b53df92c9e6ac49f7eb7a9f5a;
  • check-torchcuda pre-commit hook error related to calling torch.cuda instead of get_accelerator() in b4a08b6c44bb7db8fd8d257564b4ca394232bbf6;

Ran:

  pip show pre-commit clang-format
  pre-commit run --all-files

All pre-commit hooks passed.

@bm-synth bm-synth requested a review from jomayeri as a code owner March 2, 2025 19:57
bm-synth added 3 commits March 2, 2025 20:02
…nstead of 'get_accelerator()'

Signed-off-by: Bruno Magalhaes <bruno.magalhaes@synthesia.io>
…Speed into variable_batch_size_and_lr

Signed-off-by: Bruno Magalhaes <bruno.magalhaes@synthesia.io>
@bm-synth bm-synth force-pushed the variable_batch_size_and_lr branch from 477733f to 4e20c47 Compare March 2, 2025 20:07
@bm-synth bm-synth closed this Mar 3, 2025
@bm-synth bm-synth force-pushed the variable_batch_size_and_lr branch from 4e20c47 to e08d369 Compare March 3, 2025 02:11
bm-synth added a commit to bm-synth/DeepSpeed that referenced this pull request Mar 3, 2025
Signed-off-by: Bruno Magalhaes <bruno.magalhaes@synthesia.io>
@bm-synth
Contributor Author

bm-synth commented Mar 3, 2025

@loadams to comply with the DCO I ended up in a rabbit hole of debugging when I had to git rebase --signoff and solve hundreds of merge conflicts. So I just created a fresh PR for all this logic (with a single commit, signed) in #7104. Hope it's ok!

@bm-synth bm-synth changed the title Variable batch size and LR scheduler Variable batch size and LR scheduler (moved to #7104) Mar 3, 2025
@loadams
Collaborator

loadams commented Mar 3, 2025

@loadams to comply with the DCO I ended up in a rabbit hole of debugging when I had to git rebase --signoff and solve hundreds of merge conflicts. So I just created a fresh PR for all this logic (with a single commit, signed) in #7104. Hope it's ok!

Hi @bm-synth - that's fine! For future reference, if you forget the -s in git commit and the rebasing ends up causing issues, you can click on the failing DCO test and this screen should pop up:
image

You can manually set it to pass, giving the same approval as signing off with the -s.

@bm-synth
Contributor Author

bm-synth commented Mar 3, 2025

@loadams to comply with the DCO I ended up in a rabbit hole of debugging when I had to git rebase --signoff and solve hundreds of merge conflicts. So I just created a fresh PR for all this logic (with a single commit, signed) in #7104. Hope it's ok!

Hi @bm-synth - that's fine! For future reference, if you forget the -s in git commit and the rebasing ends up causing issues, you can click on the failing DCO test and this screen should pop up:

You can manually set it to pass, giving the same approval as signing off with the -s.

Hi @loadams thank you. That button doesn't show up on my side. But now I know that for future reference, I can just ask you to click it.

image
