Merge branch 'master' into avoid-graph-break-caused-by-inner-classes
deepcharm authored Feb 27, 2025
2 parents 060a600 + c07b635 commit 3770a7b
Showing 2 changed files with 11 additions and 6 deletions.
3 changes: 2 additions & 1 deletion deepspeed/module_inject/auto_tp.py
@@ -137,7 +137,8 @@ def is_load_module(module):
         "LPLayerNorm", "SharedEmbedding", "OPTLearnedPositionalEmbedding", "LlamaRMSNorm", "FalconLinear",
         "MistralRMSNorm", "T5LayerNorm", "MixtralRMSNorm", "Phi3RotaryEmbedding", "Phi3SuScaledRotaryEmbedding",
         "Phi3RMSNorm", "YuanRMSNorm", "YuanRotaryEmbedding", "Phi3LongRoPEScaledRotaryEmbedding", "Qwen2RMSNorm",
-        "DeepseekV2RMSNorm", "DeepseekV2YarnRotaryEmbedding", "MoEGate"
+        "DeepseekV2RMSNorm", "DeepseekV3RMSNorm", "DeepseekV2YarnRotaryEmbedding", "DeepseekV3YarnRotaryEmbedding",
+        "MoEGate"
     ]
     return module.__class__ in load_layers or module._get_name() in load_layer_names

14 changes: 9 additions & 5 deletions docs/_tutorials/inference-tutorial.md
@@ -21,18 +21,22 @@ if args.pre_load_checkpoint:
     model = model_class.from_pretrained(args.model_name_or_path)
 else:
     model = model_class()
+
+# create the tokenizer
+tokenizer = model_class.from_pretrained(args.model_name_or_path)
 ...

 import deepspeed

 # Initialize the DeepSpeed-Inference engine
 ds_engine = deepspeed.init_inference(model,
-                                     tensor_parallel={"tp_size": 2},
-                                     dtype=torch.half,
-                                     checkpoint=None if args.pre_load_checkpoint else args.checkpoint_json,
-                                     replace_with_kernel_inject=True)
+                                     tensor_parallel={"tp_size": world_size},
+                                     dtype=torch.half,
+                                     checkpoint=None if args.pre_load_checkpoint else args.checkpoint_json,
+                                     replace_with_kernel_inject=True)
 model = ds_engine.module
-output = model('Input String')
+pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
+output = pipe('Input String')
 ```
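The updated snippet replaces the hard-coded `tp_size` of 2 with a `world_size` variable and routes generation through a Hugging Face pipeline. Below is a minimal, self-contained sketch of how that might look end to end, assuming the script is launched with the deepspeed launcher (which sets the `WORLD_SIZE` and `LOCAL_RANK` environment variables); the model choice and variable names here are illustrative, not part of the commit:

```python
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

import deepspeed

# The deepspeed launcher (e.g. `deepspeed --num_gpus 2 infer.py`) sets these
# environment variables for every process; default to a single process.
world_size = int(os.getenv("WORLD_SIZE", "1"))
local_rank = int(os.getenv("LOCAL_RANK", "0"))

# Illustrative model; the tutorial's model_class/args handling is omitted here.
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize the DeepSpeed-Inference engine with tensor parallelism spread
# across all launched processes.
ds_engine = deepspeed.init_inference(model,
                                     tensor_parallel={"tp_size": world_size},
                                     dtype=torch.half,
                                     replace_with_kernel_inject=True)
model = ds_engine.module

# Run generation through a Hugging Face pipeline, as in the updated tutorial.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=local_rank)
output = pipe("DeepSpeed is")
```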

To run inference with only model parallelism for models whose kernels we do not support, you can pass an injection policy that identifies the two specific linear layers on a Transformer encoder/decoder layer: 1) the attention output GeMM and 2) the layer output GeMM. We need these parts of the layer to add the required all-reduce communication between GPUs, which merges the partial results across model-parallel ranks. Below is an example that shows how you can use DeepSpeed-Inference with a T5 model:
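The T5 example itself is truncated in this diff view. As a sketch of what such an injection-policy call can look like, assuming the same launcher environment as above; the checkpoint name is illustrative, and the module names are the standard Hugging Face `T5Block` attributes:

```python
import os

import torch
from transformers import pipeline
from transformers.models.t5.modeling_t5 import T5Block

import deepspeed

world_size = int(os.getenv("WORLD_SIZE", "1"))
local_rank = int(os.getenv("LOCAL_RANK", "0"))

# Build a standard Hugging Face pipeline around a T5 checkpoint.
pipe = pipeline("text2text-generation", model="t5-small", device=local_rank)

# No custom kernels are injected here; the injection policy only tells
# DeepSpeed which output projections of each T5Block need an all-reduce:
# the attention output GeMMs and the feed-forward output GeMM.
pipe.model = deepspeed.init_inference(
    pipe.model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.float,
    injection_policy={T5Block: ('SelfAttention.o', 'EncDecAttention.o', 'DenseReluDense.wo')})

output = pipe("translate English to French: The DeepSpeed-Inference engine is ready.")
```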
