
chore(docs): fix typos #1939

Closed
wants to merge 2 commits
docs/source/advanced/batch-manager.md (4 additions & 4 deletions)
@@ -4,7 +4,7 @@

TensorRT-LLM relies on a component, called the Batch Manager, to support
in-flight batching of requests (also known in the community as continuous
-batching or iteration-level batching). That technique that aims at reducing
+batching or iteration-level batching). That technique aims at reducing
wait times in queues, eliminating the need for padding requests and allowing
for higher GPU utilization.

@@ -119,15 +119,15 @@ When using V1 batching, the following additional statistics are reported per V1

### Logits Post-Processor (optional)

-Users can alter the logits produced the network, with a callback attached to an `InferenceRequest`:
+Users can alter the logits produced by the network, with a callback attached to an `InferenceRequest`:

```
using LogitsPostProcessor = std::function<TensorPtr(RequestIdType, TensorPtr&, BeamTokens const&, TStream const&)>;
```

The first argument is the request id, the second is the logits tensor, the third is the tokens produced by the request so far, and the last is the operation stream used by the logits tensor.

-Users *must* use the stream to access the logits tensor. For example, performing a addition with a bias tensor should be enqueued on that stream.
+Users *must* use the stream to access the logits tensor. For example, performing an addition with a bias tensor should be enqueued on that stream.
Alternatively, users may call `stream->synchronize()`; however, that will slow down the entire execution pipeline.

Note: this feature isn't supported with the `V1` batching scheme for the moment.
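
For illustration, here is a minimal sketch of a callback matching the signature above (the callback body and the `enqueueBiasAdd` helper are hypothetical; the point is only that all work on the logits tensor is enqueued on the provided stream):

```
LogitsPostProcessor addBias = [](RequestIdType /*requestId*/, TensorPtr& logits,
                                 BeamTokens const& /*tokens*/, TStream const& stream) -> TensorPtr
{
    // A real post-processor would enqueue its work on `stream` here, e.g. a
    // hypothetical enqueueBiasAdd(logits, biasTensor, stream). Calling
    // stream->synchronize() instead would also be correct but stalls the pipeline.
    (void) stream;
    return logits;
};
```
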
@@ -244,5 +244,5 @@ results.

A Triton Inference Server C++ backend is provided with TensorRT-LLM that
includes the mechanisms needed to serve models using in-flight batching. That
-backend is also a good starting example how to implement in-flight batching using
+backend is also a good starting example of how to implement in-flight batching using
the TensorRT-LLM batch manager.
docs/source/advanced/expert-parallelism.md (2 additions & 2 deletions)
@@ -25,6 +25,6 @@ When both Tensor Parallel and Expert Parallel are enabled, each GPU handles a po

The default parallel pattern is Tensor Parallel. You can enable Expert Parallel or hybrid parallel by setting `--moe_tp_size` and `--moe_ep_size` when calling `convert_checkpoint.py`. If only `--moe_tp_size` is provided, TRT-LLM will use Tensor Parallel for the MoE model; if only `--moe_ep_size` is provided, TRT-LLM will use Expert Parallel; if both are provided, hybrid parallel will be used.

-Be sure that the product of `moe_tp_size` and `moe_ep_size` should equal to `tp_size`, since the total number of MoE paralleism across all GPUs must match the total number of parallelism in other parts of the model.
+Ensure that the product of `moe_tp_size` and `moe_ep_size` equals `tp_size`, since the total MoE parallelism across all GPUs must match the parallelism used in the other parts of the model.

-The other parameters related to MoE structure, such as `num_experts_per_tok` (TopK in previous context), and `num_local_experts`, can be find in the model’s configuration file, such as the one for [Mixtral 8x7B model](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json).
+The other parameters related to the MoE structure, such as `num_experts_per_tok` (the TopK mentioned earlier) and `num_local_experts`, can be found in the model’s configuration file, such as the one for the [Mixtral 8x7B model](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json).
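
As a concrete illustration (the checkpoint directory and output path are placeholders; only the parallelism flags come from the text above), a hybrid-parallel conversion for `tp_size = 4` could split the MoE layers into 2-way Tensor Parallel times 2-way Expert Parallel, so that 2 x 2 = 4 satisfies the constraint:

```
python convert_checkpoint.py --model_dir ./Mixtral-8x7B-v0.1 \
                             --output_dir ./tllm_ckpt_mixtral \
                             --tp_size 4 \
                             --moe_tp_size 2 \
                             --moe_ep_size 2
```
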
docs/source/advanced/gpt-attention.md (1 addition & 1 deletion)
@@ -140,7 +140,7 @@ TensorRT-LLM supports in-flight batching of requests (also known as continuous
batching or iteration-level batching) for higher serving throughput. With this feature,
sequences in context phase can be processed together with sequences in
generation phase. The purpose of that technique is to better interleave
-requests to reduce latency as well as make better use the of the GPUs.
+requests to reduce latency as well as make better use of the GPUs.
For efficiency reasons (1), the support for inflight batching ***requires the
input tensors to be packed (no padding)***.
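
As a small illustration of what packing means (token values and lengths are made up), three requests of lengths 3, 1 and 2 are concatenated along a single dimension and their lengths are passed separately, instead of padding every request to the longest sequence:

```
#include <cstdint>
#include <vector>

// Illustration only: with padding these requests would occupy a [3, 3] tensor
// with three wasted slots; packing stores just the 6 real tokens.
std::vector<int32_t> const packedTokens{
    11, 12, 13, // request 0 (length 3)
    21,         // request 1 (length 1)
    31, 32      // request 2 (length 2)
};
std::vector<int32_t> const inputLengths{3, 1, 2};
```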

docs/source/advanced/gpt-runtime.md (2 additions & 2 deletions)
@@ -177,10 +177,10 @@ sequences. If both `topK` and `topP` are zero, greedy search is performed.
longer sequences in beam-search (the log-probability of a sequence will be
penalized by a factor that depends on `1.f / (length ^ lengthPenalty)`). The
default value is `0.f`,
-* `earlyStopping`, a integer value that controls whether the generation process
+* `earlyStopping`, an integer value that controls whether the generation process
finishes once `beamWidth` sentences are generated (that is, they end with `end_token`).
Default value `1` means `earlyStopping` is enabled, value `0` means `earlyStopping`
-is disable, other values means the generation process is depended on
+is disabled, and any other value means the generation process depends on
`length_penalty`.
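
A short sketch of how these settings fit together (the struct below is a stand-in whose fields are named after the parameters described above, not the actual TensorRT-LLM class):

```
#include <cstdint>
#include <optional>
#include <vector>

// Stand-in only: mirrors the beam-search knobs described above.
struct SamplingConfigSketch
{
    int32_t beamWidth{1};                               // scalar: one width for all inputs
    std::optional<std::vector<float>> lengthPenalty;    // 0.f leaves sequence scores unpenalized
    std::optional<std::vector<int32_t>> earlyStopping;  // 1 = stop once beamWidth sentences finish
};

SamplingConfigSketch makeBeamSearchConfig()
{
    SamplingConfigSketch config;
    config.beamWidth = 4;                           // beam search with 4 beams per request
    config.lengthPenalty = std::vector<float>{0.f}; // keep the default (no length penalty)
    config.earlyStopping = std::vector<int32_t>{1}; // stop once 4 sentences end with end_token
    return config;
}
```
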
The `beamWidth` parameter is a scalar value. It means that in this release of
TensorRT-LLM, it is not possible to specify a different width for each input