
chore(docs): fix typos #1939

Closed
wants to merge 2 commits
docs/source/advanced/batch-manager.md (4 additions & 4 deletions)
@@ -4,7 +4,7 @@

TensorRT-LLM relies on a component, called the Batch Manager, to support
in-flight batching of requests (also known in the community as continuous
-batching or iteration-level batching). That technique that aims at reducing
+batching or iteration-level batching). That technique aims at reducing
wait times in queues, eliminating the need for padding requests and allowing
for higher GPU utilization.

@@ -119,15 +119,15 @@ When using V1 batching, the following additional statistics are reported per V1

### Logits Post-Processor (optional)

-Users can alter the logits produced the network, with a callback attached to an `InferenceRequest`:
+Users can alter the logits produced by the network, with a callback attached to an `InferenceRequest`:

```
using LogitsPostProcessor = std::function<TensorPtr(RequestIdType, TensorPtr&, BeamTokens const&, TStream const&)>;
```

The first argument is the request id, the second is the logits tensor, the third is the tokens produced by the request so far, and the last is the operation stream used by the logits tensor.

-Users *must* use the stream to access the logits tensor. For example, performing a addition with a bias tensor should be enqueued on that stream.
+Users *must* use the stream to access the logits tensor. For example, performing an addition with a bias tensor should be enqueued on that stream.
Alternatively, users may call `stream->synchronize()`; however, that will slow down the entire execution pipeline.

Note: this feature isn't supported with the `V1` batching scheme for the moment.
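
For illustration, here is a minimal sketch of a callback matching the signature above (the callback body and the `enqueueBiasAdd` helper are hypothetical; the point is only that all work on the logits tensor is enqueued on the provided stream):

```
LogitsPostProcessor addBias = [](RequestIdType /*requestId*/, TensorPtr& logits,
                                 BeamTokens const& /*tokens*/, TStream const& stream) -> TensorPtr
{
    // A real post-processor would enqueue its work on `stream` here, e.g. a
    // hypothetical enqueueBiasAdd(logits, biasTensor, stream). Calling
    // stream->synchronize() instead would also be correct but stalls the pipeline.
    (void) stream;
    return logits;
};
```
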
@@ -244,5 +244,5 @@ results.

A Triton Inference Server C++ backend is provided with TensorRT-LLM that
includes the mechanisms needed to serve models using in-flight batching. That
-backend is also a good starting example how to implement in-flight batching using
+backend is also a good starting example of how to implement in-flight batching using
the TensorRT-LLM batch manager.
docs/source/advanced/expert-parallelism.md (2 additions & 2 deletions)
@@ -25,6 +25,6 @@ When both Tensor Parallel and Expert Parallel are enabled, each GPU handles a po

The default parallel pattern is Tensor Parallel. You can enable Expert Parallel or hybrid parallel by setting `--moe_tp_size` and `--moe_ep_size` when calling `convert_checkpoint.py`. If only `--moe_tp_size` is provided, TRT-LLM will use Tensor Parallel for the MoE model; if only `--moe_ep_size` is provided, TRT-LLM will use Expert Parallel; if both are provided, hybrid parallel will be used.

-Be sure that the product of `moe_tp_size` and `moe_ep_size` should equal to `tp_size`, since the total number of MoE paralleism across all GPUs must match the total number of parallelism in other parts of the model.
+Ensure that the product of `moe_tp_size` and `moe_ep_size` equals `tp_size`, since the total MoE parallelism across all GPUs must match the parallelism used in the other parts of the model.

-The other parameters related to MoE structure, such as `num_experts_per_tok` (TopK in previous context), and `num_local_experts`, can be find in the model’s configuration file, such as the one for [Mixtral 8x7B model](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json).
+The other parameters related to the MoE structure, such as `num_experts_per_tok` (the TopK mentioned earlier) and `num_local_experts`, can be found in the model’s configuration file, such as the one for the [Mixtral 8x7B model](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/blob/main/config.json).
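
As a concrete illustration (the checkpoint directory and output path are placeholders; only the parallelism flags come from the text above), a hybrid-parallel conversion for `tp_size = 4` could split the MoE layers into 2-way Tensor Parallel times 2-way Expert Parallel, so that 2 x 2 = 4 satisfies the constraint:

```
python convert_checkpoint.py --model_dir ./Mixtral-8x7B-v0.1 \
                             --output_dir ./tllm_ckpt_mixtral \
                             --tp_size 4 \
                             --moe_tp_size 2 \
                             --moe_ep_size 2
```
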
docs/source/advanced/gpt-attention.md (1 addition & 1 deletion)
@@ -140,7 +140,7 @@ TensorRT-LLM supports in-flight batching of requests (also known as continuous
batching or iteration-level batching) for higher serving throughput. With this feature,
sequences in context phase can be processed together with sequences in
generation phase. The purpose of that technique is to better interleave
-requests to reduce latency as well as make better use the of the GPUs.
+requests to reduce latency as well as make better use of the GPUs.
For efficiency reasons (1), the support for inflight batching ***requires the
input tensors to be packed (no padding)***.
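
As a small illustration of what packing means (token values and lengths are made up), three requests of lengths 3, 1 and 2 are concatenated along a single dimension and their lengths are passed separately, instead of padding every request to the longest sequence:

```
#include <cstdint>
#include <vector>

// Illustration only: with padding these requests would occupy a [3, 3] tensor
// with three wasted slots; packing stores just the 6 real tokens.
std::vector<int32_t> const packedTokens{
    11, 12, 13, // request 0 (length 3)
    21,         // request 1 (length 1)
    31, 32      // request 2 (length 2)
};
std::vector<int32_t> const inputLengths{3, 1, 2};
```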

docs/source/advanced/gpt-runtime.md (2 additions & 2 deletions)
@@ -177,10 +177,10 @@ sequences. If both `topK` and `topP` are zero, greedy search is performed.
longer sequences in beam-search (the log-probability of a sequence will be
penalized by a factor that depends on `1.f / (length ^ lengthPenalty)`). The
default value is `0.f`,
-* `earlyStopping`, a integer value that controls whether the generation process
+* `earlyStopping`, an integer value that controls whether the generation process
finishes once `beamWidth` sentences are generated (that is, they end with `end_token`).
Default value `1` means `earlyStopping` is enabled, value `0` means `earlyStopping`
-is disable, other values means the generation process is depended on
+is disabled, and any other value means the generation process depends on
`length_penalty`.
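
A short sketch of how these settings fit together (the struct below is a stand-in whose fields are named after the parameters described above, not the actual TensorRT-LLM class):

```
#include <cstdint>
#include <optional>
#include <vector>

// Stand-in only: mirrors the beam-search knobs described above.
struct SamplingConfigSketch
{
    int32_t beamWidth{1};                               // scalar: one width for all inputs
    std::optional<std::vector<float>> lengthPenalty;    // 0.f leaves sequence scores unpenalized
    std::optional<std::vector<int32_t>> earlyStopping;  // 1 = stop once beamWidth sentences finish
};

SamplingConfigSketch makeBeamSearchConfig()
{
    SamplingConfigSketch config;
    config.beamWidth = 4;                           // beam search with 4 beams per request
    config.lengthPenalty = std::vector<float>{0.f}; // keep the default (no length penalty)
    config.earlyStopping = std::vector<int32_t>{1}; // stop once 4 sentences end with end_token
    return config;
}
```
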
The `beamWidth` parameter is a scalar value. It means that in this release of
TensorRT-LLM, it is not possible to specify a different width for each input