Falcon-40b build causing memory leaks and failure #226

Closed
rohithkrn opened this issue Nov 1, 2023 · 4 comments

@rohithkrn

The Falcon-40B build is failing due to what looks like a memory leak: I monitored CPU memory usage, and it climbs to a peak, after which the build fails. I am on an AWS g5.48xlarge EC2 instance with 768 GB of RAM.

Command:

python3 build.py \
    --model_dir /workspace/falcon/tiiuae-falcon-40b \
    --dtype float16 \
    --use_inflight_batching \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --new_decoder_architecture \
    --use_layernorm_plugin float16 \
    --enable_context_fmha \
    --remove_input_padding \
    --paged_kv_cache \
    --strongly_typed \
    --output_dir /tmp/tensorrtllm/falcon40b/1 \
    --max_output_len 256 \
    --max_input_len 512 \
    --max_batch_size 128 \
    --enable_debug_output \
    --tp_size 4 \
    --pp_size 1 \
    --world_size 4 \
    --parallel_build

Log

[11/01/2023-04:48:30] [TRT-LLM] [I] ========================================= Build Arguments ==========================================
[11/01/2023-04:48:30] [TRT-LLM] [I]  - world_size....................: 4
[11/01/2023-04:48:30] [TRT-LLM] [I]  - tp_size.......................: 4
[11/01/2023-04:48:30] [TRT-LLM] [I]  - pp_size.......................: 1
[11/01/2023-04:48:30] [TRT-LLM] [I]  - model_dir.....................: /workspace/falcon/tiiuae-falcon-40b
[11/01/2023-04:48:30] [TRT-LLM] [I]  - dtype.........................: float16
[11/01/2023-04:48:30] [TRT-LLM] [I]  - timing_cache..................: model.cache
[11/01/2023-04:48:30] [TRT-LLM] [I]  - log_level.....................: info
[11/01/2023-04:48:30] [TRT-LLM] [I]  - vocab_size....................: 65024
[11/01/2023-04:48:30] [TRT-LLM] [I]  - n_layer.......................: 60
[11/01/2023-04:48:30] [TRT-LLM] [I]  - n_positions...................: 2048
[11/01/2023-04:48:30] [TRT-LLM] [I]  - n_embd........................: 8192
[11/01/2023-04:48:30] [TRT-LLM] [I]  - n_head........................: 128
[11/01/2023-04:48:30] [TRT-LLM] [I]  - n_kv_head.....................: 8
[11/01/2023-04:48:30] [TRT-LLM] [I]  - mlp_hidden_size...............: None
[11/01/2023-04:48:30] [TRT-LLM] [I]  - max_batch_size................: 128
[11/01/2023-04:48:30] [TRT-LLM] [I]  - max_input_len.................: 512
[11/01/2023-04:48:30] [TRT-LLM] [I]  - max_output_len................: 256
[11/01/2023-04:48:30] [TRT-LLM] [I]  - max_beam_width................: 1
[11/01/2023-04:48:30] [TRT-LLM] [I]  - use_gpt_attention_plugin......: float16
[11/01/2023-04:48:30] [TRT-LLM] [I]  - bias..........................: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - parallel_attention............: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - new_decoder_architecture......: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - alibi.........................: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - logits_dtype..................: float32
[11/01/2023-04:48:30] [TRT-LLM] [I]  - use_gemm_plugin...............: float16
[11/01/2023-04:48:30] [TRT-LLM] [I]  - use_layernorm_plugin..........: float16
[11/01/2023-04:48:30] [TRT-LLM] [I]  - parallel_build................: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - enable_context_fmha...........: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - enable_context_fmha_fp32_acc..: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - visualize.....................: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - load_by_shard.................: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - enable_debug_output...........: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - gpus_per_node.................: 8
[11/01/2023-04:48:30] [TRT-LLM] [I]  - builder_opt...................: None
[11/01/2023-04:48:30] [TRT-LLM] [I]  - output_dir....................: /tmp/tensorrtllm/falcon40b/1
[11/01/2023-04:48:30] [TRT-LLM] [I]  - remove_input_padding..........: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - strongly_typed................: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - enable_fp8....................: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - quantized_fp8_model_path......: None
[11/01/2023-04:48:30] [TRT-LLM] [I]  - fp8_kv_cache..................: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - use_inflight_batching.........: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - paged_kv_cache................: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - tokens_per_block..............: 64
[11/01/2023-04:48:30] [TRT-LLM] [I]  - max_num_tokens................: None
[11/01/2023-04:48:30] [TRT-LLM] [I]  - use_custom_all_reduce.........: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - quant_mode....................: 0
[11/01/2023-04:48:30] [TRT-LLM] [I] ====================================================================================================
[11/01/2023-04:48:30] [TRT-LLM] [W] Parallelly build TensorRT engines. Please make sure that all of the 4 GPUs are totally free.
[11/01/2023-04:48:44] 
WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.

[11/01/2023-04:48:44] 
WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.

[11/01/2023-04:48:44] 
WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.

[11/01/2023-04:48:44] 
WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.


Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/trt-llm/tensorrt_llm/examples/falcon/build.py", line 563, in <module>
    mp.spawn(build, nprocs=args.world_size, args=(args, ))
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 145, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with signal SIGKILL
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
@rohithkrn
Author

It seems like it's failing while loading the model from HF.
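
For context, here is a back-of-the-envelope estimate of the host memory needed during the HF load. This is only a rough sketch, assuming each of the 4 spawned ranks materializes the full checkpoint independently, which is what the four "Loading checkpoint shards" bars in the log suggest:

# Approximate host-memory footprint of loading Falcon-40B once per rank.
n_params = 41.3e9   # Falcon-40B parameter count (approximate)
ranks = 4           # --world_size 4 with --parallel_build

for dtype_name, bytes_per_param in [("float16", 2), ("float32", 4)]:
    per_rank_gb = n_params * bytes_per_param / 1024**3
    print(f"{dtype_name}: ~{per_rank_gb:.0f} GB per rank, "
          f"~{per_rank_gb * ranks:.0f} GB across {ranks} ranks")

# float16: ~77 GB per rank (~308 GB total); float32: ~154 GB per rank (~615 GB total),
# before counting any temporary copies made while splitting weights for TP=4.
# That can push a 768 GB host into the kernel OOM killer, which would explain the
# SIGKILL in the traceback above.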

@rohithkrn
Author

Passing --load_by_shard worked.
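
For reference, the working invocation is presumably the original command with --load_by_shard appended (flag name as printed in the build-arguments table above), which is intended to load the checkpoint shard by shard instead of materializing the whole model in host memory per rank:

python3 build.py \
    --model_dir /workspace/falcon/tiiuae-falcon-40b \
    --dtype float16 \
    --use_inflight_batching \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --new_decoder_architecture \
    --use_layernorm_plugin float16 \
    --enable_context_fmha \
    --remove_input_padding \
    --paged_kv_cache \
    --strongly_typed \
    --output_dir /tmp/tensorrtllm/falcon40b/1 \
    --max_output_len 256 \
    --max_input_len 512 \
    --max_batch_size 128 \
    --enable_debug_output \
    --tp_size 4 \
    --pp_size 1 \
    --world_size 4 \
    --parallel_build \
    --load_by_shard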

@Shixiaowei02
Collaborator

Thank you for the support from AWS; our colleagues will follow up on this issue.

@juney-nvidia juney-nvidia added the triaged Issue has been triaged by maintainers label Nov 1, 2023
@byshiue
Collaborator

byshiue commented Apr 2, 2024

Close this bug. Reopen if needed.

@byshiue byshiue closed this as completed Apr 2, 2024