Falcon-40b build causing memory leaks and failure #226

Closed
rohithkrn opened this issue Nov 1, 2023 · 4 comments

@rohithkrn

The Falcon-40B build is failing due to what looks like a memory leak: I monitored CPU memory usage, and it climbs to a peak, after which the build fails. I am on an AWS g5.48xlarge EC2 instance with 768 GB of RAM.

Command:

python3 build.py \
    --model_dir /workspace/falcon/tiiuae-falcon-40b \
    --dtype float16 \
    --use_inflight_batching \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --new_decoder_architecture \
    --use_layernorm_plugin float16 \
    --enable_context_fmha \
    --remove_input_padding \
    --paged_kv_cache \
    --strongly_typed \
    --output_dir /tmp/tensorrtllm/falcon40b/1 \
    --max_output_len 256 \
    --max_input_len 512 \
    --max_batch_size 128 \
    --enable_debug_output \
    --tp_size 4 \
    --pp_size 1 \
    --world_size 4 \
    --parallel_build

Log

[11/01/2023-04:48:30] [TRT-LLM] [I] ========================================= Build Arguments ==========================================
[11/01/2023-04:48:30] [TRT-LLM] [I]  - world_size....................: 4
[11/01/2023-04:48:30] [TRT-LLM] [I]  - tp_size.......................: 4
[11/01/2023-04:48:30] [TRT-LLM] [I]  - pp_size.......................: 1
[11/01/2023-04:48:30] [TRT-LLM] [I]  - model_dir.....................: /workspace/falcon/tiiuae-falcon-40b
[11/01/2023-04:48:30] [TRT-LLM] [I]  - dtype.........................: float16
[11/01/2023-04:48:30] [TRT-LLM] [I]  - timing_cache..................: model.cache
[11/01/2023-04:48:30] [TRT-LLM] [I]  - log_level.....................: info
[11/01/2023-04:48:30] [TRT-LLM] [I]  - vocab_size....................: 65024
[11/01/2023-04:48:30] [TRT-LLM] [I]  - n_layer.......................: 60
[11/01/2023-04:48:30] [TRT-LLM] [I]  - n_positions...................: 2048
[11/01/2023-04:48:30] [TRT-LLM] [I]  - n_embd........................: 8192
[11/01/2023-04:48:30] [TRT-LLM] [I]  - n_head........................: 128
[11/01/2023-04:48:30] [TRT-LLM] [I]  - n_kv_head.....................: 8
[11/01/2023-04:48:30] [TRT-LLM] [I]  - mlp_hidden_size...............: None
[11/01/2023-04:48:30] [TRT-LLM] [I]  - max_batch_size................: 128
[11/01/2023-04:48:30] [TRT-LLM] [I]  - max_input_len.................: 512
[11/01/2023-04:48:30] [TRT-LLM] [I]  - max_output_len................: 256
[11/01/2023-04:48:30] [TRT-LLM] [I]  - max_beam_width................: 1
[11/01/2023-04:48:30] [TRT-LLM] [I]  - use_gpt_attention_plugin......: float16
[11/01/2023-04:48:30] [TRT-LLM] [I]  - bias..........................: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - parallel_attention............: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - new_decoder_architecture......: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - alibi.........................: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - logits_dtype..................: float32
[11/01/2023-04:48:30] [TRT-LLM] [I]  - use_gemm_plugin...............: float16
[11/01/2023-04:48:30] [TRT-LLM] [I]  - use_layernorm_plugin..........: float16
[11/01/2023-04:48:30] [TRT-LLM] [I]  - parallel_build................: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - enable_context_fmha...........: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - enable_context_fmha_fp32_acc..: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - visualize.....................: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - load_by_shard.................: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - enable_debug_output...........: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - gpus_per_node.................: 8
[11/01/2023-04:48:30] [TRT-LLM] [I]  - builder_opt...................: None
[11/01/2023-04:48:30] [TRT-LLM] [I]  - output_dir....................: /tmp/tensorrtllm/falcon40b/1
[11/01/2023-04:48:30] [TRT-LLM] [I]  - remove_input_padding..........: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - strongly_typed................: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - enable_fp8....................: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - quantized_fp8_model_path......: None
[11/01/2023-04:48:30] [TRT-LLM] [I]  - fp8_kv_cache..................: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - use_inflight_batching.........: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - paged_kv_cache................: True
[11/01/2023-04:48:30] [TRT-LLM] [I]  - tokens_per_block..............: 64
[11/01/2023-04:48:30] [TRT-LLM] [I]  - max_num_tokens................: None
[11/01/2023-04:48:30] [TRT-LLM] [I]  - use_custom_all_reduce.........: False
[11/01/2023-04:48:30] [TRT-LLM] [I]  - quant_mode....................: 0
[11/01/2023-04:48:30] [TRT-LLM] [I] ====================================================================================================
[11/01/2023-04:48:30] [TRT-LLM] [W] Parallelly build TensorRT engines. Please make sure that all of the 4 GPUs are totally free.
[11/01/2023-04:48:44] 
WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.

[11/01/2023-04:48:44] 
WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.

[11/01/2023-04:48:44] 
WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.

[11/01/2023-04:48:44] 
WARNING: You are currently loading Falcon using legacy code contained in the model repository. Falcon has now been fully ported into the Hugging Face transformers library. For the most up-to-date and high-performance version of the Falcon model code, please update to the latest version of transformers and then load the model without the trust_remote_code=True argument.


Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/trt-llm/tensorrt_llm/examples/falcon/build.py", line 563, in <module>
    mp.spawn(build, nprocs=args.world_size, args=(args, ))
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 145, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 2 terminated with signal SIGKILL
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 4 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
@rohithkrn
Author

It seems like it's failing while loading the model from HF.
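
For context, here is a back-of-the-envelope estimate of the host memory needed during the HF load. This is only a rough sketch, assuming each of the 4 spawned ranks materializes the full checkpoint independently, which is what the four "Loading checkpoint shards" bars in the log suggest:

# Approximate host-memory footprint of loading Falcon-40B once per rank.
n_params = 41.3e9   # Falcon-40B parameter count (approximate)
ranks = 4           # --world_size 4 with --parallel_build

for dtype_name, bytes_per_param in [("float16", 2), ("float32", 4)]:
    per_rank_gb = n_params * bytes_per_param / 1024**3
    print(f"{dtype_name}: ~{per_rank_gb:.0f} GB per rank, "
          f"~{per_rank_gb * ranks:.0f} GB across {ranks} ranks")

# float16: ~77 GB per rank (~308 GB total); float32: ~154 GB per rank (~615 GB total),
# before counting any temporary copies made while splitting weights for TP=4.
# That can push a 768 GB host into the kernel OOM killer, which would explain the
# SIGKILL in the traceback above.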

@rohithkrn
Author

Passing --load_by_shard worked.
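
For reference, the working invocation is presumably the original command with --load_by_shard appended (flag name as printed in the build-arguments table above), which is intended to load the checkpoint shard by shard instead of materializing the whole model in host memory per rank:

python3 build.py \
    --model_dir /workspace/falcon/tiiuae-falcon-40b \
    --dtype float16 \
    --use_inflight_batching \
    --use_gpt_attention_plugin float16 \
    --use_gemm_plugin float16 \
    --new_decoder_architecture \
    --use_layernorm_plugin float16 \
    --enable_context_fmha \
    --remove_input_padding \
    --paged_kv_cache \
    --strongly_typed \
    --output_dir /tmp/tensorrtllm/falcon40b/1 \
    --max_output_len 256 \
    --max_input_len 512 \
    --max_batch_size 128 \
    --enable_debug_output \
    --tp_size 4 \
    --pp_size 1 \
    --world_size 4 \
    --parallel_build \
    --load_by_shard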

@Shixiaowei02
Collaborator

Thank you for the support from AWS; our colleagues will follow up on this issue.

@juney-nvidia juney-nvidia added the triaged Issue has been triaged by maintainers label Nov 1, 2023
@byshiue
Collaborator

byshiue commented Apr 2, 2024

Close this bug. Reopen if needed.

@byshiue byshiue closed this as completed Apr 2, 2024