Lower memory spike when loading with ISQ on CUDA #433

EricLBuehler · 2024-06-14T20:20:39Z

Currently, loading with ISQ causes a spike in GPU memory usage. This happens because tensors are copied to the GPU asynchronously and then quantized, so several unquantized copies can be resident at the same time. Forcing a synchronization between the GPU and the CPU ensures that each copy has completed and that operations do not overlap, which should reduce the spike.
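To illustrate the reasoning, here is a toy model of peak device memory. It is only a sketch under stated assumptions and does not call any GPU API or reproduce the code in this PR: `peak_memory`, its parameters, and the sizes are all hypothetical. It assumes that up to `max_in_flight` full-precision copies can land on the device before any of them is quantized, and that each full-precision copy is freed right after its quantized version is allocated. Setting `max_in_flight` to 1 stands in for synchronizing after every copy.

```rust
/// Toy model: peak device memory (arbitrary units) while loading tensors
/// that are quantized in place. Hypothetical helper, not part of mistral.rs.
///
/// * `full_sizes`: size of each tensor before quantization.
/// * `quantized_percent`: quantized size as a percentage of the full size.
/// * `max_in_flight`: how many full-precision copies may be resident before
///   quantization starts; 1 models a synchronization after every copy.
pub fn peak_memory(full_sizes: &[u64], quantized_percent: u64, max_in_flight: usize) -> u64 {
    assert!(max_in_flight >= 1, "at least one copy must be allowed in flight");

    let mut resident = 0u64; // bytes currently held on the (simulated) device
    let mut peak = 0u64; // highest value `resident` has reached

    for batch in full_sizes.chunks(max_in_flight) {
        // Every copy in the batch arrives before any quantization runs.
        for &size in batch {
            resident += size;
            peak = peak.max(resident);
        }
        // Quantize each tensor: allocate the output, then free the input.
        for &size in batch {
            resident += size * quantized_percent / 100;
            peak = peak.max(resident);
            resident -= size;
        }
    }
    peak
}
```

With four tensors of 100 units each and a quantized size of 25%, `peak_memory(&[100, 100, 100, 100], 25, 4)` gives 425, while `peak_memory(&[100, 100, 100, 100], 25, 1)` gives 200. The final footprint is 100 units in both cases; only the transient spike differs.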

github-actions · 2024-06-14T20:21:43Z

Code Metrics Report

===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    9           21           21            0            0
 Python                 31         1217         1038           37          142
 TOML                   16          440          400            1           39
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1135            0          836          299
 |- BASH                 5          100           97            0            3
 |- Python               6          122          110            0           12
 |- Rust                 2           80           72            3            5
 (Total)                           1437          279          839          319
-------------------------------------------------------------------------------
 Rust                  115        34379        31132          584         2663
 |- Markdown            57          643           13          596           34
 (Total)                          35022        31145         1180         2697
===============================================================================
 Total                 191        37668        32985         1458         3225
===============================================================================

Synchronize device to lower memory usage

564135b

EricLBuehler added 3 commits June 14, 2024 16:30

Limit thread count if on cuda

b5545fb

Add timing, tune thread count, and display info

83470fb

Add isq low memory env var

05aac64

EricLBuehler changed the title ~~Lower memory usage when loading with ISQ~~ Lower memory spike when loading with ISQ on CUDA Jun 14, 2024

EricLBuehler merged commit 6648673 into master Jun 14, 2024
11 checks passed

EricLBuehler deleted the isq_lower_mem_usage branch June 14, 2024 21:10

EricLBuehler mentioned this pull request Jun 14, 2024

dolphin-2.9-mixtral-8x22b.Q8_0.gguf "Error: cannot find tensor info for blk.0.ffn_gate.0.weight"? #352

Closed
