Scale buf_size linearly with n_ctx #213

Closed · wants to merge 1 commit

Conversation

@hx507 commented Mar 16, 2023

This appears to solve #153, where the error `ggml_new_tensor_impl: not enough space in the context's memory pool` is thrown in interactive mode when using a larger context size.

At least the out-of-memory error comes from the `ctx0` used here, although I am not familiar enough with the code base to tell whether this is indeed the cause.

@Green-Sky (Collaborator)

This definitely increases the memory in the right place.

@slaren (Member) commented Mar 17, 2023

This does not happen outside of interactive mode, so I don't think this is right. It would mean 1.5 GB more memory usage for a 2048-token context, so it is significant.

@Green-Sky (Collaborator) commented Mar 17, 2023

I looked at it more and I don't think this is the way to solve it. We should instead "determine the required inference memory per token" https://github.com/ggerganov/llama.cpp/blob/master/main.cpp#L891 so that the buffer can grow by itself dynamically https://github.com/ggerganov/llama.cpp/blob/master/main.cpp#L557 (a rough paraphrase of that guard is sketched below).

edit: I just saw that it is set, but only after the first run https://github.com/ggerganov/llama.cpp/blob/master/main.cpp#L749 , which means the reallocation logic has an error.

@slaren the 7B model does not run into this issue.
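
A minimal paraphrase of the reallocation guard referenced above (written from memory as a sketch, not a verbatim copy of main.cpp; the helper name is mine):

    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>

    // Roughly the shape of the scratch-buffer sizing in llama_eval: the first call runs
    // with mem_per_token == 0, so the resize check is skipped and the default buf_size
    // must be large enough on its own; later calls grow the buffer from the estimate.
    static bool ensure_eval_buffer(void ** buf, size_t * buf_size, size_t mem_per_token, int N) {
        if (mem_per_token > 0 && mem_per_token*N > *buf_size) {
            const size_t buf_size_new = 1.1*(mem_per_token*N); // headroom for ggml object overhead
            void * buf_new = std::realloc(*buf, buf_size_new);
            if (buf_new == nullptr) {
                std::fprintf(stderr, "failed to allocate %zu bytes\n", buf_size_new);
                return false;
            }
            *buf      = buf_new;
            *buf_size = buf_size_new;
        }
        return true;
    }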

@Green-Sky (Collaborator)

Edit again: after testing the 30B model again, I realized that the bug only happens in an interactive session.

@nonnull-ca

I've been hitting this with the 65B model in one-shot mode with a 2048-token context, so it's not just interactive sessions that are affected. It is fairly easy to reproduce with the following:

$ python3 -c "print('qy'*635)" > ./qy.txt
$ ./main -b 512 -t 16 -c 2048 -m ./ggml-model-q4_0.bin -f ./qy.txt >/dev/null

(Note: smaller batch sizes take longer before aborting, but still suffer the same issue.)

I suspect that the actual required buffer size is a linear function of the context length, but with a nonzero constant term.

Maybe add debug code to print out the actual high watermark at the end?

@Green-Sky (Collaborator)

I forgot to mention that I ran 30B with ctx2048 :)

@slaren (Member) commented Mar 18, 2023

Increasing batch size also makes llama.cpp run out of memory, so any solution that only considers the context size and not the batch size is likely wrong.

@nonnull-ca

> Increasing batch size also makes llama.cpp run out of memory, so any solution that only considers the context size and not the batch size is likely wrong.

Ah. That explains why I was seeing larger batch sizes OOM faster.

@slaren (Member) commented Mar 18, 2023

Thinking more about this, does the initial value of buf_size really matter, as long as it is big enough for the first dummy call to llama_eval? Afterwards the buffer is reallocated based on the estimated mem_per_token multiplied by the number of tokens (i.e. the batch size).

The "proper" solution would be to painstakingly analyze the code so that we can accurately predict how much memory the context needs, but that is not going to be easy, and it will break any time the code changes. Does it really matter anyway? Memory reserved for the context won't be committed to physical memory until the pages are touched. Alternatively, we could catch the out-of-memory errors from ggml and realloc the buffer as needed.
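
A rough sketch of that last alternative, assuming the eval path could be made to report pool exhaustion instead of aborting; both llama_eval_once and the retry wrapper below are hypothetical, not existing API:

    #include <cstddef>
    #include <cstdio>

    // Hypothetical: runs one evaluation and returns false if ggml ran out of space
    // in its memory pool for the given scratch buffer size.
    bool llama_eval_once(size_t buf_size);

    // Retry the evaluation with a growing scratch buffer until it fits (sketch only).
    bool llama_eval_with_retry(size_t & buf_size) {
        for (int attempt = 0; attempt < 8; ++attempt) {
            if (llama_eval_once(buf_size)) {
                return true;
            }
            // Pool was too small: grow by 50% and try again.
            buf_size += buf_size/2;
            std::fprintf(stderr, "eval ran out of scratch memory, retrying with %zu bytes\n", buf_size);
        }
        return false;
    }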

@Green-Sky (Collaborator) commented Mar 18, 2023

@slaren earlier I wrote:

> We should instead "determine the required inference memory per token" https://github.com/ggerganov/llama.cpp/blob/master/main.cpp#L891 so that the buffer can grow by itself dynamically https://github.com/ggerganov/llama.cpp/blob/master/main.cpp#L557
>
> edit: I just saw that it is set, but only after the first run https://github.com/ggerganov/llama.cpp/blob/master/main.cpp#L749 , which means the reallocation logic has an error.

@jarcen commented Mar 21, 2023

Throwing out some ideas about the actual reasons behind the bug. I think it's the classic integer-division gotcha:

https://github.com/ggerganov/llama.cpp/blob/8cf9f34eddc124d4ab28f4d2fe8e99d574510bde/main.cpp#L757-L758

If the batch size `N` is greater than 1, the fractional part is lost. Just after those lines I added:

    fprintf(stderr, "Used mem: %zu, predicted %zu\n", ggml_used_mem(ctx0), mem_per_token * N);

Then I tracked the calls after the first one, which sets the value:

Used mem: 13484368, predicted 14368644
Used mem: 13488464, predicted 14368644
Used mem: 57507344, predicted 57474576
Used mem: 115002912, predicted 114949152

As expected, actual memory usage differs from the prediction and is sometimes bigger. That's not good.
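
A standalone illustration of that truncation, using made-up numbers in the spirit of the logs (not the actual main.cpp values):

    #include <cstddef>
    #include <cstdio>

    int main() {
        const size_t used_mem = 14368647; // bytes used by the first eval (hypothetical)
        const int    N        = 4;        // batch size of that eval (hypothetical)

        // Integer division truncates, so the reconstructed prediction undershoots.
        const size_t mem_per_token = used_mem / N;      // 3592161, not 3592161.75
        const size_t predicted     = mem_per_token * N; // 14368644 < used_mem

        std::printf("used: %zu, predicted: %zu, lost to truncation: %zu bytes\n",
                    used_mem, predicted, used_mem - predicted);
        return 0;
    }

That said, the truncation alone loses at most N - 1 bytes, so by itself it cannot explain the megabyte-scale gaps in the logs above.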

@jarcen commented Mar 21, 2023

Looking further, memory usage also slowly creeps up as the prompt is being read (batch size = 4):

Used mem: 59375120, predicted 57474576
Used mem: 59440656, predicted 57474576
Used mem: 59506192, predicted 57474576
Used mem: 59571728, predicted 57474576
Used mem: 59637264, predicted 57474576
Used mem: 59702800, predicted 57474576
Used mem: 59768336, predicted 57474576
Used mem: 59833872, predicted 57474576
Used mem: 59899408, predicted 57474576
Used mem: 59964944, predicted 57474576
Used mem: 60030480, predicted 57474576
Used mem: 60096016, predicted 57474576
Used mem: 60161552, predicted 57474576
Used mem: 60227088, predicted 57474576
Used mem: 60292624, predicted 57474576
Used mem: 60358160, predicted 57474576
Used mem: 60423696, predicted 57474576
Used mem: 60489232, predicted 57474576
Used mem: 60554768, predicted 57474576

In hindsight that makes sense. The attention mechanism performs more operations as the context grows. Today's code makes the wrong assumption that memory usage per token remains static; it clearly grows by 65536 bytes on each iteration. I think the total memory usage at full context can be inferred from measurements taken across multiple llama_eval calls: it's a simple linear trend.
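
A sketch of that idea, fitting a line through two measurements and extrapolating to the full context (illustrative only; the function and its arguments are made up, not part of main.cpp):

    #include <cstddef>

    // Estimate the scratch memory needed at n_ctx tokens of past context from two
    // llama_eval measurements, assuming usage grows linearly in n_past as the logs suggest.
    size_t estimate_buf_size(size_t used1, int n_past1, size_t used2, int n_past2, int n_ctx) {
        // slope: extra bytes per additional token of past context
        const double per_pos = double(used2 - used1) / (n_past2 - n_past1);
        // intercept: fixed cost independent of how much context has accumulated
        const double base = used1 - per_pos * n_past1;
        // extrapolate to the full context and add ~10% headroom for object overhead
        return size_t(1.10 * (base + per_pos * n_ctx));
    }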

@nonnull-ca

I haven't looked; how does LLaMA handle prompts that are smaller than the context size? E.g. a 1024-token prompt with a 2048-token context. Does it just truncate to an effective context size of 1024, or does it pad somehow?

@hx507 (Author) commented Mar 21, 2023

> Looking further, memory usage also slowly creeps up as the prompt is being read (batch size = 4)

To add another observation, the amount of memory increase per iteration seems to scale quadratically with batch size.

For the 65B 4-bit model, a batch size of 4 gives:

...
mem: 285096976, predicted 284113936
mem: 285424656, predicted 284113936 # delta = 320 KiB

Whereas a batch size of 8 gives:

...
Used mem: 573443104, predicted 568227872
Used mem: 574753824, predicted 568227872 # delta = 1280 KiB

Since 1280 KiB = 4 × 320 KiB, the delta in the latter case is 4 times larger.

That could explain the previous observation that larger batch sizes OOM faster.
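
As a quick arithmetic check of the quadratic hypothesis (using the deltas above): if the per-iteration growth scales with the square of the batch size $N$, then doubling $N$ from 4 to 8 should multiply the delta by

$$\left(\tfrac{8}{4}\right)^2 = 4, \qquad 4 \times 320\,\mathrm{KiB} = 1280\,\mathrm{KiB},$$

which matches the jump observed above.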

@jarcen commented Mar 21, 2023

Right, batch processing at least must construct ggml_diag_mask_inf for masked attention, so each token in the batch can attend not only to past memory but also to its neighbors to the left within the batch. For batch size = 1 all the matrices are simple vectors, but they become rectangular for N > 1.

This is also a good avenue for optimization. 'Scientific' code like this is only good for GPUs, where compute units share an instruction decoder and cannot diverge: if 9 cores do 1 iteration of a loop and 1 core does 10 iterations, they all must do 10 iterations because they execute the same instructions in lockstep. Assuming the same model on a CPU is a huge waste of time and memory, as we can see.

@gjmulder added the high priority label Mar 21, 2023
@nazthelizard122

ask chatGPT for a solution lol

@Green-Sky (Collaborator)

Please try #438 and see if it fixes it. I implemented the observations made in this thread.

@ggerganov closed this Mar 25, 2023
Labels: bug, high priority