Support for DeepseekV3 680B #117

Closed
sorasoras opened this issue Dec 25, 2024 · 113 comments · Fixed by #122

Comments

@sorasoras

https://huggingface.co/deepseek-ai/DeepSeek-V3-Base
well, that's a beast.

@fengyang95

+1

@Nottlespike

Does ktransformers depend on HF transformers support for a model arch? If so, we are going to have to wait until HF transformers supports DeepSeek-V3, as it does not yet, and I don't see a PR from the DeepSeek team yet.

@TyraVex

TyraVex commented Dec 27, 2024

The ktransformer backend is an old commit of llama.cpp iirc

Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

@Nottlespike

The ktransformer backend is an old commit of llama.cpp iirc

Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

Is it just me or is this not promising... I mean, I'm patient, but this means that transformers needs to support DeepSeek-V3, then llama.cpp needs to support DeepSeek-V3, then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation...

@sorasoras
Author

The ktransformer backend is an old commit of llama.cpp iirc

Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

Is it just me or is this not promising... I mean I'm patient but this means that transformers needs to support DeepSeek-V3 then llama.cpp needs to suppport DeepSeek-V3 then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation....

I heard that it's not hard to support V3 on llama.cpp due to its resemblance to V2.

@Nottlespike

The ktransformer backend is an old commit of llama.cpp iirc
Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

Is it just me or is this not promising... I mean I'm patient but this means that transformers needs to support DeepSeek-V3 then llama.cpp needs to suppport DeepSeek-V3 then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation....

I heard that it s not hard to support V3 on llama cpp due to been resemblance to v2

Sadly this is not the case, in two ways. First, V3 per the technical report is far, far more complex than V2. Second, V2 never actually got a native HF transformers implementation, sadly, only a stale draft PR.
This is an issue that shows the problem in action: huggingface/transformers#34335
This is the stale attempt at V2 integration: huggingface/transformers#31976

@sorasoras
Author

The ktransformer backend is an old commit of llama.cpp iirc
Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

Is it just me or is this not promising... I mean I'm patient but this means that transformers needs to support DeepSeek-V3 then llama.cpp needs to suppport DeepSeek-V3 then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation....

I heard that it s not hard to support V3 on llama cpp due to been resemblance to v2

Sadly this is not the case in two ways. Firstly V3 from the technical report is far far more complex than V2. Second V2 never actually got a HF transformers implementation sadly only a stale draft PR. This is an issue that shows the problem in action huggingface/transformers#34335 This is the stale attempt at V2 integration huggingface/transformers#31976

Well, ktransformers does support DeepSeek-V2, so it should just be an upgrade with the next-gen MoE router.

@Nottlespike

The ktransformer backend is an old commit of llama.cpp iirc
Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

Is it just me or is this not promising... I mean I'm patient but this means that transformers needs to support DeepSeek-V3 then llama.cpp needs to suppport DeepSeek-V3 then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation....

I heard that it s not hard to support V3 on llama cpp due to been resemblance to v2

Sadly this is not the case in two ways. Firstly V3 from the technical report is far far more complex than V2. Second V2 never actually got a HF transformers implementation sadly only a stale draft PR. This is an issue that shows the problem in action huggingface/transformers#34335 This is the stale attempt at V2 integration huggingface/transformers#31976

well. kt do support DS2 so it should be just a upgrade with next gen moe router

What they do is use remote code. There is no native HF transformers implementation. Feel free to find it if you think I am incorrect. If people wish to use this for enterprise deployment, letting a model run what is functionally arbitrary code on your servers is a 100% no-go, as it is for my startup.

@mahald

mahald commented Dec 28, 2024

+1

1 similar comment
@lzumot

lzumot commented Dec 29, 2024

+1

@Azure-Tang
Contributor

Azure-Tang commented Dec 30, 2024

Supporting this seems easy; however, it requires approximately 400 GB of RAM even for Q4_K_M?

@Nottlespike

Supporting this seems easy; however, it requires approximately 400g of RAM for even q4km?

Does ktransformers not need transformers or llama.cpp support for a model? huggingface/transformers#35425 Back-of-the-napkin math says more like 5XX GB of VRAM/RAM would be needed given context.
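
As a rough cross-check of these figures (editor's arithmetic, not from the thread; the 4.8 bits/weight average for Q4_K_M is an approximation): DeepSeek-V3 has about 671B total parameters, which already puts the quantized weights alone near 400 GB, before KV cache, activations, and runtime buffers.

params = 671e9            # DeepSeek-V3 total parameter count
bits_per_weight = 4.8     # rough average for a Q4_K_M GGUF quant (approximation)
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB for the quantized weights alone")   # ~403 GB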

@Azure-Tang
Contributor

Supporting this seems easy; however, it requires approximately 400g of RAM for even q4km?

Does ktransformers not need transformers or llama.cpp support for a a model? huggingface/transformers#35425 Back of the napkin math is more like 5XXGB of VRAM/RAM would be needed given context.

Yes, we need transformers' modeling.py.

@16x3b

16x3b commented Dec 31, 2024

Just spitballing. Would it be possible to do some speculative decoding with a smaller model (dense or MoE) and then, on the larger MoE, just use NVMe for some of the experts? Why do we need all experts loaded into RAM at all times instead of selecting experts as necessary?

Again just throwing an idea out there from where my understanding is at. I'd like to understand better why this would not work.

@sorasoras
Author

Just spitballing. Would it be possible to do some speculative decoding with a smaller model (dense or moe) and then on the larger moe just use nvme for some of the experts? Why do we need all experts loaded into ram at all times instead of selecting experts as necessary?

Again just throwing an idea out there from where my understanding is at. I'd like to understand better why this would not work.

why would you need smaller model speculative decoding when you can do it via MTP?

@Nottlespike

Nottlespike commented Jan 2, 2025

@Azure-Tang
We are basically done with integrating DeepSeek-V3 into llama.cpp via a very very elegant solution from @fairydreaming
ggml-org/llama.cpp#10981 (comment)
Can you see if you can try what they did?

@Nottlespike

For those looking for an update on this: I've forked it and I think I know how to get this working, but no ETA.

@ELigoP

ELigoP commented Jan 9, 2025

Supporting this seems easy; however, it requires approximately 400g of RAM for even q4km?

I managed to fit the Q3_K_M quant within 96 GB VRAM + 256 GB RAM. Q4_K_M is out of my reach.

@hvico

hvico commented Jan 9, 2025

Supporting this seems easy; however, it requires approximately 400g of RAM for even q4km?

I manage to fit in Q3_K_M quant within 96GB VRAM + 256GB RAM. Q4_K_M is out of my reach.

Wow, great news. What would be the minimum specs to be able to run Q4_K_M with 96 GB VRAM? Would 512 GB of RAM plus that VRAM be enough? I have that much VRAM, but I would upgrade my RAM for that. Thanks!

@16x3b

16x3b commented Jan 12, 2025

why would you need smaller model speculative decoding when you can do it via MTP?

I don't see how MTP helps me. I'm suggesting speculative decoding because we can get faster inference from a smaller model and only refer to the larger model if confidence is low. No need to call the large model if the small one has a confident answer.

@sorasoras
Author

why would you need smaller model speculative decoding when you can do it via MTP?

I don't see how MTP helps me. I'm suggesting speculative decoding because we can get faster inference from a smaller model and only refer to the larger model if confidence is low. No need to call the large model if the small one has a confident answer.

MTP generates two tokens, and you use the second token as the speculative draft. This is called self-speculative decoding; no need for an extra model.
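
For readers who want the mechanics, here is a minimal greedy sketch of the idea (editor's illustration only; the model interface is hypothetical, not the ktransformers or DeepSeek API): the MTP head proposes a draft for the token after next, and the following forward pass both verifies that draft and produces the logits needed for the next step, so an accepted draft yields two tokens per pass.

import numpy as np

def self_speculative_decode(model, tokens, n_new):
    """Greedy self-speculative decoding with an MTP head (sketch only).

    Assumed interface: model(tokens) -> (main_logits, mtp_logits), each of shape
    [len(tokens), vocab_size]; at position i the main head predicts token i+1 and
    the MTP head predicts token i+2.
    """
    main, mtp = model(tokens)                    # initial pass over the prompt
    produced = 0
    while produced < n_new:                      # may overshoot n_new by one; fine for a sketch
        nxt = int(np.argmax(main[-1]))           # exact next token from the main head
        draft = int(np.argmax(mtp[-1]))          # MTP's guess for the token after it
        tokens = tokens + [nxt, draft]
        produced += 1
        main, mtp = model(tokens)                # one pass: verifies the draft and also
                                                 # yields the logits for the next round
        if int(np.argmax(main[-2])) == draft:    # main head at `nxt` agrees with the draft
            produced += 1                        # hit: two tokens from a single pass
        else:
            tokens[-1] = int(np.argmax(main[-2]))  # miss: keep the main head's own token...
            produced += 1
            main, mtp = model(tokens)              # ...and recompute, since the draft polluted the context
    return tokens

The point is that the "draft model" costs nothing extra here: the MTP head runs inside the same forward pass as the main head.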

@16x3b

16x3b commented Jan 14, 2025

MTP generate two token and you use second token as speculative decode. This is call self-speculative decode. no need for extra model

Understood. That's excellent! Thank you for explaining that to me. I see why there's no need to implement speculative decoding with a model that already has MTP implemented.

@sorasoras
Author

MTP generate two token and you use second token as speculative decode. This is call self-speculative decode. no need for extra model

Understood. That's excellent! Thank you for explaining that to me. I see why there's no need to implement speculative decoding with a model that already has MTP implemented.

If you read the paper, it mentions you can get a 90% hit rate with 2-token MTP and SD.
I guess we are going to get 4-token MTP in the future.
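
Rough throughput implied by that hit rate, under the simple accept/verify scheme sketched earlier in the thread (editor's arithmetic, not a figure from the paper):

p_accept = 0.9                                              # reported MTP draft hit rate
tokens_per_pass = 2 / (p_accept * 1 + (1 - p_accept) * 2)   # hit: 2 tokens / 1 pass; miss: 2 tokens / 2 passes
print(round(tokens_per_pass, 2))                            # ~1.82, i.e. roughly 1.8x plain one-token-per-pass decoding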

@bitnom

bitnom commented Jan 28, 2025

so hot right now

@whisper-bye

Dynamic 1.58-bit
https://unsloth.ai/blog/deepseekr1-dynamic

@ChandanVerma

Any updates on when we can see DeepSeek V3 or R1 in ktransformers? Really looking forward to it. It would be of great help to everyone.

Thanks

@Azure-Tang
Contributor

Any updates by when can we see deepseek v3 or r1 in ktransformers? Really looking forward to it. It would be of great help to everyone

Thanks

I'm actively tracking the DeepSeek V3/R1 integration. The HuggingFace Transformers team is finalizing support for DeepSeek V3 (see PR #35926), which ktransformers relies on.

I'm pausing my vacation to resolve any remaining compatibility issues. Expect a quick follow-up release for ktransformers once dependencies stabilize.

I’ll keep you updated here – appreciate your patience and support! 🙌

@Nondzu

Nondzu commented Jan 29, 2025

I can run DS V3 and R1 on my workstation with 768 GB of RAM. llama.cpp is tested and works fine with DeepSeek V3/R1 Q4_K_M and Q5 models.
If you need any tests or help with the implementation, please ping me.

@ChandanVerma

I can run DS V3 and R1 on my workstation 768g of RAM. llama.cpp tested and works fine with deepseek v3/r1 Q4_K_M and Q5 models. If you need any tests or help to implementation pls ping me.

Is it usable? How many TPS do you get? I am planning to run it on my local server with 512 GB RAM and a 6x3090 system.

@Nottlespike

@RodriMora what is your system specs, by the way? Mine is 4xRTX 3090 (only first two cards used with my command) and Threadripper 3970X with 256GB DDR4 RAM (4 ram channels, RAM read speed about 80GB/s).

Similar to yours but on Epyc cpu:

EPYC 7402 24c/48t 512GB RAM 3200MHz 4x3090s

I'm currently doing some performance tests, comparing it to llama.cpp at different context lengths. I've never used ktransformers, so I don't know how to manage how many GPUs to use? It seems to use 2/4 at the moment.

The relative path for the multi GPU .yaml is at ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml.

With the tutorial on injection being here: https://github.com/kvcache-ai/ktransformers/blob/feat-DeepSeekV3/doc/en/injection_tutorial.md

I've been stymied by the regex for offloading to more than 2 GPUs, with this piece being particularly confusing: (0|[1-9]|[12][0-9]). I assume this relates to the fact that the first 3 layers of the model are dense, but given the pipes I'm not sure how ktransformers handles the "or" regex here and still loads all 61 layers.

Either way it looks like the segfault issue is fixed so I'm rebuilding then testing.
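
A quick way to see what that alternation covers (the surrounding module-name pattern below is my assumption; only the (0|[1-9]|[12][0-9]) part is taken from the rule being discussed):

import re

# Which of the 61 layer indices the quoted alternation matches
pattern = re.compile(r"^model\.layers\.(0|[1-9]|[12][0-9])\.")   # assumed HF-style module names
matched = [i for i in range(61) if pattern.match(f"model.layers.{i}.")]
print(matched)   # 0..29 -- roughly the first half of the 61 layers, which reads like a
                 # two-GPU half-split rather than anything to do with the 3 dense layers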

@Nottlespike

@RodriMora @ELigoP

So it seems NCCL speeds matter a LOT for TPS. I'm getting WAY faster speeds using 2 GPUs with NVLink than vanilla llama.cpp with the same 4x3090/3090 Ti setup with 14 layers offloaded. For me, cuda:0 and cuda:2 are NVLinked together, so when I replaced all cuda:1 instances with cuda:2 in ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml I saw a massive speedup to ~10 tps, which is 2x vanilla llama.cpp, with --max_new_tokens 8192 --total_context 8192 and the same context on vanilla llama.cpp.

Now to write the 4-GPU config! Any assistance with adapting ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu-4.yaml to DeepSeek-V3/R1 would be appreciated @Azure-Tang! I've taken a VERY short look at the paper, and it seems the main differences are 1 dense layer in V2 vs. 3 in V3, and 60 layers in V2 vs. 61 in V3, yet my regex for layer assignment seems to break despite my best efforts and reading the injection tutorial multiple times.

@Azure-Tang
Contributor

curl 'http://0.0.0.0:10002/v1/chat/completions' \
    -X POST \
    -H "Content-Type: application/json" \
    --data-raw '{"model":"deepseek-ai/DeepSeek-V3","messages":[{"role":"system","content":"You are an AI coding assistant. You explain as minimum as possible."},{"role":"user","content":"Write numbers from 1 to 10, each on new line, no coding."}],"stream":false}'

Did you change the 0.0.0.0 for privacy? you should use the actual IP of the server, it's working for me.

curl 'http://192.168.10.2:5000/v1/chat/completions' \
    -X POST \
    -H "Content-Type: application/json" \
    --data-raw '{"model":"deepseek-ai/DeepSeek-V3","messages":[{"role":"system","content":"You are an AI coding assistant. You explain as minimum as possible."},{"role":"user","content":"Write numbers from 1 to 10, each on new line, no coding."}],"stream":false}'

{"id":"7ff95288-d334-431f-9d94-867f372bd497","object":"chat.completion.chunk","created":1738864152,"model":"not implmented","system_fingerprint":"not implmented","usage":null,"choices":[{"index":0,"message":{"content":"<think>\nOkay, so the user wants me to write numbers from 1 to 10, each on a new line, and they specified no coding. Let me make sure I understand. They mentioned \"You are an AI coding assistant\" but then said \"no coding,\" so maybe they want the output to be simple text without any code formatting like in a code block. Got it.\n\nSo, I need to list numbers 1 through 10, each on separate lines. Let me check if there's any other requirement. The user wrote \"explain as minimum as possible,\" which probably means not adding any extra explanations or comments. Just the numbers. Alright.\n\nI think that's all. I'll write each number from 1 to 10, each on a new line. Let me count to confirm: 10 numbers, each separated. Yes, that should do it.\n</think>\n\n1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10","role":"assistant","name":null},"logprobs":null,"finish_reason":null}]}

Thanks! My bad, I didn't pull latest change. Now server responds. Still performance is not as expected (comparable and even worse than llama.cpp, which as I understand offloads only first layers to GPU):

2025-02-06 20:45:21,438 INFO /home/ai/3rdparty/KTransformers-feat-DeepSeekV3/venv/lib/python3.11/site-packages/ktransformers/server/backend/base.py[64]: Performance(T/s): prefill 1.9965464671950348, decode 1.9084481659736685. Time(s): tokenize 0.04080605506896973, prefill 13.523351669311523, decode 259.89702463150024

llama.cpp gets 3.5 tps prompt processing (I guess this is prefill) and 2.5 tps token generation (which I guess is decode).

I will play with settings more.

@RodriMora what is your system specs, by the way? Mine is 4xRTX 3090 (only first two cards used with my command) and Threadripper 3970X with 256GB DDR4 RAM (4 ram channels, RAM read speed about 80GB/s).

Try adjusting cpu_infer according to your CPU core count~

@Azure-Tang
Contributor

I have successfully deployed DeepSeek-V3 on the KTransformers framework, but there are still some accuracy issues that I am currently investigating.
Based on preliminary tests, using the Q4KM weight file requires at least 16GB of VRAM and 480GB of RAM. On a positive note, the generation speed can reach approximately 3.8 tokens/s.”

Thats wonderful.. Really appreciate the hard work. Just a small question if i inject more layers to the gpu since i have 144 GB of VRAM, will the RAM consumption go down and will i be able to allocate more context?

Is it possible to run the DeepSeek R1 in just 4 A100 GPU (roughly 320 GB VRAM) using accelerate with offloading to disk? While this will be really really slow, but for some analysis purpose is it possible to run with this setup?

It's possible. Loading process will be slow in this way, but the inference speed can be greatly accelerated. You can refer this:

I have successfully deployed DeepSeek-V3 on the KTransformers framework, but there are still some accuracy issues that I am currently investigating.
Based on preliminary tests, using the Q4KM weight file requires at least 16GB of VRAM and 480GB of RAM. On a positive note, the generation speed can reach approximately 3.8 tokens/s.”

Thats wonderful.. Really appreciate the hard work. Just a small question if i inject more layers to the gpu since i have 144 GB of VRAM, will the RAM consumption go down and will i be able to allocate more context?

Yes, theoretically you can use the Marlin expert as KExperts' backend (but we haven't tested this op, it may cause some bugs orz, and its loading will be veeeeeeery slow).

I will check our Marlin expert op once we're done supporting DeepSeek-V3.

Hi,
I’ve updated the Marlin backend and added a small example YAML in ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin.yaml.
To fully utilize your VRAM, try injecting more KExpertsMarlin as the KTransformersExperts backend.

Please note:
• If you use KExpertsMarlin, you need to set --use_cuda_graph False. (This will be added to the documentation later.)
• Loading speed will be significantly slower as you add more KExpertsMarlin.

@Azure-Tang
Contributor

@RodriMora @ELigoP

So it seems NCCL speeds matter a LOT for TPS. I'm getting WAY faster speeds using 2 GPU's with NVLink than vanilla llama.cpp with the same 4x3090/3090 Ti setup with 14 layers offloaded. For me cuda:0 and cuda:2 are NVLinked together so when I replaced all cuda:1 instances with cuda:2 in ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml I saw a massive speedup of ~10tps which is 2x vanilla llama.cpp with --max_new_tokens 8192 --total_context 8192 and same context on vanilla llama.cpp.

Now to write the 4 GPU config! Any assitance with adapting the ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu-4.yaml to DeepSeek-V3/R1 would be appreciated @Azure-Tang! I've taken a VERY short look at the paper and it seems that the main difference here is that there is 1 dense layer in V2 vs the 3 in V3 and 60 layers in V2 vs 61 layers in V3 yet my regex for layer assignment seems to break despite my best efforts and reading the injection tutorial multiple times.

You can copy the v2 4-GPU YAML and make slight modifications. The changes needed can be identified by comparing the differences between DeepSeek-V2-Chat-multi-gpu and DeepSeek-V3-Chat-multi-gpu.
I’m sure this will be a piece of cake for you! >.-

@Nottlespike

@RodriMora @ELigoP
So it seems NCCL speeds matter a LOT for TPS. I'm getting WAY faster speeds using 2 GPU's with NVLink than vanilla llama.cpp with the same 4x3090/3090 Ti setup with 14 layers offloaded. For me cuda:0 and cuda:2 are NVLinked together so when I replaced all cuda:1 instances with cuda:2 in ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml I saw a massive speedup of ~10tps which is 2x vanilla llama.cpp with --max_new_tokens 8192 --total_context 8192 and same context on vanilla llama.cpp.
Now to write the 4 GPU config! Any assitance with adapting the ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu-4.yaml to DeepSeek-V3/R1 would be appreciated @Azure-Tang! I've taken a VERY short look at the paper and it seems that the main difference here is that there is 1 dense layer in V2 vs the 3 in V3 and 60 layers in V2 vs 61 layers in V3 yet my regex for layer assignment seems to break despite my best efforts and reading the injection tutorial multiple times.

You can copy the v2 4-GPU YAML and make slight modifications. The changes needed can be identified by comparing the differences between DeepSeek-V2-Chat-multi-gpu and DeepSeek-V3-Chat-multi-gpu. I’m sure this will be a piece of cake for you! >.-

Oh dear.

@huliangbing

Waiting for the results of 4 GPUs ....

@RodriMora
Contributor

RodriMora commented Feb 7, 2025

I did some test with my system:

Command to test:
ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL --total_context 1024 --max_new_tokens 512 --port 5000 --host 0.0.0.0 --cpu_infer 24 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin.yaml

|   | Default | --cpu_infer 22 | --cpu_infer 24 | --cpu_infer 46 | --cpu_infer 48 | --cpu_infer 24 + v3-2Gpu.yaml | --cpu_infer 24 + v3-4Gpu.yaml |
|---|---|---|---|---|---|---|---|
| Prefill (t/s) | 12.32 | 23.87 | 25.11 | 24.02 | 23.99 | 25.12 | 25.07 |
| Decode (t/s) | 4.38 | 5.12 | 5.16 | 5.07 | 3.66 | 5.13 | 5.16 |
| Time prefill (s) | 247.32 | 127.59 | 121.29 | 126.82 | 126.94 | 121.24 | 121.48 |
| Time decode (s) | 116.54 | 99.88 | 99.03 | 66.31 | 139.74 | 99.52 | 99.10 |
| Tokens prefill | 3046 | 3046 | 3046 | 3046 | 3046 | 3046 | 3046 |
| Tokens decode | 511 | 511 | 511 | 336 | 511 | 511 | 511 |

For the 2-GPU optimize rules I used this (I would say same results as without any optimization rules; it seems to use 2 GPUs anyway by default, but not fully, less than 50% on each card):
https://github.com/kvcache-ai/ktransformers/blob/feat-DeepSeekV3/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin.yaml

For the 4-GPU optimize rules I made this one; it also uses low VRAM, only 4-6 GB per card:
https://github.com/RodriMora/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin-4.yaml

Note: --use_cuda_graph False doesn't seem to be accepted as an arg when loading the OpenAI API server.

Edit: Comparison vs. llama.cpp:
Same prompt and tokens. llama.cpp with -ngl 20, with the GPUs full at 23/24 GB VRAM. ~40% increase with ktransformers.

prompt eval time =   78999.88 ms /  3046 tokens (   25.94 ms per token,    38.56 tokens per second)
       eval time =  161916.39 ms /   512 tokens (  316.24 ms per token,     3.16 tokens per second)
      total time =  240916.27 ms /  3558 tokens
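
Reading the figures above together (editor's arithmetic on the quoted numbers only, nothing new measured): the ~40% figure matches the "Default" decode column, the tuned configs gain a bit more on decode, while prefill in this setup trails llama.cpp.

llama_prefill_tps, llama_decode_tps = 38.56, 3.16       # llama.cpp prompt eval / eval rates above
kt_default_decode = 4.38                                # ktransformers "Default" column, decode t/s
kt_best_decode, kt_best_prefill = 5.16, 25.12           # best ktransformers columns
print(round(kt_default_decode / llama_decode_tps, 2))   # ~1.39 -> the "40% increase" on decode
print(round(kt_best_decode / llama_decode_tps, 2))      # ~1.63 with tuned cpu_infer / multi-GPU yaml
print(round(kt_best_prefill / llama_prefill_tps, 2))    # ~0.65 -> prefill here is slower than llama.cpp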

@KMSorSMS
Contributor

KMSorSMS commented Feb 8, 2025


This testing result is consistent with our own tests, which will come soon. Our performance video shows faster prefill and decode compared with llama.cpp. It seems your CPU has about 24 cores on a single NUMA node, and the multi-GPU setup also costs some performance (our multi-GPU support uses pipeline parallelism, not tensor parallelism, so putting everything on a single GPU is best). You can try modifying the DeepSeek-V3 .yaml to inject more Marlin experts to fully utilize the GPU, and use numactl -N 1 -m 1 to bind inference to the same NUMA node. And if you want to see more, you can set the --cache_lens command arg to 1536, but it will constrain your input and output lengths, so be cautious. Our detailed test will come with a stable new release. Coming soon~ 😃

@Azure-Tang
Contributor

Hi! We’ve merged the PR supporting Deepseek-V3/R1 into the main branch, and we’ve also introduced additional optimizations for even better performance. For more details, please check our README and tutorial. Thank you for your support—if you find this helpful, we’d really appreciate any recommendations you share with your friends or community!

@davidsyoung

Thank you for all of your work on this! If you could create some theoretical optimised YAML versions for more GPUs, I could test them here on my side. I have 10x 3090s, soon to be 12, with a 7713 EPYC and 256 GB of 3200 MHz RAM.

@RodriMora
Contributor

What does "--cache_lens 1536" do? I went from 5.2 t/s decode to 9 t/s using the current main branch compiled (so I think 0.2, not 0.3). That's an insane increase in speed. Is there any compromise?

For the rest, using either:
DeepSeek-V3-Chat.yaml
DeepSeek-V3-Chat-multi-gpu.yaml
DeepSeek-V3-Chat-multi-gpu-marlin.yaml

yields the same results. I think using the single-GPU DeepSeek-V3-Chat.yaml for longer context gives OOM errors; the rest work fine but with similar performance: 24-25 prefill and 5 decode.

I'll try 0.3 next

@RodriMora
Contributor

I installed 0.3.

A few notes: on my Ubuntu 22.04 installation I needed to add:
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install --only-upgrade libstdc++6

Otherwise it gives this error:
ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/KTransformersOps.cpython-311-x86_64-linux-gnu.so)

But I guess my CPU is not compatible? EPYC 7402

(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]> ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL/ --max_new_tokens 512 --length 512 --total_context 512 --port 5000 --host 0.0.0.0 --cpu_infer 24
2025-02-10 11:14:11,145 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/main.py[29]: Creating SQL tables
2025-02-10 11:14:11,147 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
__AVX512F__
fish: Job 1, 'ktransformers --model_path deep…' terminated by signal SIGILL (Illegal instruction)
(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]>

@KMSorSMS
Contributor

KMSorSMS commented Feb 10, 2025

What does the "--cache_lens 1536" do? I got from 5.2t/s decode to 9t/s, using the current main branch compiled (so I think 0.2, not 0.3). That's an insane increase of speed. Is there any compromise?

For the rest, using either: DeepSeek-V3-Chat.yaml DeepSeek-V3-Chat-multi-gpu.yaml DeepSeek-V3-Chat-multi-gpu-marlin.yaml

yields the same results. I think using the single gpu DeepSeek-V3-Chat.yaml for longer context give OOM errors, the rest works fine but with similar performance. 24-25 prefill and 5 decode.

I'll try 0.3 next

TL;DR

The --cache_lens arg has no effect there anymore.
--cache_lens used to define a static area for computing attention even when the prompt was nowhere near that large, so in older versions, setting it closer to the actual prompt length made things faster (the default was too large). In the new version we set it to the prompt length plus your max_new_tokens. (But it can't support history context for now.)


Some further detail:
Actually, if you use the current main branch (which is our release v0.2), we got rid of the cache_lens arg in local_chat. (But it remains in the server backend, so we keep it in the tutorial command.) The bigger increase more likely comes from our dual-socket support (make sure you first set the env var USE_NUMA: export USE_NUMA=1; see the link for details). Tip: you may need to use apt to install libnuma. If you don't use this, you can also check our best single-socket test (link).

@chenht2022
Contributor

I installed 0.3.

Few notes, in my Ubuntu 22.04 installation was needed to add the: sudo add-apt-repository ppa:ubuntu-toolchain-r/test sudo apt-get update sudo apt-get install --only-upgrade libstdc++6

Othewise it gives this error: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.32' not found (required by /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/KTransformersOps.cpython-311-x86_64-linux-gnu.so)`

But I guess my CPU is not compatible? EPYC 7402

(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]> ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL/ --max_new_tokens 512 --length 512 --total_context 512 --port 5000 --host 0.0.0.0 --cpu_infer 24
2025-02-10 11:14:11,145 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/main.py[29]: Creating SQL tables
2025-02-10 11:14:11,147 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
__AVX512F__
fish: Job 1, 'ktransformers --model_path deep…' terminated by signal SIGILL (Illegal instruction)
(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]>

AMD CPUs like EPYC do not support AMX instructions; our v0.3 currently only supports Intel Xeon CPUs.

@chenht2022 chenht2022 reopened this Feb 10, 2025
@RodriMora
Contributor

What does the "--cache_lens 1536" do? I got from 5.2t/s decode to 9t/s, using the current main branch compiled (so I think 0.2, not 0.3). That's an insane increase of speed. Is there any compromise?
For the rest, using either: DeepSeek-V3-Chat.yaml DeepSeek-V3-Chat-multi-gpu.yaml DeepSeek-V3-Chat-multi-gpu-marlin.yaml
yields the same results. I think using the single gpu DeepSeek-V3-Chat.yaml for longer context give OOM errors, the rest works fine but with similar performance. 24-25 prefill and 5 decode.
I'll try 0.3 next

TL;DR

The --cache_lens has no use. It's because --cache_lens used to be an arg which indicates a static area to compute attention even if there is no such a huge prompt to calculate. So in the older version, we set it smaller to our actual prompt will make it faster (as the default is too large). Now the new version we set this to the prompt length and your max_nex_token size. (But it can't support history context now).

Some further intro: Actually, If you use the current main branch( which is our release V0.2), we get rid of the arg cache-lens in local_chat. (But it remains in server backend, so we just keep this in the tutorial command ). The more insane increase can come from our support for dual-socket support ( make sure you first set env var USE_NUMA export USE_NUMA=1, see the link in detail). Tips: you may need to use apt to install the libnuma. If you don't use this, you can also check our best test on a single socket. link.

numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 515634 MB
node 0 free: 14832 MB
node distances:
node   0
  0:  10

My server only has 1 socket and 1 NUMA node, so it doesn't get any benefit from numactl -N 1 -m 1 (or numactl -N 0 -m 0 in my case).
The --cache_lens 1536 though gives an 80% increase in speed.

@KMSorSMS
Contributor

What does the "--cache_lens 1536" do? I got from 5.2t/s decode to 9t/s, using the current main branch compiled (so I think 0.2, not 0.3). That's an insane increase of speed. Is there any compromise?
For the rest, using either: DeepSeek-V3-Chat.yaml DeepSeek-V3-Chat-multi-gpu.yaml DeepSeek-V3-Chat-multi-gpu-marlin.yaml
yields the same results. I think using the single gpu DeepSeek-V3-Chat.yaml for longer context give OOM errors, the rest works fine but with similar performance. 24-25 prefill and 5 decode.
I'll try 0.3 next

TL;DR
The --cache_lens has no use. It's because --cache_lens used to be an arg which indicates a static area to compute attention even if there is no such a huge prompt to calculate. So in the older version, we set it smaller to our actual prompt will make it faster (as the default is too large). Now the new version we set this to the prompt length and your max_nex_token size. (But it can't support history context now).
Some further intro: Actually, If you use the current main branch( which is our release V0.2), we get rid of the arg cache-lens in local_chat. (But it remains in server backend, so we just keep this in the tutorial command ). The more insane increase can come from our support for dual-socket support ( make sure you first set env var USE_NUMA export USE_NUMA=1, see the link in detail). Tips: you may need to use apt to install the libnuma. If you don't use this, you can also check our best test on a single socket. link.

numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 515634 MB
node 0 free: 14832 MB
node distances:
node   0
  0:  10

My server only has 1 socket and 1 numa, so it doesn't get any benefit from numactl -N 1 -m 1 (or numactl -N 0 -m 0 in my case). The --cache_lens 1536 though gives a 80% increase in speed

Have you tested again without the --cache_lens arg? It has no effect now in a local_chat run.

@RodriMora
Contributor

Have you test again without the --cache_lens args? As it has no use now in local_chat run.

Yes. Got the same performance in local_chat. The difference is in the OpenAI API server.

With --cache_lens
prompt eval count: 16 token(s)
prompt eval duration: 1.1299359798431396s
prompt eval rate: 14.160094275625383 tokens/s
eval count: 1000 token(s)
eval duration: 98.3808274269104s
eval rate: 10.164582125952592 tokens/s

Without: --cache_lens
prompt eval count: 16 token(s)
prompt eval duration: 1.133697271347046s
prompt eval rate: 14.11311503024876 tokens/s
eval count: 1000 token(s)
eval duration: 98.45745468139648s
eval rate: 10.156671257000815 tokens/s

@KMSorSMS
Contributor

Have you test again without the --cache_lens args? As it has no use now in local_chat run.

Yes. Got the same performance on local_chat. The difference is in the openai api server.

With --cache_lens prompt eval count: 16 token(s) prompt eval duration: 1.1299359798431396s prompt eval rate: 14.160094275625383 tokens/s eval count: 1000 token(s) eval duration: 98.3808274269104s eval rate: 10.164582125952592 tokens/s

Without: --cache_lens prompt eval count: 16 token(s) prompt eval duration: 1.133697271347046s prompt eval rate: 14.11311503024876 tokens/s eval count: 1000 token(s) eval duration: 98.45745468139648s eval rate: 10.156671257000815 tokens/s

Good! This is exactly what we have tested and posted in the report (our result is 8.73 tokens/s in decoding, as our single socket only gets 32 cores). The cache_lens arg has no effect in the current version of our local_chat.py.

@ArYuZzz

ArYuZzz commented Feb 11, 2025

I installed 0.3.
Few notes, in my Ubuntu 22.04 installation was needed to add the: sudo add-apt-repository ppa:ubuntu-toolchain-r/test sudo apt-get update sudo apt-get install --only-upgrade libstdc++6
Othewise it gives this error: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.32' not found (required by /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/KTransformersOps.cpython-311-x86_64-linux-gnu.so)`
But I guess my CPU is not compatible? EPYC 7402

(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]> ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL/ --max_new_tokens 512 --length 512 --total_context 512 --port 5000 --host 0.0.0.0 --cpu_infer 24
2025-02-10 11:14:11,145 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/main.py[29]: Creating SQL tables
2025-02-10 11:14:11,147 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
__AVX512F__
fish: Job 1, 'ktransformers --model_path deep…' terminated by signal SIGILL (Illegal instruction)
(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]>

AMD CPUs like EPYC does not support AMX instruction, our V0.3 currently only supports Intel Xeon CPUs.

Does v0.3 support earlier Xeon CPUs like Gen 3 or Gen 2, which only support AVX-512?
And if not, does v0.2 support AVX-512, and can it also accelerate DeepSeek inference?

@KMSorSMS
Contributor

I installed 0.3.
Few notes, in my Ubuntu 22.04 installation was needed to add the: sudo add-apt-repository ppa:ubuntu-toolchain-r/test sudo apt-get update sudo apt-get install --only-upgrade libstdc++6
Othewise it gives this error: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.32' not found (required by /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/KTransformersOps.cpython-311-x86_64-linux-gnu.so)`
But I guess my CPU is not compatible? EPYC 7402

(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]> ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL/ --max_new_tokens 512 --length 512 --total_context 512 --port 5000 --host 0.0.0.0 --cpu_infer 24
2025-02-10 11:14:11,145 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/main.py[29]: Creating SQL tables
2025-02-10 11:14:11,147 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
__AVX512F__
fish: Job 1, 'ktransformers --model_path deep…' terminated by signal SIGILL (Illegal instruction)
(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]>

AMD CPUs like EPYC does not support AMX instruction, our V0.3 currently only supports Intel Xeon CPUs.

Does v0.3 supports earlier Xeon CPUs like gen3 or gen2 which only support AVX-512? And if not, does v0.2 support AVX-512 and can also accelerate the inference of deepseek?

The preview version of v0.3 doesn't support earlier Xeon CPUs. But v0.2 can, as it doesn't contain the AMX instruction acceleration implementation. So you can try v0.2 to accelerate DeepSeek inference.

@chenht2022
Contributor

chenht2022 commented Feb 11, 2025

AMX instructions are not supported on Xeon Gen 2 and Gen 3. You can check the supported instruction sets with the lscpu command; if AMX instructions are supported, there will be flags such as amx_bf16, amx_tile, amx_int8.
v0.2 will automatically use the AVX-512 instruction set on hardware that supports it.
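
A small way to script that check on Linux (just grepping /proc/cpuinfo for the flags named above; nothing ktransformers-specific):

# Check for the AMX CPU flags mentioned above (Linux only)
with open("/proc/cpuinfo") as f:
    cpuinfo = f.read()
print("AMX supported:", any(flag in cpuinfo for flag in ("amx_bf16", "amx_tile", "amx_int8")))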

@WuNein

WuNein commented Feb 11, 2025

Must I download the BF16 GGUF version to use v0.3?

https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-BF16

It is giant!!

@chenht2022
Contributor

must I download the BF16 version gguf to use V0.3?

https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-BF16

It is giant!!

Yes, for the v0.3 preview version. We will consider optimizing this in the official release, maybe supporting online dequantization.

@ArYuZzz

ArYuZzz commented Feb 11, 2025

I installed 0.3.
Few notes, in my Ubuntu 22.04 installation was needed to add the: sudo add-apt-repository ppa:ubuntu-toolchain-r/test sudo apt-get update sudo apt-get install --only-upgrade libstdc++6
Othewise it gives this error: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.32' not found (required by /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/KTransformersOps.cpython-311-x86_64-linux-gnu.so)`
But I guess my CPU is not compatible? EPYC 7402

(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]> ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL/ --max_new_tokens 512 --length 512 --total_context 512 --port 5000 --host 0.0.0.0 --cpu_infer 24
2025-02-10 11:14:11,145 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/main.py[29]: Creating SQL tables
2025-02-10 11:14:11,147 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
__AVX512F__
fish: Job 1, 'ktransformers --model_path deep…' terminated by signal SIGILL (Illegal instruction)
(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]>

AMD CPUs like EPYC does not support AMX instruction, our V0.3 currently only supports Intel Xeon CPUs.

Does v0.3 supports earlier Xeon CPUs like gen3 or gen2 which only support AVX-512? And if not, does v0.2 support AVX-512 and can also accelerate the inference of deepseek?

The preview version of V0.3 doesn't support earlier Xeon CPU. But V0.2 can support as it doesn't contain AMX instruction acceleration implementation. So you can try V0.2 to accelerate the inference of deepseek

Thanks for the reply and the contribution. Will the upcoming v0.3 support AVX-512?

@KMSorSMS
Copy link
Contributor

I installed 0.3.
Few notes, in my Ubuntu 22.04 installation was needed to add the: sudo add-apt-repository ppa:ubuntu-toolchain-r/test sudo apt-get update sudo apt-get install --only-upgrade libstdc++6
Othewise it gives this error: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.32' not found (required by /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/KTransformersOps.cpython-311-x86_64-linux-gnu.so)`
But I guess my CPU is not compatible? EPYC 7402

(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]> ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL/ --max_new_tokens 512 --length 512 --total_context 512 --port 5000 --host 0.0.0.0 --cpu_infer 24
2025-02-10 11:14:11,145 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/main.py[29]: Creating SQL tables
2025-02-10 11:14:11,147 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
__AVX512F__
fish: Job 1, 'ktransformers --model_path deep…' terminated by signal SIGILL (Illegal instruction)
(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]>

AMD CPUs like EPYC does not support AMX instruction, our V0.3 currently only supports Intel Xeon CPUs.

Does v0.3 supports earlier Xeon CPUs like gen3 or gen2 which only support AVX-512? And if not, does v0.2 support AVX-512 and can also accelerate the inference of deepseek?

The preview version of V0.3 doesn't support earlier Xeon CPU. But V0.2 can support as it doesn't contain AMX instruction acceleration implementation. So you can try V0.2 to accelerate the inference of deepseek

Thanks for reply and contribution. Will the upcoming v0.3 supports AVX512?

We will consider it, but most likely it will not, as v0.3 mostly focuses on AMX optimization.
