Support for DeepseekV3 680B #117
+1
Does ktransformers depend on HF transformers support for a model arch? If so, we are going to have to wait until DeepSeek-V3 is supported in HF transformers, which it is not yet, and I don't see a PR from the DeepSeek team yet.
The ktransformers backend is an old commit of llama.cpp, IIRC. Edit: it's 6 months old: https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f
Is it just me or is this not promising... I mean, I'm patient, but this means that transformers needs to support DeepSeek-V3, then llama.cpp needs to support DeepSeek-V3, then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation...
I heard that it's not hard to support V3 in llama.cpp due to its resemblance to V2.
Sadly, this is not the case, in two ways. First, V3 as described in the technical report is far more complex than V2. Second, V2 never actually got a native HF transformers implementation, sadly; only a stale draft PR.
Well, ktransformers does support DeepSeek-V2, so it should just be an upgrade with the next-gen MoE router.
What they do is use remote code. There is no native HF transformers implementation. Feel free to find it if you think I am incorrect. If people wish to use this for enterprise deployment, letting a model run what is functionally arbitrary code on your servers is 100% a no-go, like it is for my startup.
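To make "remote code" concrete, this is roughly what loading such a model looks like with HF transformers (a minimal sketch; the repo ID here is DeepSeek-V2's, and whether you accept trust_remote_code is exactly the policy question above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading a model whose architecture ships as custom Python code in the HF repo.
# trust_remote_code=True downloads and executes that code locally, which is
# the enterprise-policy concern raised above.
model_id = "deepseek-ai/DeepSeek-V2"  # example repo; V3 works the same way
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```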
+1
+1
Supporting this seems easy; however, it requires approximately 400 GB of RAM even for Q4_K_M?
Doesn't ktransformers need transformers or llama.cpp support for a model? huggingface/transformers#35425 Back-of-the-napkin math says more like 5XX GB of VRAM/RAM would be needed once context is included.
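For reference, a rough version of that back-of-the-napkin math (assumed figures: ~671B total parameters and roughly 4.5-5 effective bits per weight for Q4_K_M; context/KV-cache overhead comes on top):

```python
# Rough memory estimate for a Q4_K_M-style quant of DeepSeek-V3.
total_params = 671e9       # total parameters, all MoE experts included (assumed)
bits_per_weight = 4.8      # effective bits/weight for Q4_K_M, roughly (assumed)

weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB")  # ~400 GB

# KV cache, activations, and runtime overhead scale with context length,
# which is how the total creeps toward the 5xx GB ballpark mentioned above.
```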
Yes, we need transformers' modeling.py.
Just spitballing. Would it be possible to do some speculative decoding with a smaller model (dense or MoE), and then on the larger MoE just use NVMe for some of the experts? Why do we need all experts loaded into RAM at all times instead of selecting experts as necessary? Again, just throwing an idea out there from where my understanding is at. I'd like to understand better why this would not work.
Why would you need smaller-model speculative decoding when you can do it via MTP?
@Azure-Tang
For those looking for an update on this: I've forked it and I think I know how to get this working, but no ETA.
I managed to fit the Q3_K_M quant within 96 GB VRAM + 256 GB RAM. Q4_K_M is out of my reach.
Wow, great news. What would be the minimum specs to be able to run a Q4_K_M with 96 GB VRAM? Would 512 GB of RAM plus that VRAM be enough? I have that much VRAM, but I will upgrade my RAM for that. Thanks!
I don't see how MTP helps me. I'm suggesting speculative decoding because we can get faster inference from a smaller model and only refer to the larger model if confidence is low. No need to call the large model if the small one has a confident answer.
MTP generates two tokens, and you use the second token as the speculative draft. This is called self-speculative decoding; no need for an extra model.
Understood. That's excellent! Thank you for explaining that to me. I see why there's no need to implement speculative decoding with a model that already has MTP implemented.
If you read the paper, it mentions you can get a ~90% hit rate with MTP 2-token prediction and speculative decoding.
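A conceptual sketch of that self-speculative loop (the model.step interface below is hypothetical, not ktransformers' real API; acceptance in practice is done on logits/probabilities rather than exact token equality):

```python
def self_speculative_generate(model, prompt, max_new_tokens):
    """Illustrative MTP self-speculative decoding with one draft token.

    Assumed (hypothetical) interface: model.step(tokens, draft) runs a single
    forward pass over tokens with `draft` tentatively appended and returns
      accepted  - True if the main head agrees with the pending draft
      next_tok  - the main head's next confirmed token after what was accepted
      new_draft - the MTP head's guess one position further ahead
    """
    tokens = list(prompt)
    draft = None
    while len(tokens) - len(prompt) < max_new_tokens:
        accepted, next_tok, new_draft = model.step(tokens, draft)
        if draft is not None and accepted:
            tokens.append(draft)      # draft verified "for free" in this pass
        tokens.append(next_tok)       # the pass's regular token
        draft = new_draft             # carry the new draft into the next pass
    return tokens[:len(prompt) + max_new_tokens]
```

With a high acceptance rate, most passes yield two tokens instead of one, which is where the speedup comes from.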
so hot right now
Dynamic 1.58-bit
Any updates on when we can see DeepSeek V3 or R1 in ktransformers? Really looking forward to it. It would be of great help to everyone. Thanks!
I'm actively tracking the DeepSeek V3/R1 integration. The HuggingFace Transformers team is finalizing support for DeepSeek V3 (see PR #35926), which ktransformers relies on. I'm pausing my vacation to resolve any remaining compatibility issues. Expect a quick follow-up release for ktransformers once dependencies stabilize. I'll keep you updated here. Appreciate your patience and support! 🙌
I can run DS V3 and R1 on my workstation with 768 GB of RAM. llama.cpp is tested and works fine with the DeepSeek V3/R1 Q4_K_M and Q5 models.
Is it usable? How many TPS do you get? I am planning to run it on my local server with 512 GB RAM and a 6x3090 setup.
The relative path for the multi-GPU .yaml is at , with the tutorial on injection being here: https://github.com/kvcache-ai/ktransformers/blob/feat-DeepSeekV3/doc/en/injection_tutorial.md I've been stymied by the regex for offloading to more than 2 GPUs, with this piece specifically being particularly confusing. Either way, it looks like the segfault issue is fixed, so I'm rebuilding and then testing.
So it seems NCCL speeds matter a LOT for TPS. I'm getting WAY faster speeds using 2 GPUs with NVLink than vanilla llama.cpp with the same 4x3090/3090 Ti setup with 14 layers offloaded. Now to write the 4-GPU config! Any assistance with adapting the
Try to adjust
Hi, please note:
You can copy the V2 4-GPU YAML and make slight modifications. The changes needed can be identified by comparing the differences between DeepSeek-V2-Chat-multi-gpu and DeepSeek-V3-Chat-multi-gpu.
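For intuition, the injection rules are essentially ordered regex matches on module names that decide where each module runs; a toy Python illustration of that first-match-wins idea (the module names, the layer split, and the devices below are made-up examples, not the real YAML contents):

```python
import re

# Toy first-match-wins routing, mimicking the multi-GPU injection rules.
# The patterns, the 0-29/30-60 split, and the devices are illustrative only.
rules = [
    (r"\.experts$", "cpu"),                                      # routed experts stay in host RAM
    (r"^model\.layers\.([0-9]|[12][0-9])\.", "cuda:0"),          # layers 0-29
    (r"^model\.layers\.(3[0-9]|4[0-9]|5[0-9]|60)\.", "cuda:1"),  # layers 30-60
]

def assign_device(module_name: str) -> str:
    for pattern, device in rules:
        if re.search(pattern, module_name):
            return device
    return "cuda:0"  # fallback for anything not matched above

print(assign_device("model.layers.42.self_attn"))   # cuda:1
print(assign_device("model.layers.3.mlp.experts"))  # cpu
```

Extending to 4 GPUs is then mostly a matter of adding more layer-range patterns, which is what the V2 4-GPU YAML encodes.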
Oh my!
Waiting for the results of 4 GPUs...
I did some tests with my system. Command to test:
For the 2-GPU optimize rules I used this (I would say same results as without any optimization rules; it seems to use 2 GPUs anyway by default, but not fully, like less than 50% on each card): For the 4-GPU optimize rules I made this, which also uses low VRAM, only 4-6 GB per card: Note: --use_cuda_graph False doesn't seem to be enabled as an arg while loading the OpenAI API. Edit: Comparison vs llama.cpp:
This testing result is consistent with our tests, which will be published soon. Our performance video shows faster prefill and decode compared with llama.cpp. It seems your CPU has about 24 cores on a single NUMA node, and the multi-GPU setup also costs some performance (our multi-GPU support uses pipeline parallelism (PP), not tensor parallelism, so putting everything on a single GPU is best). You can try to modify the DeepSeek-V3 .yaml to inject more Marlin experts to fully utilize the GPU, and use the command
Hi! We've merged the PR supporting DeepSeek-V3/R1 into the main branch, and we've also introduced additional optimizations for even better performance. For more details, please check our README and tutorial. Thank you for your support; if you find this helpful, we'd really appreciate any recommendations you share with your friends or community!
Thank you for all of your work on this! If you could create some theoretical optimized YAML versions for more GPUs, I could test them here on my side. I have 10x3090s, soon to be 12, with a 7713 EPYC and 256 GB of 3200 MHz RAM.
What does "--cache_lens 1536" do? I went from 5.2 t/s decode to 9 t/s, using the current main branch compiled (so I think 0.2, not 0.3). That's an insane increase in speed. Is there any compromise? For the rest, using either: yields the same results. I think using the single-GPU DeepSeek-V3-Chat.yaml for longer context gives OOM errors; the rest works fine but with similar performance: 24-25 prefill and 5 decode. I'll try 0.3 next.
I installed 0.3. A few notes: in my Ubuntu 22.04 installation I needed to add the: Otherwise it gives this error: But I guess my CPU is not compatible? EPYC 7402
TL;DR: --cache_lens has no effect. Some further intro:
AMD CPUs like EPYC do not support AMX instructions; our V0.3 currently only supports Intel Xeon CPUs.
My server only has 1 socket and 1 NUMA node, so it doesn't get any benefit from numactl -N 1 -m 1 (or numactl -N 0 -m 0 in my case).
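A quick way to confirm that (Linux-only sketch; it just counts the NUMA node directories the kernel exposes):

```python
import glob

# Count NUMA nodes; with a single node, numactl -N/-m binding changes nothing.
nodes = sorted(glob.glob("/sys/devices/system/node/node[0-9]*"))
print(f"{len(nodes)} NUMA node(s): {nodes}")
```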
Have you tested again without the --cache_lens arg? It has no effect now in the local_chat run.
Yes. I got the same performance in local_chat. The difference is in the OpenAI API server. With --cache_lens: Without --cache_lens:
Good! This is exactly what we tested and posted in the report (our result is 8.73 tokens/s in decoding, as our single socket only has 32 cores). The cache_lens arg has no effect in the current version of our local_chat.py.
Does v0.3 support earlier Xeon CPUs like Gen 3 or Gen 2, which only support AVX-512?
The preview version of V0.3 doesn't support earlier Xeon CPUs. But V0.2 can, as it doesn't contain the AMX instruction acceleration implementation. So you can try V0.2 to accelerate DeepSeek inference.
AMX instructions are not supported on Xeon Gen 2 and Gen 3, and you can also check the supported instruction sets through the command
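As an alternative, a small Linux-only Python check of the CPU flags (amx_tile/amx_int8/amx_bf16 are the flags AMX-capable Xeons typically report, and avx512f is the AVX-512 foundation flag):

```python
# Check /proc/cpuinfo flags for AMX and AVX-512 support (Linux only).
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

print("AMX:", bool({"amx_tile", "amx_int8", "amx_bf16"} & flags))
print("AVX-512 (foundation):", "avx512f" in flags)
```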
Must I download the BF16 GGUF version to use V0.3? https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-BF16 It is giant!!
Yes, for the V0.3 preview version. We will consider optimizing it in the official release, maybe supporting online dequantization.
Thanks for the reply and the contribution. Will the upcoming v0.3 support AVX-512?
We will consider it, but most likely it will not be included, as V0.3 mostly focuses on AMX optimization.
https://huggingface.co/deepseek-ai/DeepSeek-V3-Base
well, that's a beast.