Support for DeepseekV3 680B #117

Closed
sorasoras opened this issue Dec 25, 2024 · 113 comments · Fixed by #122

Comments

@sorasoras

https://huggingface.co/deepseek-ai/DeepSeek-V3-Base
well, that's a beast.

@fengyang95

+1

@Nottlespike

Does ktransformers depend on HF transformers support for a model arch? If so, we are going to have to wait until HF transformers supports DeepSeek-V3, as it does not yet, and I don't see a PR from the DeepSeek team yet.

@TyraVex

TyraVex commented Dec 27, 2024

The ktransformer backend is an old commit of llama.cpp iirc

Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

@Nottlespike

The ktransformer backend is an old commit of llama.cpp iirc

Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

Is it just me or is this not promising... I mean, I'm patient, but this means that transformers needs to support DeepSeek-V3, then llama.cpp needs to support DeepSeek-V3, then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation...

@sorasoras
Author

The ktransformer backend is an old commit of llama.cpp iirc

Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

Is it just me or is this not promising... I mean I'm patient but this means that transformers needs to support DeepSeek-V3 then llama.cpp needs to suppport DeepSeek-V3 then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation....

I heard that it's not hard to support V3 on llama.cpp due to its resemblance to V2.

@Nottlespike

The ktransformer backend is an old commit of llama.cpp iirc
Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

Is it just me or is this not promising... I mean I'm patient but this means that transformers needs to support DeepSeek-V3 then llama.cpp needs to suppport DeepSeek-V3 then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation....

I heard that it s not hard to support V3 on llama cpp due to been resemblance to v2

Sadly this is not the case, in two ways. First, V3 per the technical report is far, far more complex than V2. Second, V2 never actually got a native HF transformers implementation, sadly, only a stale draft PR.
This is an issue that shows the problem in action: huggingface/transformers#34335
This is the stale attempt at V2 integration: huggingface/transformers#31976

@sorasoras
Author

The ktransformer backend is an old commit of llama.cpp iirc
Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

Is it just me or is this not promising... I mean I'm patient but this means that transformers needs to support DeepSeek-V3 then llama.cpp needs to suppport DeepSeek-V3 then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation....

I heard that it s not hard to support V3 on llama cpp due to been resemblance to v2

Sadly this is not the case in two ways. Firstly V3 from the technical report is far far more complex than V2. Second V2 never actually got a HF transformers implementation sadly only a stale draft PR. This is an issue that shows the problem in action huggingface/transformers#34335 This is the stale attempt at V2 integration huggingface/transformers#31976

Well, ktransformers does support DeepSeek-V2, so it should just be an upgrade with the next-gen MoE router.

@Nottlespike

The ktransformer backend is an old commit of llama.cpp iirc
Edit: 6 months old https://github.com/ggerganov/llama.cpp/tree/a94e6ff8774b7c9f950d9545baf0ce35e8d1ed2f

Is it just me or is this not promising... I mean I'm patient but this means that transformers needs to support DeepSeek-V3 then llama.cpp needs to suppport DeepSeek-V3 then ktransformers needs to support llama.cpp's DeepSeek-V3 implementation....

I heard that it s not hard to support V3 on llama cpp due to been resemblance to v2

Sadly this is not the case in two ways. Firstly V3 from the technical report is far far more complex than V2. Second V2 never actually got a HF transformers implementation sadly only a stale draft PR. This is an issue that shows the problem in action huggingface/transformers#34335 This is the stale attempt at V2 integration huggingface/transformers#31976

well. kt do support DS2 so it should be just a upgrade with next gen moe router

What they do is use remote code. There is no native HF transformers implementation. Feel free to find it if you think I am incorrect. If people wish to use this for enterprise deployment, letting a model run what is functionally arbitrary code on your servers is a 100% no-go, as it is for my startup.

@mahald

mahald commented Dec 28, 2024

+1

1 similar comment
@lzumot

lzumot commented Dec 29, 2024

+1

@Azure-Tang
Contributor

Azure-Tang commented Dec 30, 2024

Supporting this seems easy; however, it requires approximately 400 GB of RAM even for Q4_K_M?

@Nottlespike

Supporting this seems easy; however, it requires approximately 400g of RAM for even q4km?

Does ktransformers not need transformers or llama.cpp support for a model? huggingface/transformers#35425 Back-of-the-napkin math says more like 5XX GB of VRAM/RAM would be needed given context.
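
As a rough cross-check of these figures (editor's arithmetic, not from the thread; the 4.8 bits/weight average for Q4_K_M is an approximation): DeepSeek-V3 has about 671B total parameters, which already puts the quantized weights alone near 400 GB, before KV cache, activations, and runtime buffers.

params = 671e9            # DeepSeek-V3 total parameter count
bits_per_weight = 4.8     # rough average for a Q4_K_M GGUF quant (approximation)
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB for the quantized weights alone")   # ~403 GB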

@Azure-Tang
Contributor

Supporting this seems easy; however, it requires approximately 400g of RAM for even q4km?

Does ktransformers not need transformers or llama.cpp support for a a model? huggingface/transformers#35425 Back of the napkin math is more like 5XXGB of VRAM/RAM would be needed given context.

Yes, we need transformers' modeling.py.

@16x3b

16x3b commented Dec 31, 2024

Just spitballing. Would it be possible to do some speculative decoding with a smaller model (dense or MoE) and then, on the larger MoE, just use NVMe for some of the experts? Why do we need all experts loaded into RAM at all times instead of selecting experts as necessary?

Again just throwing an idea out there from where my understanding is at. I'd like to understand better why this would not work.

@sorasoras
Author

Just spitballing. Would it be possible to do some speculative decoding with a smaller model (dense or moe) and then on the larger moe just use nvme for some of the experts? Why do we need all experts loaded into ram at all times instead of selecting experts as necessary?

Again just throwing an idea out there from where my understanding is at. I'd like to understand better why this would not work.

why would you need smaller model speculative decoding when you can do it via MTP?

@Nottlespike

Nottlespike commented Jan 2, 2025

@Azure-Tang
We are basically done with integrating DeepSeek-V3 into llama.cpp via a very very elegant solution from @fairydreaming
ggml-org/llama.cpp#10981 (comment)
Can you see if you can try what they did?

@Nottlespike

For those looking for an update on this: I've forked it and I think I know how to get this working, but no ETA.

@ELigoP

ELigoP commented Jan 9, 2025

Supporting this seems easy; however, it requires approximately 400g of RAM for even q4km?

I managed to fit the Q3_K_M quant within 96 GB VRAM + 256 GB RAM. Q4_K_M is out of my reach.

@hvico

hvico commented Jan 9, 2025

Supporting this seems easy; however, it requires approximately 400g of RAM for even q4km?

I manage to fit in Q3_K_M quant within 96GB VRAM + 256GB RAM. Q4_K_M is out of my reach.

Wow, great news. What would be the minimum specs to be able to run Q4_K_M with 96 GB VRAM? Would 512 GB of RAM plus that VRAM be enough? I have that much VRAM, but I would upgrade my RAM for that. Thanks!

@16x3b

16x3b commented Jan 12, 2025

why would you need smaller model speculative decoding when you can do it via MTP?

I don't see how MTP helps me. I'm suggesting speculative decoding because we can get faster inference from a smaller model and only refer to the larger model if confidence is low. No need to call the large model if the small one has a confident answer.

@sorasoras
Author

why would you need smaller model speculative decoding when you can do it via MTP?

I don't see how MTP helps me. I'm suggesting speculative decoding because we can get faster inference from a smaller model and only refer to the larger model if confidence is low. No need to call the large model if the small one has a confident answer.

MTP generates two tokens, and you use the second token as the speculative draft. This is called self-speculative decoding; no need for an extra model.
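
For readers who want the mechanics, here is a minimal greedy sketch of the idea (editor's illustration only; the model interface is hypothetical, not the ktransformers or DeepSeek API): the MTP head proposes a draft for the token after next, and the following forward pass both verifies that draft and produces the logits needed for the next step, so an accepted draft yields two tokens per pass.

import numpy as np

def self_speculative_decode(model, tokens, n_new):
    """Greedy self-speculative decoding with an MTP head (sketch only).

    Assumed interface: model(tokens) -> (main_logits, mtp_logits), each of shape
    [len(tokens), vocab_size]; at position i the main head predicts token i+1 and
    the MTP head predicts token i+2.
    """
    main, mtp = model(tokens)                    # initial pass over the prompt
    produced = 0
    while produced < n_new:                      # may overshoot n_new by one; fine for a sketch
        nxt = int(np.argmax(main[-1]))           # exact next token from the main head
        draft = int(np.argmax(mtp[-1]))          # MTP's guess for the token after it
        tokens = tokens + [nxt, draft]
        produced += 1
        main, mtp = model(tokens)                # one pass: verifies the draft and also
                                                 # yields the logits for the next round
        if int(np.argmax(main[-2])) == draft:    # main head at `nxt` agrees with the draft
            produced += 1                        # hit: two tokens from a single pass
        else:
            tokens[-1] = int(np.argmax(main[-2]))  # miss: keep the main head's own token...
            produced += 1
            main, mtp = model(tokens)              # ...and recompute, since the draft polluted the context
    return tokens

The point is that the "draft model" costs nothing extra here: the MTP head runs inside the same forward pass as the main head.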

@16x3b

16x3b commented Jan 14, 2025

MTP generate two token and you use second token as speculative decode. This is call self-speculative decode. no need for extra model

Understood. That's excellent! Thank you for explaining that to me. I see why there's no need to implement speculative decoding with a model that already has MTP implemented.

@sorasoras
Author

MTP generate two token and you use second token as speculative decode. This is call self-speculative decode. no need for extra model

Understood. That's excellent! Thank you for explaining that to me. I see why there's no need to implement speculative decoding with a model that already has MTP implemented.

If you read the paper, it mentions you can get a 90% hit rate with 2-token MTP and SD.
I guess we are going to get 4-token MTP in the future.
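
Rough throughput implied by that hit rate, under the simple accept/verify scheme sketched earlier in the thread (editor's arithmetic, not a figure from the paper):

p_accept = 0.9                                              # reported MTP draft hit rate
tokens_per_pass = 2 / (p_accept * 1 + (1 - p_accept) * 2)   # hit: 2 tokens / 1 pass; miss: 2 tokens / 2 passes
print(round(tokens_per_pass, 2))                            # ~1.82, i.e. roughly 1.8x plain one-token-per-pass decoding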

@bitnom

bitnom commented Jan 28, 2025

so hot right now

@whisper-bye

Dynamic 1.58-bit
https://unsloth.ai/blog/deepseekr1-dynamic

@ChandanVerma

Any updates on when we can see DeepSeek V3 or R1 in ktransformers? Really looking forward to it. It would be of great help to everyone.

Thanks

@Azure-Tang
Contributor

Any updates by when can we see deepseek v3 or r1 in ktransformers? Really looking forward to it. It would be of great help to everyone

Thanks

I'm actively tracking the DeepSeek V3/R1 integration. The HuggingFace Transformers team is finalizing support for DeepSeek V3 (see PR #35926), which ktransformers relies on.

I'm pausing my vacation to resolve any remaining compatibility issues. Expect a quick follow-up release for ktransformers once dependencies stabilize.

I’ll keep you updated here – appreciate your patience and support! 🙌

@Nondzu

Nondzu commented Jan 29, 2025

I can run DS V3 and R1 on my workstation with 768 GB of RAM. llama.cpp is tested and works fine with DeepSeek V3/R1 Q4_K_M and Q5 models.
If you need any tests or help with the implementation, please ping me.

@ChandanVerma

I can run DS V3 and R1 on my workstation 768g of RAM. llama.cpp tested and works fine with deepseek v3/r1 Q4_K_M and Q5 models. If you need any tests or help to implementation pls ping me.

Is it usable? How many TPS do you get? I am planning to run it on my local server with 512 GB RAM and a 6x3090 system.

@Nottlespike

@RodriMora what is your system specs, by the way? Mine is 4xRTX 3090 (only first two cards used with my command) and Threadripper 3970X with 256GB DDR4 RAM (4 ram channels, RAM read speed about 80GB/s).

Similar to yours but on Epyc cpu:

EPYC 7402 24c/48t 512GB RAM 3200MHz 4x3090s

I'm currently doing some performance tests, comparing it to llama.cpp at different context lengths. I've never used ktransformers, so I don't know how to manage how many GPUs to use? It seems to use 2/4 at the moment.

The relative path for the multi GPU .yaml is at ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml.

With the tutorial on injection being here: https://github.com/kvcache-ai/ktransformers/blob/feat-DeepSeekV3/doc/en/injection_tutorial.md

I've been stymied by the regex for offloading to more than 2 GPUs, with this piece being particularly confusing: (0|[1-9]|[12][0-9]). I assume this relates to the fact that the first 3 layers of the model are dense, but given the pipes I'm not sure how ktransformers handles the "or" regex here and still loads all 61 layers.

Either way it looks like the segfault issue is fixed so I'm rebuilding then testing.
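
A quick way to see what that alternation covers (the surrounding module-name pattern below is my assumption; only the (0|[1-9]|[12][0-9]) part is taken from the rule being discussed):

import re

# Which of the 61 layer indices the quoted alternation matches
pattern = re.compile(r"^model\.layers\.(0|[1-9]|[12][0-9])\.")   # assumed HF-style module names
matched = [i for i in range(61) if pattern.match(f"model.layers.{i}.")]
print(matched)   # 0..29 -- roughly the first half of the 61 layers, which reads like a
                 # two-GPU half-split rather than anything to do with the 3 dense layers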

@Nottlespike

@RodriMora @ELigoP

So it seems NCCL speeds matter a LOT for TPS. I'm getting WAY faster speeds using 2 GPUs with NVLink than vanilla llama.cpp with the same 4x3090/3090 Ti setup with 14 layers offloaded. For me, cuda:0 and cuda:2 are NVLinked together, so when I replaced all cuda:1 instances with cuda:2 in ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml I saw a massive speedup to ~10 tps, which is 2x vanilla llama.cpp, with --max_new_tokens 8192 --total_context 8192 and the same context on vanilla llama.cpp.

Now to write the 4-GPU config! Any assistance with adapting ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu-4.yaml to DeepSeek-V3/R1 would be appreciated @Azure-Tang! I've taken a VERY short look at the paper, and it seems the main differences are 1 dense layer in V2 vs. 3 in V3, and 60 layers in V2 vs. 61 in V3, yet my regex for layer assignment seems to break despite my best efforts and reading the injection tutorial multiple times.

@Azure-Tang
Contributor

curl 'http://0.0.0.0:10002/v1/chat/completions' \
    -X POST \
    -H "Content-Type: application/json" \
    --data-raw '{"model":"deepseek-ai/DeepSeek-V3","messages":[{"role":"system","content":"You are an AI coding assistant. You explain as minimum as possible."},{"role":"user","content":"Write numbers from 1 to 10, each on new line, no coding."}],"stream":false}'

Did you change the 0.0.0.0 for privacy? you should use the actual IP of the server, it's working for me.

curl 'http://192.168.10.2:5000/v1/chat/completions' \
    -X POST \
    -H "Content-Type: application/json" \
    --data-raw '{"model":"deepseek-ai/DeepSeek-V3","messages":[{"role":"system","content":"You are an AI coding assistant. You explain as minimum as possible."},{"role":"user","content":"Write numbers from 1 to 10, each on new line, no coding."}],"stream":false}'

{"id":"7ff95288-d334-431f-9d94-867f372bd497","object":"chat.completion.chunk","created":1738864152,"model":"not implmented","system_fingerprint":"not implmented","usage":null,"choices":[{"index":0,"message":{"content":"<think>\nOkay, so the user wants me to write numbers from 1 to 10, each on a new line, and they specified no coding. Let me make sure I understand. They mentioned \"You are an AI coding assistant\" but then said \"no coding,\" so maybe they want the output to be simple text without any code formatting like in a code block. Got it.\n\nSo, I need to list numbers 1 through 10, each on separate lines. Let me check if there's any other requirement. The user wrote \"explain as minimum as possible,\" which probably means not adding any extra explanations or comments. Just the numbers. Alright.\n\nI think that's all. I'll write each number from 1 to 10, each on a new line. Let me count to confirm: 10 numbers, each separated. Yes, that should do it.\n</think>\n\n1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10","role":"assistant","name":null},"logprobs":null,"finish_reason":null}]}

Thanks! My bad, I didn't pull latest change. Now server responds. Still performance is not as expected (comparable and even worse than llama.cpp, which as I understand offloads only first layers to GPU):

2025-02-06 20:45:21,438 INFO /home/ai/3rdparty/KTransformers-feat-DeepSeekV3/venv/lib/python3.11/site-packages/ktransformers/server/backend/base.py[64]: Performance(T/s): prefill 1.9965464671950348, decode 1.9084481659736685. Time(s): tokenize 0.04080605506896973, prefill 13.523351669311523, decode 259.89702463150024

llama.cpp gets 3.5 tps prompt processing (I guess this is prefill) and 2.5 tps token generation (which I guess is decode).

I will play with settings more.

@RodriMora what is your system specs, by the way? Mine is 4xRTX 3090 (only first two cards used with my command) and Threadripper 3970X with 256GB DDR4 RAM (4 ram channels, RAM read speed about 80GB/s).

Try adjusting cpu_infer according to your CPU core count~

@Azure-Tang
Contributor

I have successfully deployed DeepSeek-V3 on the KTransformers framework, but there are still some accuracy issues that I am currently investigating.
Based on preliminary tests, using the Q4KM weight file requires at least 16GB of VRAM and 480GB of RAM. On a positive note, the generation speed can reach approximately 3.8 tokens/s.”

Thats wonderful.. Really appreciate the hard work. Just a small question if i inject more layers to the gpu since i have 144 GB of VRAM, will the RAM consumption go down and will i be able to allocate more context?

Is it possible to run the DeepSeek R1 in just 4 A100 GPU (roughly 320 GB VRAM) using accelerate with offloading to disk? While this will be really really slow, but for some analysis purpose is it possible to run with this setup?

It's possible. Loading process will be slow in this way, but the inference speed can be greatly accelerated. You can refer this:

I have successfully deployed DeepSeek-V3 on the KTransformers framework, but there are still some accuracy issues that I am currently investigating.
Based on preliminary tests, using the Q4KM weight file requires at least 16GB of VRAM and 480GB of RAM. On a positive note, the generation speed can reach approximately 3.8 tokens/s.”

Thats wonderful.. Really appreciate the hard work. Just a small question if i inject more layers to the gpu since i have 144 GB of VRAM, will the RAM consumption go down and will i be able to allocate more context?

Yes, theoretically you can use the Marlin expert as KExperts' backend (but we haven't tested this op, it may cause some bugs orz, and its loading will be veeeeeeery slow).

I will check our Marlin expert op once we're done supporting DeepSeek-V3.

Hi,
I’ve updated the Marlin backend and added a small example YAML in ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin.yaml.
To fully utilize your VRAM, try injecting more KExpertsMarlin as the KTransformersExperts backend.

Please note:
• If you use KExpertsMarlin, you need to set --use_cuda_graph False. (This will be added to the documentation later.)
• Loading speed will be significantly slower as you add more KExpertsMarlin.

@Azure-Tang
Contributor

@RodriMora @ELigoP

So it seems NCCL speeds matter a LOT for TPS. I'm getting WAY faster speeds using 2 GPU's with NVLink than vanilla llama.cpp with the same 4x3090/3090 Ti setup with 14 layers offloaded. For me cuda:0 and cuda:2 are NVLinked together so when I replaced all cuda:1 instances with cuda:2 in ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml I saw a massive speedup of ~10tps which is 2x vanilla llama.cpp with --max_new_tokens 8192 --total_context 8192 and same context on vanilla llama.cpp.

Now to write the 4 GPU config! Any assitance with adapting the ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu-4.yaml to DeepSeek-V3/R1 would be appreciated @Azure-Tang! I've taken a VERY short look at the paper and it seems that the main difference here is that there is 1 dense layer in V2 vs the 3 in V3 and 60 layers in V2 vs 61 layers in V3 yet my regex for layer assignment seems to break despite my best efforts and reading the injection tutorial multiple times.

You can copy the v2 4-GPU YAML and make slight modifications. The changes needed can be identified by comparing the differences between DeepSeek-V2-Chat-multi-gpu and DeepSeek-V3-Chat-multi-gpu.
I’m sure this will be a piece of cake for you! >.-

@Nottlespike

@RodriMora @ELigoP
So it seems NCCL speeds matter a LOT for TPS. I'm getting WAY faster speeds using 2 GPU's with NVLink than vanilla llama.cpp with the same 4x3090/3090 Ti setup with 14 layers offloaded. For me cuda:0 and cuda:2 are NVLinked together so when I replaced all cuda:1 instances with cuda:2 in ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml I saw a massive speedup of ~10tps which is 2x vanilla llama.cpp with --max_new_tokens 8192 --total_context 8192 and same context on vanilla llama.cpp.
Now to write the 4 GPU config! Any assitance with adapting the ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat-multi-gpu-4.yaml to DeepSeek-V3/R1 would be appreciated @Azure-Tang! I've taken a VERY short look at the paper and it seems that the main difference here is that there is 1 dense layer in V2 vs the 3 in V3 and 60 layers in V2 vs 61 layers in V3 yet my regex for layer assignment seems to break despite my best efforts and reading the injection tutorial multiple times.

You can copy the v2 4-GPU YAML and make slight modifications. The changes needed can be identified by comparing the differences between DeepSeek-V2-Chat-multi-gpu and DeepSeek-V3-Chat-multi-gpu. I’m sure this will be a piece of cake for you! >.-

Oh dear.

@huliangbing

Waiting for the results of 4 GPUs ....

@RodriMora
Contributor

RodriMora commented Feb 7, 2025

I did some test with my system:

Command to test:
ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL --total_context 1024 --max_new_tokens 512 --port 5000 --host 0.0.0.0 --cpu_infer 24 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin.yaml

|   | Default | --cpu_infer 22 | --cpu_infer 24 | --cpu_infer 46 | --cpu_infer 48 | --cpu_infer 24 + v3-2Gpu.yaml | --cpu_infer 24 + v3-4Gpu.yaml |
|---|---|---|---|---|---|---|---|
| Prefill (t/s) | 12.32 | 23.87 | 25.11 | 24.02 | 23.99 | 25.12 | 25.07 |
| Decode (t/s) | 4.38 | 5.12 | 5.16 | 5.07 | 3.66 | 5.13 | 5.16 |
| Time prefill (s) | 247.32 | 127.59 | 121.29 | 126.82 | 126.94 | 121.24 | 121.48 |
| Time decode (s) | 116.54 | 99.88 | 99.03 | 66.31 | 139.74 | 99.52 | 99.10 |
| Tokens prefill | 3046 | 3046 | 3046 | 3046 | 3046 | 3046 | 3046 |
| Tokens decode | 511 | 511 | 511 | 336 | 511 | 511 | 511 |

For the 2-GPU optimize rules I used this (I would say same results as without any optimization rules; it seems to use 2 GPUs anyway by default, but not fully, less than 50% on each card):
https://github.com/kvcache-ai/ktransformers/blob/feat-DeepSeekV3/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin.yaml

For the 4-GPU optimize rules I made this one; it also uses low VRAM, only 4-6 GB per card:
https://github.com/RodriMora/ktransformers/blob/main/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu-marlin-4.yaml

Note: --use_cuda_graph False doesn't seem to be accepted as an arg when loading the OpenAI API server.

Edit: Comparison vs. llama.cpp:
Same prompt and tokens. llama.cpp with -ngl 20, with the GPUs full at 23/24 GB VRAM. ~40% increase with ktransformers.

prompt eval time =   78999.88 ms /  3046 tokens (   25.94 ms per token,    38.56 tokens per second)
       eval time =  161916.39 ms /   512 tokens (  316.24 ms per token,     3.16 tokens per second)
      total time =  240916.27 ms /  3558 tokens
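
Reading the figures above together (editor's arithmetic on the quoted numbers only, nothing new measured): the ~40% figure matches the "Default" decode column, the tuned configs gain a bit more on decode, while prefill in this setup trails llama.cpp.

llama_prefill_tps, llama_decode_tps = 38.56, 3.16       # llama.cpp prompt eval / eval rates above
kt_default_decode = 4.38                                # ktransformers "Default" column, decode t/s
kt_best_decode, kt_best_prefill = 5.16, 25.12           # best ktransformers columns
print(round(kt_default_decode / llama_decode_tps, 2))   # ~1.39 -> the "40% increase" on decode
print(round(kt_best_decode / llama_decode_tps, 2))      # ~1.63 with tuned cpu_infer / multi-GPU yaml
print(round(kt_best_prefill / llama_prefill_tps, 2))    # ~0.65 -> prefill here is slower than llama.cpp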

@KMSorSMS
Contributor

KMSorSMS commented Feb 8, 2025


This testing result is consistent with our own tests, which will come soon. Our performance video shows faster prefill and decode compared with llama.cpp. It seems your CPU has about 24 cores on a single NUMA node, and the multi-GPU setup also costs some performance (our multi-GPU support uses pipeline parallelism, not tensor parallelism, so putting everything on a single GPU is best). You can try modifying the DeepSeek-V3 .yaml to inject more Marlin experts to fully utilize the GPU, and use numactl -N 1 -m 1 to bind inference to the same NUMA node. And if you want to see more, you can set the --cache_lens command arg to 1536, but it will constrain your input and output lengths, so be cautious. Our detailed test will come with a stable new release. Coming soon~ 😃

@Azure-Tang
Contributor

Hi! We’ve merged the PR supporting Deepseek-V3/R1 into the main branch, and we’ve also introduced additional optimizations for even better performance. For more details, please check our README and tutorial. Thank you for your support—if you find this helpful, we’d really appreciate any recommendations you share with your friends or community!

@davidsyoung

Thank you for all of your work on this! If you could create some theoretical optimised YAML versions for more GPUs, I could test them here on my side. I have 10x 3090s, soon to be 12, with a 7713 EPYC and 256 GB of 3200 MHz RAM.

@RodriMora
Contributor

What does "--cache_lens 1536" do? I went from 5.2 t/s decode to 9 t/s using the current main branch compiled (so I think 0.2, not 0.3). That's an insane increase in speed. Is there any compromise?

For the rest, using either:
DeepSeek-V3-Chat.yaml
DeepSeek-V3-Chat-multi-gpu.yaml
DeepSeek-V3-Chat-multi-gpu-marlin.yaml

yields the same results. I think using the single-GPU DeepSeek-V3-Chat.yaml for longer context gives OOM errors; the rest work fine but with similar performance: 24-25 prefill and 5 decode.

I'll try 0.3 next

@RodriMora
Contributor

I installed 0.3.

A few notes: on my Ubuntu 22.04 installation I needed to add:
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install --only-upgrade libstdc++6

Otherwise it gives this error:
ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/KTransformersOps.cpython-311-x86_64-linux-gnu.so)

But I guess my CPU is not compatible? EPYC 7402

(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]> ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL/ --max_new_tokens 512 --length 512 --total_context 512 --port 5000 --host 0.0.0.0 --cpu_infer 24
2025-02-10 11:14:11,145 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/main.py[29]: Creating SQL tables
2025-02-10 11:14:11,147 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
__AVX512F__
fish: Job 1, 'ktransformers --model_path deep…' terminated by signal SIGILL (Illegal instruction)
(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]>

@KMSorSMS
Contributor

KMSorSMS commented Feb 10, 2025

What does the "--cache_lens 1536" do? I got from 5.2t/s decode to 9t/s, using the current main branch compiled (so I think 0.2, not 0.3). That's an insane increase of speed. Is there any compromise?

For the rest, using either: DeepSeek-V3-Chat.yaml DeepSeek-V3-Chat-multi-gpu.yaml DeepSeek-V3-Chat-multi-gpu-marlin.yaml

yields the same results. I think using the single gpu DeepSeek-V3-Chat.yaml for longer context give OOM errors, the rest works fine but with similar performance. 24-25 prefill and 5 decode.

I'll try 0.3 next

TL;DR

The --cache_lens arg has no effect there anymore.
--cache_lens used to define a static area for computing attention even when the prompt was nowhere near that large, so in older versions, setting it closer to the actual prompt length made things faster (the default was too large). In the new version we set it to the prompt length plus your max_new_tokens. (But it can't support history context for now.)


Some further detail:
Actually, if you use the current main branch (which is our release v0.2), we got rid of the cache_lens arg in local_chat. (But it remains in the server backend, so we keep it in the tutorial command.) The bigger increase more likely comes from our dual-socket support (make sure you first set the env var USE_NUMA: export USE_NUMA=1; see the link for details). Tip: you may need to use apt to install libnuma. If you don't use this, you can also check our best single-socket test (link).

@chenht2022
Contributor

I installed 0.3.

Few notes, in my Ubuntu 22.04 installation was needed to add the: sudo add-apt-repository ppa:ubuntu-toolchain-r/test sudo apt-get update sudo apt-get install --only-upgrade libstdc++6

Othewise it gives this error: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.32' not found (required by /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/KTransformersOps.cpython-311-x86_64-linux-gnu.so)`

But I guess my CPU is not compatible? EPYC 7402

(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]> ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL/ --max_new_tokens 512 --length 512 --total_context 512 --port 5000 --host 0.0.0.0 --cpu_infer 24
2025-02-10 11:14:11,145 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/main.py[29]: Creating SQL tables
2025-02-10 11:14:11,147 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
__AVX512F__
fish: Job 1, 'ktransformers --model_path deep…' terminated by signal SIGILL (Illegal instruction)
(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]>

AMD CPUs like EPYC do not support AMX instructions; our v0.3 currently only supports Intel Xeon CPUs.

@chenht2022 chenht2022 reopened this Feb 10, 2025
@RodriMora
Contributor

What does the "--cache_lens 1536" do? I got from 5.2t/s decode to 9t/s, using the current main branch compiled (so I think 0.2, not 0.3). That's an insane increase of speed. Is there any compromise?
For the rest, using either: DeepSeek-V3-Chat.yaml DeepSeek-V3-Chat-multi-gpu.yaml DeepSeek-V3-Chat-multi-gpu-marlin.yaml
yields the same results. I think using the single gpu DeepSeek-V3-Chat.yaml for longer context give OOM errors, the rest works fine but with similar performance. 24-25 prefill and 5 decode.
I'll try 0.3 next

TL;DR

The --cache_lens has no use. It's because --cache_lens used to be an arg which indicates a static area to compute attention even if there is no such a huge prompt to calculate. So in the older version, we set it smaller to our actual prompt will make it faster (as the default is too large). Now the new version we set this to the prompt length and your max_nex_token size. (But it can't support history context now).

Some further intro: Actually, If you use the current main branch( which is our release V0.2), we get rid of the arg cache-lens in local_chat. (But it remains in server backend, so we just keep this in the tutorial command ). The more insane increase can come from our support for dual-socket support ( make sure you first set env var USE_NUMA export USE_NUMA=1, see the link in detail). Tips: you may need to use apt to install the libnuma. If you don't use this, you can also check our best test on a single socket. link.

numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 515634 MB
node 0 free: 14832 MB
node distances:
node   0
  0:  10

My server only has 1 socket and 1 NUMA node, so it doesn't get any benefit from numactl -N 1 -m 1 (or numactl -N 0 -m 0 in my case).
The --cache_lens 1536 though gives an 80% increase in speed.

@KMSorSMS
Contributor

What does the "--cache_lens 1536" do? I got from 5.2t/s decode to 9t/s, using the current main branch compiled (so I think 0.2, not 0.3). That's an insane increase of speed. Is there any compromise?
For the rest, using either: DeepSeek-V3-Chat.yaml DeepSeek-V3-Chat-multi-gpu.yaml DeepSeek-V3-Chat-multi-gpu-marlin.yaml
yields the same results. I think using the single gpu DeepSeek-V3-Chat.yaml for longer context give OOM errors, the rest works fine but with similar performance. 24-25 prefill and 5 decode.
I'll try 0.3 next

TL;DR
The --cache_lens has no use. It's because --cache_lens used to be an arg which indicates a static area to compute attention even if there is no such a huge prompt to calculate. So in the older version, we set it smaller to our actual prompt will make it faster (as the default is too large). Now the new version we set this to the prompt length and your max_nex_token size. (But it can't support history context now).
Some further intro: Actually, If you use the current main branch( which is our release V0.2), we get rid of the arg cache-lens in local_chat. (But it remains in server backend, so we just keep this in the tutorial command ). The more insane increase can come from our support for dual-socket support ( make sure you first set env var USE_NUMA export USE_NUMA=1, see the link in detail). Tips: you may need to use apt to install the libnuma. If you don't use this, you can also check our best test on a single socket. link.

numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 515634 MB
node 0 free: 14832 MB
node distances:
node   0
  0:  10

My server only has 1 socket and 1 numa, so it doesn't get any benefit from numactl -N 1 -m 1 (or numactl -N 0 -m 0 in my case). The --cache_lens 1536 though gives a 80% increase in speed

Have you tested again without the --cache_lens arg? It has no effect now in a local_chat run.

@RodriMora
Contributor

Have you test again without the --cache_lens args? As it has no use now in local_chat run.

Yes. Got the same performance in local_chat. The difference is in the OpenAI API server.

With --cache_lens
prompt eval count: 16 token(s)
prompt eval duration: 1.1299359798431396s
prompt eval rate: 14.160094275625383 tokens/s
eval count: 1000 token(s)
eval duration: 98.3808274269104s
eval rate: 10.164582125952592 tokens/s

Without: --cache_lens
prompt eval count: 16 token(s)
prompt eval duration: 1.133697271347046s
prompt eval rate: 14.11311503024876 tokens/s
eval count: 1000 token(s)
eval duration: 98.45745468139648s
eval rate: 10.156671257000815 tokens/s

@KMSorSMS
Contributor

Have you test again without the --cache_lens args? As it has no use now in local_chat run.

Yes. Got the same performance on local_chat. The difference is in the openai api server.

With --cache_lens prompt eval count: 16 token(s) prompt eval duration: 1.1299359798431396s prompt eval rate: 14.160094275625383 tokens/s eval count: 1000 token(s) eval duration: 98.3808274269104s eval rate: 10.164582125952592 tokens/s

Without: --cache_lens prompt eval count: 16 token(s) prompt eval duration: 1.133697271347046s prompt eval rate: 14.11311503024876 tokens/s eval count: 1000 token(s) eval duration: 98.45745468139648s eval rate: 10.156671257000815 tokens/s

Good! This is exactly what we have tested and posted in the report (our result is 8.73 tokens/s in decoding, as our single socket only gets 32 cores). The cache_lens arg has no effect in the current version of our local_chat.py.

@ArYuZzz

ArYuZzz commented Feb 11, 2025

I installed 0.3.
Few notes, in my Ubuntu 22.04 installation was needed to add the: sudo add-apt-repository ppa:ubuntu-toolchain-r/test sudo apt-get update sudo apt-get install --only-upgrade libstdc++6
Othewise it gives this error: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.32' not found (required by /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/KTransformersOps.cpython-311-x86_64-linux-gnu.so)`
But I guess my CPU is not compatible? EPYC 7402

(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]> ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL/ --max_new_tokens 512 --length 512 --total_context 512 --port 5000 --host 0.0.0.0 --cpu_infer 24
2025-02-10 11:14:11,145 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/main.py[29]: Creating SQL tables
2025-02-10 11:14:11,147 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
__AVX512F__
fish: Job 1, 'ktransformers --model_path deep…' terminated by signal SIGILL (Illegal instruction)
(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]>

AMD CPUs like EPYC does not support AMX instruction, our V0.3 currently only supports Intel Xeon CPUs.

Does v0.3 support earlier Xeon CPUs like Gen 3 or Gen 2, which only support AVX-512?
And if not, does v0.2 support AVX-512, and can it also accelerate DeepSeek inference?

@KMSorSMS
Contributor

I installed 0.3.
Few notes, in my Ubuntu 22.04 installation was needed to add the: sudo add-apt-repository ppa:ubuntu-toolchain-r/test sudo apt-get update sudo apt-get install --only-upgrade libstdc++6
Othewise it gives this error: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.32' not found (required by /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/KTransformersOps.cpython-311-x86_64-linux-gnu.so)`
But I guess my CPU is not compatible? EPYC 7402

(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]> ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL/ --max_new_tokens 512 --length 512 --total_context 512 --port 5000 --host 0.0.0.0 --cpu_infer 24
2025-02-10 11:14:11,145 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/main.py[29]: Creating SQL tables
2025-02-10 11:14:11,147 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
__AVX512F__
fish: Job 1, 'ktransformers --model_path deep…' terminated by signal SIGILL (Illegal instruction)
(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]>

AMD CPUs like EPYC does not support AMX instruction, our V0.3 currently only supports Intel Xeon CPUs.

Does v0.3 supports earlier Xeon CPUs like gen3 or gen2 which only support AVX-512? And if not, does v0.2 support AVX-512 and can also accelerate the inference of deepseek?

The preview version of v0.3 doesn't support earlier Xeon CPUs. But v0.2 can, as it doesn't contain the AMX instruction acceleration implementation. So you can try v0.2 to accelerate DeepSeek inference.

@chenht2022
Contributor

chenht2022 commented Feb 11, 2025

AMX instructions are not supported on Xeon Gen 2 and Gen 3. You can check the supported instruction sets with the lscpu command; if AMX instructions are supported, there will be flags such as amx_bf16, amx_tile, amx_int8.
v0.2 will automatically use the AVX-512 instruction set on hardware that supports it.
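
A small way to script that check on Linux (just grepping /proc/cpuinfo for the flags named above; nothing ktransformers-specific):

# Check for the AMX CPU flags mentioned above (Linux only)
with open("/proc/cpuinfo") as f:
    cpuinfo = f.read()
print("AMX supported:", any(flag in cpuinfo for flag in ("amx_bf16", "amx_tile", "amx_int8")))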

@WuNein

WuNein commented Feb 11, 2025

Must I download the BF16 GGUF version to use v0.3?

https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-BF16

It is giant!!

@chenht2022
Contributor

must I download the BF16 version gguf to use V0.3?

https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-BF16

It is giant!!

Yes, for the v0.3 preview version. We will consider optimizing this in the official release, maybe supporting online dequantization.

@ArYuZzz

ArYuZzz commented Feb 11, 2025

I installed 0.3.
Few notes, in my Ubuntu 22.04 installation was needed to add the: sudo add-apt-repository ppa:ubuntu-toolchain-r/test sudo apt-get update sudo apt-get install --only-upgrade libstdc++6
Othewise it gives this error: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.32' not found (required by /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/KTransformersOps.cpython-311-x86_64-linux-gnu.so)`
But I guess my CPU is not compatible? EPYC 7402

(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]> ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL/ --max_new_tokens 512 --length 512 --total_context 512 --port 5000 --host 0.0.0.0 --cpu_infer 24
2025-02-10 11:14:11,145 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/main.py[29]: Creating SQL tables
2025-02-10 11:14:11,147 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
__AVX512F__
fish: Job 1, 'ktransformers --model_path deep…' terminated by signal SIGILL (Illegal instruction)
(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]>

AMD CPUs like EPYC does not support AMX instruction, our V0.3 currently only supports Intel Xeon CPUs.

Does v0.3 supports earlier Xeon CPUs like gen3 or gen2 which only support AVX-512? And if not, does v0.2 support AVX-512 and can also accelerate the inference of deepseek?

The preview version of V0.3 doesn't support earlier Xeon CPU. But V0.2 can support as it doesn't contain AMX instruction acceleration implementation. So you can try V0.2 to accelerate the inference of deepseek

Thanks for the reply and the contribution. Will the upcoming v0.3 support AVX-512?

@KMSorSMS
Copy link
Contributor

I installed 0.3.
Few notes, in my Ubuntu 22.04 installation was needed to add the: sudo add-apt-repository ppa:ubuntu-toolchain-r/test sudo apt-get update sudo apt-get install --only-upgrade libstdc++6
Othewise it gives this error: ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version GLIBCXX_3.4.32' not found (required by /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/KTransformersOps.cpython-311-x86_64-linux-gnu.so)`
But I guess my CPU is not compatible? EPYC 7402

(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]> ktransformers --model_path deepseek-ai/DeepSeek-R1 --gguf_path /mnt/llms/models/DeepSeek-R1-UD-Q2_K_XL/ --max_new_tokens 512 --length 512 --total_context 512 --port 5000 --host 0.0.0.0 --cpu_infer 24
2025-02-10 11:14:11,145 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/main.py[29]: Creating SQL tables
2025-02-10 11:14:11,147 INFO /home/ubuntuai/ktransformers/.venv/lib/python3.11/site-packages/ktransformers/server/api/openai/assistants/assistants.py[75]: Creating default assistant
__AVX512F__
fish: Job 1, 'ktransformers --model_path deep…' terminated by signal SIGILL (Illegal instruction)
(.venv) ubuntuai@ubuntuai ~/ktransformers (main) [0|SIGILL]>

AMD CPUs like EPYC does not support AMX instruction, our V0.3 currently only supports Intel Xeon CPUs.

Does v0.3 supports earlier Xeon CPUs like gen3 or gen2 which only support AVX-512? And if not, does v0.2 support AVX-512 and can also accelerate the inference of deepseek?

The preview version of V0.3 doesn't support earlier Xeon CPU. But V0.2 can support as it doesn't contain AMX instruction acceleration implementation. So you can try V0.2 to accelerate the inference of deepseek

Thanks for reply and contribution. Will the upcoming v0.3 supports AVX512?

We will consider it, but most likely it will not, as v0.3 mostly focuses on AMX optimization.
