[Kernel] Initial commit containing new Triton kernels for multi lora serving. #5025
Conversation
@Yard1 Any idea why the import fails? It works locally, but the kernel is in a new folder, so maybe that path has to be added somewhere?
@FurtherAI add …
@Yard1 Is there a way to rerun the tests without an empty commit?
@FurtherAI Making an empty commit with …
@FurtherAI Does this allow for larger vocabulary sizes? For example, NeMo-12B has a vocab size of 131072.
@tensimixt Yeah, I think it does. It shouldn't have any issues with different sizes. I'll test it at some point.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!
SGMV Triton Kernels
New Triton kernels for multi-LoRA computation. These (should) handle any shape and data type, apply LoRAs stored in a paged format, compute at the actual LoRA rank, and speed up grouped LoRA requests (especially prefill).
The PR contains the kernels, tests for the kernels, and benchmarks. A follow-up PR will add the ability to use these kernels in vLLM.
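To make the intended semantics concrete, here is a minimal NumPy reference (not the Triton kernels from this PR) of an SGMV-style grouped LoRA apply: tokens are grouped into contiguous segments that share a LoRA, and each segment is shrunk to that LoRA's actual rank before being expanded back out. All names and shapes below are illustrative assumptions, not the PR's actual API.

```python
import numpy as np

def sgmv_reference(x, lora_A, lora_B, seg_starts, lora_ids, ranks):
    """Reference semantics of a grouped (SGMV-style) LoRA apply.

    x:          (total_tokens, hidden) input activations.
    lora_A[l]:  (hidden, r_l) shrink weights for LoRA l.
    lora_B[l]:  (r_l, out) expand weights for LoRA l.
    seg_starts: segment boundaries; tokens in rows
                seg_starts[s]:seg_starts[s+1] all use lora_ids[s].
    ranks[l]:   actual rank of LoRA l (computation stops at this rank).
    """
    out_dim = lora_B[0].shape[1]
    y = np.zeros((x.shape[0], out_dim), dtype=x.dtype)
    for s in range(len(lora_ids)):
        lo, hi = seg_starts[s], seg_starts[s + 1]
        l = lora_ids[s]
        r = ranks[l]
        # Shrink to the actual LoRA rank, then expand back to out_dim.
        v = x[lo:hi] @ lora_A[l][:, :r]
        y[lo:hi] = v @ lora_B[l][:r, :]
    return y
```

A real Triton implementation would tile these per-segment matmuls and read the LoRA weights from paged storage, but the per-token result should match this loop.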
ping @Yard1