Add LLaVA Support #484
Conversation
…llava continue working
Code Metrics Report
===============================================================================
 Language             Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile               1           34           25            0            9
 Happy                    1          442          369            0           73
 JSON                    10           66           65            0            1
 Python                  36         1412         1220           38          154
 TOML                    18          518          458           11           49
-------------------------------------------------------------------------------
 Jupyter Notebooks        1            0            0            0            0
 |- Markdown              1           60           30           22            8
 |- Python                1           96           87            1            8
 (Total)                             156          117           23           16
-------------------------------------------------------------------------------
 Markdown                21         1554            0         1173          381
 |- BASH                  5          101           98            0            3
 |- JSON                  1           12           12            0            0
 |- Python                5           98           88            0           10
 |- Rust                  4          221          198           10           13
 |- TOML                  2           75           63            0           12
 (Total)                            2061          459         1183          419
-------------------------------------------------------------------------------
 Rust                   132        44214        40090          775         3349
 |- Markdown             77          723           13          671           39
 (Total)                           44937        40103         1446         3388
===============================================================================
 Total                  221        48240        42227         1997         4016
===============================================================================
Hi @chenwanqq! Please let me know when this is ready for review.
Hhh, I think it's ready. Some of the docs and examples may be a little rushed, but I think we can review it 🤣
The addition of AnyMoE (which modifies the dependencies of NormalModel) affects the LLaVA LLM backend. However, the current solution based on copied code is only a temporary one, so I have removed the NormalModel trait from LLaVALLM altogether. Once the fused RoPE is fixed, we should change LLaVA to rely on an ordinary model as soon as possible. Edit: I added MoE support code for LLaVA, but never really tested it. From what I see in Phi3V, it seems AnyMoE only affects the LLM part and has nothing to do with the image embedding?
And LLaVANext is just LLaVA 1.5 plus a multiple-resolution trick (anyres), so they share some infrastructure. If this causes any confusion, we can modify it.
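For illustration, here is a minimal sketch of what the decoupled text backend could look like once it no longer implements NormalModel. The trait name and signatures below are hypothetical, not the actual mistral.rs API:

```rust
// Illustrative only: names and signatures are hypothetical, not the real mistral.rs traits.
use candle_core::{Result, Tensor};

pub trait LlavaTextBackend {
    /// Embed raw token ids so the vision side can splice image features into the sequence.
    fn embed_tokens(&self, input_ids: &Tensor) -> Result<Tensor>;
    /// Decode from already-merged (text + image) embeddings instead of token ids.
    fn forward_embeds(&mut self, input_embeds: &Tensor, seqlen_offset: usize) -> Result<Tensor>;
}
```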
Hi @chenwanqq! This looks amazing, thanks for adding it.
I think there are some inconsistencies (probably mistakes?) with the image tokens in the examples: some look like they are from Idefics (no image token) and some like they are from Phi 3 (have <|image_1|>), but according to the HF model card, it expects <image>. This also extends to the regexes in the inputs processor. I have commented where the changes should be made, can you please update it?
mistralrs-core/src/vision_models/llava/llava_next_inputs_processor.rs
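As a hedged illustration of the expected token, here is a standalone sketch (not the actual inputs-processor code) that matches <image>, as the HF model card specifies, rather than Phi-3 style <|image_1|> tags:

```rust
// Standalone sketch, not the real llava_next_inputs_processor.rs code.
use regex::Regex; // assumes the `regex` crate is available

/// Count `<image>` placeholders in a prompt so they can be matched against the supplied images.
fn count_image_placeholders(prompt: &str) -> usize {
    let re = Regex::new(r"<image>").expect("static pattern is valid");
    re.find_iter(prompt).count()
}

fn main() {
    assert_eq!(count_image_placeholders("USER: <image>\nWhat is shown? ASSISTANT:"), 1);
}
```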
I see. I think we should use the Llava
That makes sense. Given that we support multiple images, if we could document that it would be great! 😃
I've fixed the image tag problem and the AnyMoE problem accordingly. Maybe we can have another review?
Hi @chenwanqq! I made a few small edits to the model support matrix and the model architecture docs. This looks good, but it crashes when I run:
…
And
…
With
…
It seems to come from the CLIP embeddings. Note that for ISQ models, the VBs are on the CPU and are cast using … I would recommend that you check out the Idefics 2 model, as I do … there. When I run without ISQ, it successfully produces the correct output though - great work here!
That's exactly the problem! With the new commit, I believe I fixed it (tested on llava-v1.6-vicuna-7b-hf, llava-v1.6-mistral-7b-hf, and llava-1.5-7b-hf). However, when I deal with Llama-like models (such as Vicuna), there is a device mismatch problem with rmsnorm:
When I checked the code, I found that the Llama and Mistral code have slight differences in initialization:
mistral.rs/mistralrs-core/src/models/llama.rs (lines 291 to 299 in c8b71e9)
mistral.rs/mistralrs-core/src/models/mistral.rs (lines 288 to 296 in c8b71e9)
I changed it to the Mistral way in my copied version of the backend Llama LLM. I don't have the Llama weights right now (and it's not very easy for me to download them), so maybe you can check whether the Llama text model has this issue.
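To illustrate the initialization point, here is a hedged sketch with hypothetical helper names (not the exact llama.rs or mistral.rs code): the norm weight should be created on the device the layer is mapped to, so rms_norm does not end up with a CPU weight against GPU hidden states.

```rust
// Hedged sketch; `load_rms_norm` and `layer_device` are illustrative names.
use candle_core::{Device, Result};
use candle_nn::{RmsNorm, VarBuilder};

/// Build an RmsNorm whose weight is guaranteed to live on the layer's mapped device.
fn load_rms_norm(
    vb: &VarBuilder,
    name: &str,
    hidden_size: usize,
    eps: f64,
    layer_device: &Device,
) -> Result<RmsNorm> {
    // Move the weight to the per-layer device so it matches the hidden states there.
    let weight = vb.get(hidden_size, name)?.to_device(layer_device)?;
    Ok(RmsNorm::new(weight, eps))
}
```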
Looks good! Thank you for adding this. I'll need to check the RoPE instability in a future PR but I think the current solution + LlavaModel is good enough.
Introduction
This implementation is based on my work for candle. However, it incorporates some notable differences:
Still Working!
Some Notes
* Certain modifications regarding cross-GPU mapping support [link] might result in significant memory usage problems. I cannot reproduce this problem right now.
Status