Add LLaVA Support #484
Conversation
…llava continue working
Code Metrics Report
===============================================================================
 Language             Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile               1           34           25            0            9
 Happy                    1          442          369            0           73
 JSON                    10           66           65            0            1
 Python                  36         1412         1220           38          154
 TOML                    18          518          458           11           49
-------------------------------------------------------------------------------
 Jupyter Notebooks        1            0            0            0            0
 |- Markdown              1           60           30           22            8
 |- Python                1           96           87            1            8
 (Total)                             156          117           23           16
-------------------------------------------------------------------------------
 Markdown                21         1554            0         1173          381
 |- BASH                  5          101           98            0            3
 |- JSON                  1           12           12            0            0
 |- Python                5           98           88            0           10
 |- Rust                  4          221          198           10           13
 |- TOML                  2           75           63            0           12
 (Total)                            2061          459         1183          419
-------------------------------------------------------------------------------
 Rust                   132        44214        40090          775         3349
 |- Markdown             77          723           13          671           39
 (Total)                           44937        40103         1446         3388
===============================================================================
 Total                  221        48240        42227         1997         4016
===============================================================================
Hi @chenwanqq! Please let me know when this is ready for review.
Hhh, I think it's ready. Some of the docs and examples may be a little rushed, but I think we can review it 🤣
The addition of AnyMoE (which modifies the dependencies of NormalModel) affects the LLaVA LLM backend. However, the current solution based on copied code is only a temporary one, so I have removed the NormalModel trait from LLaVALLM altogether. Once the fused RoPE is fixed, we should change LLaVA to rely on an ordinary model as soon as possible. Edit: I added MoE support code for LLaVA, but never really tested it. From what I see in Phi3V, it seems AnyMoE only affects the LLM part and has nothing to do with the image embedding?
And LLaVANext is just LLaVA 1.5 plus a multiple-resolution trick (anyres), so they share some infrastructure. If this causes any confusion, we can modify it.
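For illustration, here is a minimal sketch of what the decoupled text backend could look like once it no longer implements NormalModel. The trait name and signatures below are hypothetical, not the actual mistral.rs API:

```rust
// Illustrative only: names and signatures are hypothetical, not the real mistral.rs traits.
use candle_core::{Result, Tensor};

pub trait LlavaTextBackend {
    /// Embed raw token ids so the vision side can splice image features into the sequence.
    fn embed_tokens(&self, input_ids: &Tensor) -> Result<Tensor>;
    /// Decode from already-merged (text + image) embeddings instead of token ids.
    fn forward_embeds(&mut self, input_embeds: &Tensor, seqlen_offset: usize) -> Result<Tensor>;
}
```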
Hi @chenwanqq! This looks amazing, thanks for adding it.
I think there are some inconsistencies (probably mistakes?) with the image tokens in the examples: some look like they are from Idefics (no image token) and some like they are from Phi 3 (have <|image_1|>), but according to the HF model card, it expects <image>. This also extends to the regexes in the inputs processor. I have commented where the changes should be made, can you please update it?
mistralrs-core/src/vision_models/llava/llava_next_inputs_processor.rs
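As a hedged illustration of the expected token, here is a standalone sketch (not the actual inputs-processor code) that matches <image>, as the HF model card specifies, rather than Phi-3 style <|image_1|> tags:

```rust
// Standalone sketch, not the real llava_next_inputs_processor.rs code.
use regex::Regex; // assumes the `regex` crate is available

/// Count `<image>` placeholders in a prompt so they can be matched against the supplied images.
fn count_image_placeholders(prompt: &str) -> usize {
    let re = Regex::new(r"<image>").expect("static pattern is valid");
    re.find_iter(prompt).count()
}

fn main() {
    assert_eq!(count_image_placeholders("USER: <image>\nWhat is shown? ASSISTANT:"), 1);
}
```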
I see. I think we should use the Llava
That makes sense. Given that we support multiple images, if we could document that it would be great! 😃
I've fixed the image tag problem and the AnyMoE problem accordingly. Maybe we can have another review?
Hi @chenwanqq! I made a few small edits to the model support matrix and the model architecture docs. This looks good, but it crashes when I run:
…
And
…
With
…
It seems to come from the CLIP embeddings. Note that for ISQ models, the VBs are on the CPU and are cast using … I would recommend that you check out the Idefics 2 model, as I do … there. When I run without ISQ, it successfully produces the correct output though - great work here!
That's exactly the problem! With the new commit, I believe I fixed it (tested on llava-v1.6-vicuna-7b-hf, llava-v1.6-mistral-7b-hf, and llava-1.5-7b-hf). However, when I deal with Llama-like models (such as Vicuna), there is a device mismatch problem with rmsnorm:
When I checked the code, I found that the Llama and Mistral code have slight differences in initialization:
mistral.rs/mistralrs-core/src/models/llama.rs (lines 291 to 299 in c8b71e9)
mistral.rs/mistralrs-core/src/models/mistral.rs (lines 288 to 296 in c8b71e9)
I changed it to the Mistral way in my copied version of the backend Llama LLM. I don't have the Llama weights right now (and it's not very easy for me to download them), so maybe you can check whether the Llama text model has this issue.
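To illustrate the initialization point, here is a hedged sketch with hypothetical helper names (not the exact llama.rs or mistral.rs code): the norm weight should be created on the device the layer is mapped to, so rms_norm does not end up with a CPU weight against GPU hidden states.

```rust
// Hedged sketch; `load_rms_norm` and `layer_device` are illustrative names.
use candle_core::{Device, Result};
use candle_nn::{RmsNorm, VarBuilder};

/// Build an RmsNorm whose weight is guaranteed to live on the layer's mapped device.
fn load_rms_norm(
    vb: &VarBuilder,
    name: &str,
    hidden_size: usize,
    eps: f64,
    layer_device: &Device,
) -> Result<RmsNorm> {
    // Move the weight to the per-layer device so it matches the hidden states there.
    let weight = vb.get(hidden_size, name)?.to_device(layer_device)?;
    Ok(RmsNorm::new(weight, eps))
}
```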
Looks good! Thank you for adding this. I'll need to check the RoPE instability in a future PR but I think the current solution + LlavaModel is good enough.
Introduction
This implementation is based on my work for candle. However, it incorporates some notable differences:
Still Working!
Some Notes
* Certain modifications regarding cross-GPU mapping support [link] might result in significant memory usage problems. I cannot reproduce this problem right now.
Status