Implement the Llama 3.2 vision models #796

EricLBuehler · 2024-09-26T00:40:40Z

🚨🚨🚨Model is working and ready for imminent release!🚨🚨🚨

Last few steps:

Forward pass runs
Correct values confirmed from inputs processor
Correct values confirmed from vision model
Correct values confirmed from text model

Implementation status:

github-actions · 2024-09-26T00:41:50Z

Code Metrics Report

  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                2           35           28            0            7
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                   12          105          104            0            1
 Python                 50         2165         1841           64          260
 TOML                   20          621          556            2           63
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       4            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          196          169            1           26
 (Total)                            273          201           32           40
-------------------------------------------------------------------------------
 Markdown               34         2425            0         1850          575
 |- BASH                 5          101           98            0            3
 |- JSON                 1           12           12            0            0
 |- Python               5           92           82            0           10
 |- Rust                 8          478          425           22           31
 |- TOML                 2           75           63            0           12
 (Total)                           3183          680         1872          631
-------------------------------------------------------------------------------
 Rust                  253        71654        64733         1353         5568
 |- Markdown           121         1174           25         1081           68
 (Total)                          72828        64758         2434         5636
===============================================================================
 Total                 380        77502        67675         3271         6556
===============================================================================

EricLBuehler · 2024-09-27T18:16:02Z

RUST_BACKTRACE=1 cargo run --features cuda -- --port 1234 vision-plain -m meta-llama/Llama-3.2-11B-Vision-Instruct -a mllama

EricLBuehler · 2024-09-29T02:29:04Z

Run:

cargo run --features cuda --release -- -i vision-plain -m meta-llama/Llama-3.2-11B-Vision-Instruct -a mllama

And then...

> \image https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg <|image|>Where was this photo most likely taken?
The photo appears to be taken in the White Mountains of New Hampshire, USA. The mountain in the background is likely Mount Washington, which is the highest peak in the Northeastern United States and is known for its iconic summit and challenging weather conditions.
>

EricLBuehler · 2024-09-29T12:14:41Z

> Hello!
How can I assist you today?
> What is the date?
The current date is September 29, 2024.
> \image https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg <|image|>Where was this photo most likely taken?
The photo appears to be of a mountainous landscape with a snow-covered peak in the background. Based on the scenery, it's likely that the photo was taken in a region with mountains, possibly in North America or Europe. However, without more information or context, it's difficult to pinpoint the exact location.

That being said, the mountainous landscape and snow-covered peak remind me of the Northeastern United States, particularly the Appalachian Mountains or the White Mountains in New Hampshire. The White Mountains are home to Mount Washington, the highest peak in the Northeast, which is known for its rugged terrain and snowy winters.

If I had to take a guess, I would say that the photo was likely taken in New Hampshire or another mountainous region in the Northeastern United States.
> \image https://upload.wikimedia.org/wikipedia/commons/f/fd/Pink_flower.jpg <|image|>What type of flower is this?
The flower in the photo appears to be a type of wildflower or a daisy-like flower. However, based on the shape and color of the petals, it's difficult to determine the exact type of flower without more information or a closer look.

That being said, the flower's white petals and yellow center remind me of a daisy or a sunflower. However, the petals seem to be slightly more delicate and have a more rounded shape than a typical daisy or sunflower.

If I had to take a guess, I would say that the flower is likely a type of wildflower, such as a buttercup or a dandelion. However, without more information or a closer look, it's difficult to determine the exact type of flower.
>

Add the MLlama vision bits

0fa6564

EricLBuehler added new feature New feature or request models Additions to model or architectures labels Sep 26, 2024

EricLBuehler added 16 commits September 25, 2024 20:49

Restructure

c3132ca

Typos

bbb683e

Add skeleton for text model, add text mlp

b2dfb5f

Add the self and cross attn text model parts

6c754bd

Add mllama model

edafa94

Add most of the preprocessor

2b237e4

Add the rest of the processor and wire things up

e7cda94

Clean up a bit

4a114e8

Add an example

83dbf93

Rename

cc4f86d

Loads now

4c5af8e

Another batch of fixes

5982b03

Vision model forward runs

cc8bbda

Add back in the cache for cross attn

0dd5f26

Inputs processor gives correct values

c2f29e3

Fix the nans

f7e94be

EricLBuehler added 9 commits September 28, 2024 03:57

Problem seems to be in vision encoder

bb10fea

Upcasting seems to do something

f512388

Problems confirmed to ONLY be in text model

941d8ef

Maybe remove some nans

3a5b190

Seems to work now!!

f4193e7

Confirmed working, remove the debuggers

3c38532

Preapply the tanh

4c158fd

Rework the interactive mode

61d5774

A bugfix

37839c4

Another bugfix!

f64dbf0

EricLBuehler added 5 commits September 29, 2024 05:33

Add device mapping support

21f282c

Add ISQ support for mllama

eef7cbb

Add ISQ support

473892a

Add support for no images and multi images

910b6c3

Fix dim

df4b20d

EricLBuehler and others added 4 commits September 29, 2024 08:28

Fix slice assign dim

fc53ef9

Add examples and docs

0d65def

Add a demo video

1357a5c

Update VLLAMA.md

f00bd10

EricLBuehler merged commit f33ac29 into master Sep 29, 2024
12 checks passed

EricLBuehler deleted the mllama branch September 29, 2024 15:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement the Llama 3.2 vision models #796

Implement the Llama 3.2 vision models #796

EricLBuehler commented Sep 26, 2024 •

edited

Loading

github-actions bot commented Sep 26, 2024 •

edited

Loading

EricLBuehler commented Sep 27, 2024 •

edited

Loading

EricLBuehler commented Sep 29, 2024

EricLBuehler commented Sep 29, 2024

Implement the Llama 3.2 vision models #796

Implement the Llama 3.2 vision models #796

Conversation

EricLBuehler commented Sep 26, 2024 • edited Loading

github-actions bot commented Sep 26, 2024 • edited Loading

EricLBuehler commented Sep 27, 2024 • edited Loading

EricLBuehler commented Sep 29, 2024

EricLBuehler commented Sep 29, 2024

EricLBuehler commented Sep 26, 2024 •

edited

Loading

github-actions bot commented Sep 26, 2024 •

edited

Loading

EricLBuehler commented Sep 27, 2024 •

edited

Loading