
Add VLM support #220

Merged · 73 commits merged into main on Jan 24, 2025
Conversation

merveenoyan (Contributor) commented Jan 16, 2025

This PR adds VLM support (closing the other one for the sake of collaboration) @aymeric-roucher

  • This PR at its creation is probably broken (opening it early since you wanted to see it), primarily because right now, when writing memory, we adopt chat templates like the following:
messages = [
  {"role": "user", "content": "I'm doing great. How can I help you today?"},
]

whereas with VLMs the content is a list of typed elements, so I modified it a bit:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "I'm doing great. How can I help you today?"},
            {"type": "image"},
        ],
    },
]

but you access `content` and modify it here and there, so those paths are broken (fixing).

  • Secondly, I need to check that each image is added only once, because passing the same image more than once will break inference; I currently add it in multiple steps, so I will check.

Will open to review once I fix these.

merveenoyan (Contributor, Author)

@aymeric-roucher we can keep images in the action step under a separate key. Normally models do not produce images, so if we put images under an "images" key inside the messages it will break the chat template. If we keep images for the sake of keeping them, we can keep them under a different key like "observation_images" (just like how we do in the function).
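A minimal sketch of that idea (the step class and the `observation_images` field shown here are assumptions for illustration, not the final API): images stay on the step object and are only turned into `{"type": "image"}` placeholders when the memory is converted to messages.

from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class ActionStep:
    # hypothetical step record for illustration; the real class may differ
    llm_output: Optional[str] = None
    observation: Optional[str] = None
    # raw images live under a separate key so they never go through the chat template
    observation_images: List[Any] = field(default_factory=list)

    def to_messages(self) -> List[dict]:
        # text observation plus one {"type": "image"} placeholder per stored image
        content = [{"type": "text", "text": self.observation or ""}]
        content += [{"type": "image"} for _ in self.observation_images]
        return [{"role": "user", "content": content}]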

merveenoyan mentioned this pull request on Jan 16, 2025
merveenoyan (Contributor, Author)

We need to unify the image handling for both OpenAI and transformers, I think. I saw you overwrote the templates, which could break transformers. Will handle it tomorrow.
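For reference, one possible shape of that unification (a sketch under assumptions; the real `get_clean_message_list` and helper names may differ): transformers chat templates take `{"type": "image"}` placeholders with the images passed separately, while OpenAI-style APIs want the image inlined as a base64 data URL.

import base64
from io import BytesIO

def encode_image_base64(image) -> str:
    # assumes a PIL.Image.Image; serialize to PNG and base64-encode it
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

def to_openai_image_content(element: dict) -> dict:
    # convert a transformers-style image element into an OpenAI-style one
    if element.get("type") == "image":
        return {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image_base64(element['image'])}"},
        }
    return element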

merveenoyan (Contributor, Author)

@aymeric-roucher can I write the tests now? Will there be changes to the API?

@@ -40,7 +40,7 @@ def reset(self):
         self.total_input_token_count = 0
         self.total_output_token_count = 0

-    def update_metrics(self, step_log):
+    def update_metrics(self, step_log, agent):
merveenoyan (Contributor, Author) commented Jan 19, 2025

@aymeric-roucher why did you add `agent` here? I don't see any changes where it's used (just ICYMI).

Collaborator

@merve this is a general change in callback logic: the idea is to let callback functions access the whole agent, for instance to read token counts from agent.monitor. But I'm not sure this is the most ergonomic solution.
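For instance, a callback like this (a sketch; the monitor attribute names are taken from the diff above) only works if the agent is passed in:

def log_token_usage(step_log, agent):
    # with the agent available, the callback can read global state such as the monitor
    print(
        f"input tokens so far: {agent.monitor.total_input_token_count}, "
        f"output tokens so far: {agent.monitor.total_output_token_count}"
    )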

Member

Is this change in the callback logic necessary for the VLM support? Or could we address it in a separate PR?

merveenoyan (Contributor, Author)

@albertvillanova in cases like taking screenshots, callbacks are necessary at every step; other than that, I think no.

Image handling is a bit like that too, when an image needs to be kept at every step or you need to dynamically retrieve images (e.g. from a knowledge base).
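A minimal sketch of such a screenshot callback (hypothetical; it assumes a Selenium-style driver and reuses the `observation_images` idea from above):

from functools import partial
from io import BytesIO

from PIL import Image

def save_screenshot(step_log, agent, driver):
    # grab a screenshot at the end of each step and attach it to the step,
    # so the VLM sees the current page state on the next turn
    png_bytes = driver.get_screenshot_as_png()  # Selenium WebDriver API
    step_log.observation_images = [Image.open(BytesIO(png_bytes))]

# registered once at agent creation, binding the driver, e.g.:
# agent = CodeAgent(tools=[...], model=model,
#                   step_callbacks=[partial(save_screenshot, driver=browser_driver)])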

aymeric-roucher (Collaborator)

@merveenoyan do TransformersModel VLMs work in the current state? Also one image-related test is failing.

> [!TIP]
> Read [Open-source LLMs as LangChain Agents](https://huggingface.co/blog/open-source-llms-as-agents) blog post to learn more about multi-step agents.
1. **Thought:** This is the first step, initializing the system: prompting it on how it should behave (`SystemPromptStep`), the facts about the task at hand (`PlanningStep`), and providing the task itself (`TaskStep`). The system prompt, facts and task prompt are appended to the memory. Facts are updated at each step until the agent produces the final response. If there are any images in the prompt, they are fed to `TaskStep`.
2. **Action:** This is where all the action is taken, including LLM inference and callback execution. After inference takes place, the output of the LLM/VLM (called "observations") is fed to `ActionStep`. Callbacks are functions executed at the end of every step. A good callback example is taking screenshots and adding them to the agent's state in an agentic web browser.
Collaborator

@merveenoyan in the ReAct framework, both Thought and Action are inside the while loop.
So I'd really make a distinction here between 1. Initialization and 2. While loop, with 2.1 Thought (basically the LLM generation + parsing) and 2.2 Action (execution of the action).

I've reworded the blog post as well to convey this.

merveenoyan (Contributor, Author)

sounds good!

merveenoyan (Contributor, Author)

taking a look now

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -587,7 +588,11 @@ def _run(self, task: str, images: List[str] | None = None) -> Generator[str, Non
             step_log.duration = step_log.end_time - step_start_time
             self.logs.append(step_log)
             for callback in self.step_callbacks:
-                callback(step_log, self)
+                # For compatibility with old callbacks that don't take the agent as an argument
+                if len(inspect.signature(callback).parameters) == 1:
Collaborator

@albertvillanova open question: wondering if it's good to support two options for passing args to a function?

albertvillanova (Member) commented Jan 24, 2025

As commented above, it is to avoid breaking code that uses old callbacks with a single argument.
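Roughly, the dispatch could be sketched like this (a hypothetical standalone helper mirroring the check in the diff above; the real code lives inline in `_run`):

import inspect

def run_step_callbacks(agent, step_log):
    for callback in agent.step_callbacks:
        # old-style callbacks take only the step log; new-style ones also receive the agent
        if len(inspect.signature(callback).parameters) == 1:
            callback(step_log)
        else:
            callback(step_log, agent)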

@@ -220,15 +220,15 @@ def get_clean_message_list(
                }
            else:
                message["content"][i]["image"] = encode_image_base64(element["image"])

    breakpoint()
Member

For debugging?

Collaborator

Should be removed I think!

merveenoyan (Contributor, Author)

I removed it a few commits ago, sorry!

merveenoyan marked this pull request as ready for review on January 24, 2025, 14:45
albertvillanova (Member) left a comment

Thanks a lot! This contribution is AWESOME!!! 🤗

albertvillanova merged commit 408b52a into main on Jan 24, 2025
5 checks passed