Add VLM support #220

Conversation
@aymeric-roucher we can keep images in the action step under a separate key. Models do not normally produce images, so if we put images under an "images" key it will break the chat template. If we keep images for the sake of keeping them, we can put them under a different key like "observation_images" (just like we do in the function).
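For illustration, a minimal sketch of what that could look like, assuming a dataclass-style step record (the `observation_images` name is the key proposed above; the surrounding structure is an assumption, not the final API):

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ActionStep:
    llm_output: Optional[str] = None  # text produced by the model
    observations: Optional[str] = None  # textual tool/execution output
    # Images kept under their own key so the chat template, which expects
    # text under the usual keys, is not broken by image payloads.
    observation_images: List[Any] = field(default_factory=list)
```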
We need to unify the image handling for both OpenAI and transformers, I think. I saw you overwrote templates, which could break transformers. Will handle tomorrow.
@aymeric-roucher can I write the tests now? Will there be changes to the API?
src/smolagents/monitoring.py (outdated)

```diff
@@ -40,7 +40,7 @@ def reset(self):
         self.total_input_token_count = 0
         self.total_output_token_count = 0

-    def update_metrics(self, step_log):
+    def update_metrics(self, step_log, agent):
```
@aymeric-roucher why did you add `agent` here? I don't see any changes where it's used (just ICYMI).
@merve this is a general change in the callback logic: the idea is to let callback functions access the whole agent, for instance to read token counts from `agent.monitor`. But I'm not sure this is the most ergonomic solution.
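For example, a new-style callback might read the monitor like this (a sketch: the monitor attribute names come from the monitoring.py diff above; the registration line is hypothetical):

```python
def log_token_usage(step_log, agent):
    """Step callback with access to the whole agent, not just the step log."""
    monitor = agent.monitor
    print(
        f"Tokens so far: {monitor.total_input_token_count} in, "
        f"{monitor.total_output_token_count} out"
    )

# Hypothetical registration via the step_callbacks list seen in the diff below:
# agent = CodeAgent(tools=[], model=model, step_callbacks=[log_token_usage])
```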
Is this change in the callback logic necessary for the VLM support? Or could we address it in a separate PR?
@albertvillanova in cases like taking screenshots, callbacks are necessary at every step; other than that, I think no.
Image handling is a bit like that too, if an image needs to be kept at every step or you need to retrieve images dynamically (e.g. from a knowledge base).
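As an illustration of the screenshot case, such a per-step callback could look like the sketch below (Pillow and a Selenium driver stored on the agent are assumptions for the example, not part of this PR):

```python
from io import BytesIO

from PIL import Image


def save_screenshot(step_log, agent):
    """Run after every step: capture a screenshot and attach it to the step."""
    driver = agent.state["driver"]  # assumed: a Selenium WebDriver kept in agent state
    image = Image.open(BytesIO(driver.get_screenshot_as_png()))
    # Stored under a dedicated key so it stays out of the text-only chat template
    step_log.observation_images = [image]
```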
Force-pushed from de7e376 to bdd8847.
@merveenoyan do
> [!TIP]
> Read the [Open-source LLMs as LangChain Agents](https://huggingface.co/blog/open-source-llms-as-agents) blog post to learn more about multi-step agents.

1. **Thought:** This is the first step, initializing the system: prompting it on how it should behave (`SystemPromptStep`), giving the facts about the task at hand (`PlanningStep`), and providing the task itself (`TaskStep`). The system prompt, facts, and task prompt are appended to memory. Facts are updated at each step until the agent produces the final response. If there are any images in the prompt, they are fed to `TaskStep`.
2. **Action:** This is where all the action is taken, including LLM inference and callback execution. After inference, the output of the LLM/VLM (the "observations") is fed to `ActionStep`. Callbacks are functions executed at the end of every step. A good callback example is taking screenshots and adding them to the agent's state in an agentic web browser.
@merveenoyan in the ReAct framework, both Thought and Action are inside the while loop.
So I'd really make a distinction here between 1. Initialization and 2. the while loop, with 2.1 Thought (basically LLM generation + parsing) and 2.2 Action (execution of the action).
I've reworded the blog post as well to convey this.
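In rough pseudocode, that structure reads like this (a sketch of the control flow being described; `parse_action` and the message handling are simplifications, not the actual implementation):

```python
def run_agent(llm, tools, task, max_steps=10):
    # 1. Initialization: system prompt, facts, and task go into memory once
    memory = [
        {"role": "system", "content": "You are a helpful agent."},
        {"role": "user", "content": task},
    ]
    # 2. The ReAct while loop
    for _ in range(max_steps):
        # 2.1 Thought: LLM generation + parsing
        thought = llm(memory)
        action, is_final = parse_action(thought)  # hypothetical parser
        if is_final:
            return action
        # 2.2 Action: execute and feed the observation back into memory
        observation = tools[action["tool"]](**action["arguments"])
        memory.append({"role": "assistant", "content": thought})
        memory.append({"role": "user", "content": f"Observation: {observation}"})
    raise RuntimeError("max steps exceeded without a final answer")
```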
Sounds good!
Taking a look now.
…nto add-vlm-support
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```diff
@@ -587,7 +588,11 @@ def _run(self, task: str, images: List[str] | None = None) -> Generator[str, None, None]:
             step_log.duration = step_log.end_time - step_start_time
             self.logs.append(step_log)
             for callback in self.step_callbacks:
-                callback(step_log, self)
+                # For compatibility with old callbacks that don't take the agent as an argument
+                if len(inspect.signature(callback).parameters) == 1:
```
@albertvillanova open question: I'm wondering whether it's good to support two options for passing args to a function?
As commented above, it is to avoid breaking code that uses old callbacks with a single argument.
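Concretely, the dispatch the truncated hunk above is building toward presumably looks something like this (an assumption about the final shape, not a quote from the PR):

```python
import inspect


def run_step_callbacks(step_callbacks, step_log, agent):
    for callback in step_callbacks:
        # Old-style callbacks accept only the step log; new-style ones also get the agent
        if len(inspect.signature(callback).parameters) == 1:
            callback(step_log)
        else:
            callback(step_log, agent)
```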
src/smolagents/models.py (outdated)

```diff
@@ -220,15 +220,15 @@ def get_clean_message_list(
                 }
             else:
                 message["content"][i]["image"] = encode_image_base64(element["image"])

+    breakpoint()
```
For debugging?
Should be removed I think!
I removed it a few commits ago, sorry!
Thanks a lot! This contribution is AWESOME!!! 🤗
This PR adds VLM support (closing the other one for the sake of collaboration) @aymeric-roucher
…whereas with VLMs we do it like the following, so I modified it a bit. But you access `content` and modify it here and there, so it is broken (fixing). Will open for review once I fix these.
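For context, and assuming the list-of-chunks content layout implied by the models.py diff above, a VLM message looks roughly like this (the exact keys are illustrative, not confirmed by this PR):

```python
# Text-only chat: "content" is a plain string
text_message = {"role": "user", "content": "Describe the image."}

# VLM chat: "content" becomes a list of typed chunks, so any code that
# assumes a string and mutates message["content"] in place will break
vlm_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the image."},
        {"type": "image", "image": "<base64-encoded image>"},
    ],
}
```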