
Add VLM support #220

Merged · 73 commits merged into main on Jan 24, 2025
Conversation

merveenoyan (Contributor) commented Jan 16, 2025

This PR adds VLM support (closing the other one for the sake of collaboration) @aymeric-roucher

  • This PR at its creation is probably broken (opening it early since you wanted to see it), primarily because right now, when writing memory, we adopt chat templates like the following:
messages = [
  {"role": "user", "content": "I'm doing great. How can I help you today?"},
]

whereas with VLMs the content is a list of typed elements, so I modified it a bit:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "I'm doing great. How can I help you today?"},
            {"type": "image"},
        ],
    },
]

but you access `content` and modify it here and there, so those paths are broken (fixing).

  • Secondly, I need to check that each image is added only once, because passing the same image more than once will break inference; I currently add it in multiple steps, so I will check.

Will open to review once I fix these.

merveenoyan (Contributor, Author)

@aymeric-roucher we can keep images in the action step under a separate key. Normally models do not produce images, so if we put images under an "images" key inside the messages it will break the chat template. If we keep images for the sake of keeping them, we can keep them under a different key like "observation_images" (just like how we do in the function).
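A minimal sketch of that idea (the step class and the `observation_images` field shown here are assumptions for illustration, not the final API): images stay on the step object and are only turned into `{"type": "image"}` placeholders when the memory is converted to messages.

from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class ActionStep:
    # hypothetical step record for illustration; the real class may differ
    llm_output: Optional[str] = None
    observation: Optional[str] = None
    # raw images live under a separate key so they never go through the chat template
    observation_images: List[Any] = field(default_factory=list)

    def to_messages(self) -> List[dict]:
        # text observation plus one {"type": "image"} placeholder per stored image
        content = [{"type": "text", "text": self.observation or ""}]
        content += [{"type": "image"} for _ in self.observation_images]
        return [{"role": "user", "content": content}]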

merveenoyan mentioned this pull request on Jan 16, 2025
merveenoyan (Contributor, Author)

We need to unify the image handling for both OpenAI and transformers, I think. I saw you overwrote the templates, which could break transformers. Will handle it tomorrow.
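For reference, one possible shape of that unification (a sketch under assumptions; the real `get_clean_message_list` and helper names may differ): transformers chat templates take `{"type": "image"}` placeholders with the images passed separately, while OpenAI-style APIs want the image inlined as a base64 data URL.

import base64
from io import BytesIO

def encode_image_base64(image) -> str:
    # assumes a PIL.Image.Image; serialize to PNG and base64-encode it
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

def to_openai_image_content(element: dict) -> dict:
    # convert a transformers-style image element into an OpenAI-style one
    if element.get("type") == "image":
        return {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image_base64(element['image'])}"},
        }
    return element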

merveenoyan (Contributor, Author)

@aymeric-roucher can I write the tests now? Will there be changes to the API?

@@ -40,7 +40,7 @@ def reset(self):
         self.total_input_token_count = 0
         self.total_output_token_count = 0

-    def update_metrics(self, step_log):
+    def update_metrics(self, step_log, agent):
merveenoyan (Contributor, Author) commented Jan 19, 2025

@aymeric-roucher why did you add `agent` here? I don't see any changes where it's used (just ICYMI).

Collaborator

@merve this is a general change in callback logic: the idea is to let callback functions access the whole agent, for instance to read token counts from agent.monitor. But I'm not sure this is the most ergonomic solution.
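For instance, a callback like this (a sketch; the monitor attribute names are taken from the diff above) only works if the agent is passed in:

def log_token_usage(step_log, agent):
    # with the agent available, the callback can read global state such as the monitor
    print(
        f"input tokens so far: {agent.monitor.total_input_token_count}, "
        f"output tokens so far: {agent.monitor.total_output_token_count}"
    )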

Member

Is this change in the callback logic necessary for the VLM support? Or could we address it in a separate PR?

merveenoyan (Contributor, Author)

@albertvillanova in cases like taking screenshots, callbacks are necessary at every step; other than that, I think no.

Image handling is a bit like that too, when an image needs to be kept at every step or you need to dynamically retrieve images (e.g. from a knowledge base).
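A minimal sketch of such a screenshot callback (hypothetical; it assumes a Selenium-style driver and reuses the `observation_images` idea from above):

from functools import partial
from io import BytesIO

from PIL import Image

def save_screenshot(step_log, agent, driver):
    # grab a screenshot at the end of each step and attach it to the step,
    # so the VLM sees the current page state on the next turn
    png_bytes = driver.get_screenshot_as_png()  # Selenium WebDriver API
    step_log.observation_images = [Image.open(BytesIO(png_bytes))]

# registered once at agent creation, binding the driver, e.g.:
# agent = CodeAgent(tools=[...], model=model,
#                   step_callbacks=[partial(save_screenshot, driver=browser_driver)])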

aymeric-roucher (Collaborator)

@merveenoyan do TransformersModel VLMs work in the current state? Also one image-related test is failing.

> [!TIP]
> Read [Open-source LLMs as LangChain Agents](https://huggingface.co/blog/open-source-llms-as-agents) blog post to learn more about multi-step agents.
1. **Thought:** This is the first step, initializing the system: prompting it on how it should behave (`SystemPromptStep`), the facts about the task at hand (`PlanningStep`), and providing the task itself (`TaskStep`). The system prompt, facts and task prompt are appended to the memory. Facts are updated at each step until the agent produces the final response. If there are any images in the prompt, they are fed to `TaskStep`.
2. **Action:** This is where all the action is taken, including LLM inference and callback execution. After inference takes place, the output of the LLM/VLM (called "observations") is fed to `ActionStep`. Callbacks are functions executed at the end of every step. A good callback example is taking screenshots and adding them to the agent's state in an agentic web browser.
Collaborator

@merveenoyan in the ReAct framework, both Thought and Action are inside the while loop.
So I'd really make a distinction here between 1. Initialization and 2. While loop, with 2.1 Thought (basically the LLM generation + parsing) and 2.2 Action (execution of the action).

I've reworded the blog post as well to convey this.

merveenoyan (Contributor, Author)

sounds good!

merveenoyan (Contributor, Author)

taking a look now

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -587,7 +588,11 @@ def _run(self, task: str, images: List[str] | None = None) -> Generator[str, Non
             step_log.duration = step_log.end_time - step_start_time
             self.logs.append(step_log)
             for callback in self.step_callbacks:
-                callback(step_log, self)
+                # For compatibility with old callbacks that don't take the agent as an argument
+                if len(inspect.signature(callback).parameters) == 1:
Collaborator

@albertvillanova open question: wondering if it's good to support two options for passing args to a function?

albertvillanova (Member) commented Jan 24, 2025

As commented above, it is to avoid breaking code that uses old callbacks with a single argument.
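Roughly, the dispatch could be sketched like this (a hypothetical standalone helper mirroring the check in the diff above; the real code lives inline in `_run`):

import inspect

def run_step_callbacks(agent, step_log):
    for callback in agent.step_callbacks:
        # old-style callbacks take only the step log; new-style ones also receive the agent
        if len(inspect.signature(callback).parameters) == 1:
            callback(step_log)
        else:
            callback(step_log, agent)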

@@ -220,15 +220,15 @@ def get_clean_message_list(
                }
            else:
                message["content"][i]["image"] = encode_image_base64(element["image"])

    breakpoint()
Member

For debugging?

Collaborator

Should be removed I think!

merveenoyan (Contributor, Author)

I removed it a few commits ago, sorry!

merveenoyan marked this pull request as ready for review on January 24, 2025, 14:45
albertvillanova (Member) left a comment

Thanks a lot! This contribution is AWESOME!!! 🤗

albertvillanova merged commit 408b52a into main on Jan 24, 2025
5 checks passed