Weird tool call arguments, resulting in UnexpectedModelBehaviour / validation error #81

Open · intellectronica opened this issue Nov 21, 2024 · 11 comments

@intellectronica

See https://github.com/intellectronica/pydantic-ai-experiments/blob/main/scratch.ipynb

Traceback (most recent call last):
  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/pydantic_ai/_result.py", line 189, in validate
    result = self.type_adapter.validate_json(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/pydantic/type_adapter.py", line 425, in validate_json
    return self.validator.validate_json(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 2 validation errors for Question
reflection
  Field required [type=missing, input_value={'_': {'reflection': "The...n': 'Is it an animal?'}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing
question
  Field required [type=missing, input_value={'_': {'reflection': "The...n': 'Is it an animal?'}}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/missing

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/pydantic_ai/agent.py", line 654, in _handle_model_response
    result_data = result_tool.validate(call)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/pydantic_ai/_result.py", line 203, in validate
    raise ToolRetryError(m) from e
pydantic_ai._result.ToolRetryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/pydantic_ai/agent.py", line 181, in run
    either = await self._handle_model_response(model_response, deps)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/pydantic_ai/agent.py", line 657, in _handle_model_response
    self._incr_result_retry()
  File "/usr/local/python/3.12.1/lib/python3.12/site-packages/pydantic_ai/agent.py", line 751, in _incr_result_retry
    raise exceptions.UnexpectedModelBehavior(
pydantic_ai.exceptions.UnexpectedModelBehavior: Exceeded maximum retries (1) for result validation

See the prefixed "_"? It's not there in earlier calls. Possibly a hallucination.
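
To make the mismatch concrete: the arguments the model returned are wrapped in an extra "_" key, while the Question result tool expects the fields at the top level (values abbreviated from the error message above):

# What the model sent back (abbreviated) - note the spurious "_" wrapper:
bad_args = {'_': {'reflection': 'The ...', 'question': 'Is it an animal?'}}

# What the Question schema actually expects:
good_args = {'reflection': 'The ...', 'question': 'Is it an animal?'}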

I think this can be avoided with strict mode. It would be great to have it as an option for OpenAI calls.
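
For reference, this is roughly what strict mode looks like when calling the OpenAI Chat Completions API directly, outside pydantic-ai; the tool and field names below are illustrative, not pydantic-ai internals. Strict mode requires "strict": True on the function, every property listed in "required", and "additionalProperties": False:

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': 'Ask the next question'}],
    tools=[{
        'type': 'function',
        'function': {
            'name': 'final_result',  # illustrative tool name
            'description': 'Return the next question',
            'strict': True,  # server-side schema enforcement
            'parameters': {
                'type': 'object',
                'properties': {
                    'reflection': {'type': 'string'},
                    'question': {'type': 'string'},
                },
                'required': ['reflection', 'question'],
                'additionalProperties': False,
            },
        },
    }],
)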

@intellectronica (Author)

@samuelcolvin ^^^^^

@samuelcolvin (Member)

Weird, not sure what's going on. I ran your code and it worked first time.

I ran it directly as a script, and it worked fine:

from enum import Enum
from textwrap import dedent
from typing import List

from pydantic import BaseModel, Field
from pydantic_ai import Agent, CallContext


class Question(BaseModel):
    reflection: str = Field(..., description='Considering the questions and answers so far, what are things we can ask next?')
    question: str = Field(..., description='The question to ask the other player')


asking_agent = Agent('openai:gpt-4o', result_type=Question)


@asking_agent.system_prompt
async def asking_agent_system_prompt(ctx: CallContext[List]) -> str:
    turns = ctx.deps
    prompt = dedent(f"""
        You are playing a game of 20 questions.
        You are trying to guess the object the other player is thinking of.
        In each turn, you can ask a yes or no question.
        The other player will answer with "yes", "no".
    """).strip()
    if len(turns) > 0:
        prompt += f"\nHere are the questions you have asked so far and the answers you have received:\n"
        prompt += '\n'.join([' * ' + turn for turn in turns])
    return prompt


class Answer(str, Enum):
    YES = 'yes'
    NO = 'no'
    YOU_WIN = 'you win'


class AnswerResponse(BaseModel):
    reflection: str = Field(..., description=(
        'Considering the question, what is the answer? '
        'Is it "yes" or "no"? Or did they guess the '
        'object and the answer is "you win"?'))
    answer: Answer = Field(..., description='The answer to the question - "yes", "no", or "you win"')


answering_agent = Agent('openai:gpt-4o', result_type=AnswerResponse)


@answering_agent.system_prompt
async def answering_agent_system_prompt(ctx: CallContext[str]) -> str:
    prompt = dedent(f"""
        You are playing a game of 20 questions.
        The other player is trying to guess the object you are thinking of.
        The object you are thinking of is: {ctx.deps}.
        Answer with "yes" or "no", or "you win" if the other player has guessed the object.
    """).strip()
    return prompt


def twenty_questions(mystery_object):
    turns = []
    while True:
        # Ask the next question, passing the transcript of previous turns as deps.
        question = asking_agent.run_sync('Ask the next question', deps=turns).data.question
        # Answer it, passing the mystery object as deps.
        answer = answering_agent.run_sync(question, deps=mystery_object).data.answer.value
        if answer == Answer.YOU_WIN:
            print('You Win!')
            break
        elif len(turns) >= 20:
            print('You Lose!')
            break
        else:
            turns.append(f'{question} - {answer}')
            print(f'{len(turns)}. QUESTION: {question}\nANSWER: {answer}\n')


twenty_questions('a cat')

output:

1. QUESTION: Is it something commonly found indoors?
ANSWER: yes

2. QUESTION: Does it use electricity?
ANSWER: no

3. QUESTION: Is it used for storage?
ANSWER: no

4. QUESTION: Is it used for entertainment purposes?
ANSWER: no

5. QUESTION: Is it used for cleaning?
ANSWER: no

6. QUESTION: Is it a piece of furniture?
ANSWER: no

7. QUESTION: Is it used for writing or drawing?
ANSWER: no

8. QUESTION: Is it used for personal grooming or hygiene?
ANSWER: no

9. QUESTION: Is it used in the kitchen?
ANSWER: no

10. QUESTION: Is it related to health or safety?
ANSWER: no

11. QUESTION: Is it used for decoration?
ANSWER: no

12. QUESTION: Is it used for organizing?
ANSWER: no

13. QUESTION: Is it used for communication?
ANSWER: no

14. QUESTION: Is it used for comfort or relaxation?
ANSWER: yes

15. QUESTION: Is it something you can wear indoors?
ANSWER: no

16. QUESTION: Is it something you can sit or lie on?
ANSWER: no

17. QUESTION: Is it something you can hold or carry? 
ANSWER: yes

18. QUESTION: Is it a textile item like a pillow or a blanket?
ANSWER: no

19. QUESTION: Is it used to provide warmth?
ANSWER: no

20. QUESTION: Is it something you use to hold or support things?
ANSWER: no

You Lose!

@intellectronica (Author) commented Nov 21, 2024 via email

@samuelcolvin (Member)

Thanks, yup I'll look into it.

@sydney-runkle added the "bug" label on Dec 5, 2024
@samuelcolvin removed the "bug" label on Dec 18, 2024
@samuelcolvin (Member)

@sydney-runkle we should add strict to model settings.

@sydney-runkle (Contributor)

@samuelcolvin, strict in what sense? Like for pydantic validation?

@intellectronica (Author) commented Dec 20, 2024 via email

@intellectronica (Author)

Here's how I make it work, with a subclass of OpenAIModel and a redefinition of the static method that prepares the function calls: https://gist.github.com/intellectronica/9b190aca94bf4372c4b08e8b016922ec

Not sure how this could be passed via model_settings, since they're not available at this point.
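
For anyone else landing here, a minimal sketch of that workaround, assuming the static method is called _map_tool_definition and that the tool definition exposes name, description, and parameters_json_schema; those names may differ between pydantic-ai versions, so check the OpenAIModel source of the version you have installed:

from pydantic_ai.models.openai import OpenAIModel


class StrictOpenAIModel(OpenAIModel):
    @staticmethod
    def _map_tool_definition(f):  # method name assumed, see note above
        return {
            'type': 'function',
            'function': {
                'name': f.name,
                'description': f.description,
                'strict': True,  # opt in to OpenAI's strict schema enforcement
                'parameters': f.parameters_json_schema,  # attribute name assumed
            },
        }

Note that OpenAI's strict mode also requires the parameters schema itself to list every property as required and to set additionalProperties to false, so the generated JSON schema may need adjusting as well.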

@dmontagu (Contributor) commented Jan 2, 2025

We could change the relevant method from being a staticmethod to being an instance method which has access to the settings. Probably makes sense to do that either way for the sake of overriding.

That said, I'll note that strict mode has some unfortunate limitations — at the very least, it seems to be incompatible with types that have additionalProperties in the JSON schema, such as dict[str, int], even when it isn't at the top level.

Because of that, I don't think we can use strict mode by default, though I do think it should be possible to plumb it through as an opt-in parameter if you don't need/want to use JSON schemas with additionalProperties.

Would it work for you to turn it on always (via model configuration)? If not, it may be possible to make this work on a per-tool basis, though in that case there would be some awkward API decisions to be made. (I.e., do we add a strict option to Tool, despite not necessarily being a "standard" thing across models?)

(I'll also note that we explicitly confirmed that you can get reasonable responses from OpenAI making use of JSON schemas with additionalProperties, at least for appropriate prompts, despite getting an error message about it when you turn strict mode on.)
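
To make the additionalProperties limitation concrete, here is a small sketch of the schema Pydantic generates for a dict[str, int] field; strict mode only accepts "additionalProperties": false, so a schema like this gets rejected:

from pydantic import BaseModel


class Scores(BaseModel):
    # Free-form mappings are described via additionalProperties in JSON schema.
    points: dict[str, int]


print(Scores.model_json_schema())
# Produces (roughly):
# {'properties': {'points': {'additionalProperties': {'type': 'integer'},
#                            'title': 'Points', 'type': 'object'}},
#  'required': ['points'], 'title': 'Scores', 'type': 'object'}
# OpenAI strict mode rejects this, since it only allows 'additionalProperties': false.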

@intellectronica (Author)

  • Agree that strict shouldn't be the default.
  • I think it's worth having it available as a setting for Agent, Tool, and Run, though, for models that support it. In my experience it can provide meaningful improvements in reliability, resulting in code that is focused on the task, rather than on guardrails, checks, and retries. In my own work and when advising other developers I tend to start without strict but introduce it (including making the necessary changes to the structure) if generating the desired structure proves unreliable. It is also useful in very small projects where you're not going to invest in comprehensive evals and just want to get a result you can rely on.
  • Given that it's quite easy to get it to work with just a little bit of custom code I don't think having it as a built-in setting is high priority.

@dmontagu (Contributor) commented Jan 3, 2025

I think it's a reasonable idea to add strict as a field on ModelSettings (which will work for Agent and Agent.run) and as a keyword argument to Agent.tool.

Note that we'll probably need to add some way to do error handling if we get an error response from the model due to it failing (server-side) to generate compatible JSON. I'm not sure what such errors will look like in general, and obviously it's likely to be somewhat model-specific. (Potentially closely related, we should add better error handling of "refusal" responses.) I don't think that necessarily needs to be a requirement in an initial implementation of this functionality though.

PR welcome, or we'll get to it eventually.
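
In the meantime, callers can at least catch the failure shown in the original traceback themselves; a small sketch using the asking_agent defined earlier in this thread:

from pydantic_ai.exceptions import UnexpectedModelBehavior

try:
    result = asking_agent.run_sync('Ask the next question', deps=[])
except UnexpectedModelBehavior as exc:
    # Raised once the model has exhausted its retries producing
    # arguments that fail result validation (as in the traceback above).
    print(f'Giving up: {exc}')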
