Fix tests/test_examples_run #436

Open
vazirim opened this issue Feb 17, 2025 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@vazirim
Member

vazirim commented Feb 17, 2025

Describe the bug
The test test_examples_run is currently failing due to non-determinism (litellm is not passing temperature: 0 to Replicate).

We need to figure out whether moving away from Replicate (e.g. to Ollama) can help, and how to run the nightly test as a GitHub Action (perhaps with watsonx instead).

To Reproduce
Run that test.
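
One way to confirm the non-determinism independently of the test suite is to run the same generation several times and count the distinct outputs. This is only a minimal sketch: `distinct_outputs` and `is_deterministic` are hypothetical helper names, and `generate` stands for whatever runs one example end to end (that wiring is not shown, and a stub is used here instead of a real model call).

```python
import itertools
from collections import Counter
from typing import Callable


def distinct_outputs(generate: Callable[[], str], runs: int = 5) -> Counter:
    """Call `generate` `runs` times and count each distinct output."""
    return Counter(generate() for _ in range(runs))


def is_deterministic(generate: Callable[[], str], runs: int = 5) -> bool:
    """True if every run produced exactly the same output."""
    return len(distinct_outputs(generate, runs)) == 1


# Stubs standing in for a model call (assumption: no real backend is used here).
stable = lambda: "Hello"
_replies = itertools.cycle(["Hello", "Hello! How can I help you today?"])
flaky = lambda: next(_replies)

print(is_deterministic(stable))  # True
print(is_deterministic(flaky))   # False
```

If a backend still produces more than one distinct output with temperature: 0 set in the program, the setting is probably not reaching the provider.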


@vazirim vazirim added the bug Something isn't working label Feb 17, 2025
@jgchn
Collaborator

jgchn commented Feb 18, 2025

I ran most of the valid examples in test_examples_run and compared the results between Replicate (granite-3.1-8b-instruct) and Ollama (granite-code:8b). I noticed that Ollama sometimes returns shorter answers than Replicate, but they seem mostly correct.

For example:

examples/fibonacci/fib.pdl

Replicate

(skipping over some details because the output is very long)

Now computing fibonacci(17)

def fibonacci(n: int) -> int:
    if n <= 0:
        return "Input should be a positive integer."
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)
The result is: 987
Ollama

(skipping over some details because the output is very long)

Find a random number between 1 and 20
11
Now computing fibonacci(11)

def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)
The result is: 89

Note: the granite model used on Replicate actually produced incorrect code under the usual indexing (with F(1) = F(2) = 1, fib(17) = 1597, not 987). The model on Ollama produced correct results. I wonder if the difference is because granite-code is optimized for coding tasks.
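
As a quick check of the numbers above, here is a minimal sketch that runs both generated functions side by side. The names `fib_replicate` and `fib_ollama` are only labels to tell the two model outputs apart; the bodies are copied from the outputs above (the `-> int` annotation is dropped because the first function can also return a string).

```python
def fib_replicate(n):
    # Body as generated by granite-3.1-8b-instruct on Replicate.
    if n <= 0:
        return "Input should be a positive integer."
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        return fib_replicate(n - 1) + fib_replicate(n - 2)


def fib_ollama(n):
    # Body as generated by granite-code:8b on Ollama.
    if n <= 1:
        return n
    else:
        return fib_ollama(n - 1) + fib_ollama(n - 2)


print(fib_replicate(17))  # 987
print(fib_ollama(17))     # 1597
print(fib_ollama(11))     # 89
```

The Replicate version starts the sequence at fibonacci(1) = 0, so it is the Ollama version shifted by one index: `fib_replicate(n) == fib_ollama(n - 1)` for n >= 1.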

examples/hello/hello-model-chaining.pdl

  • Replicate:

    Hello
    Hello
    Did you say Hello?
    Yes, I did
    
  • Ollama:

    Hello
    Hello! How can I help you today?
    
    Did you say Hello! How can I help you today?
    ?
    

Note: Pretty similar in terms of output.

examples/talk/4-function.pdl

  • Replicate:

    The sentence 'I love Paris!' translates to 'Je t'aime Paris!' in French.
    The sentence 'I love Madrid!' translates to 'Amo Madrid!' in Spanish.
    
  • Ollama:

    Je suis adepte de Paris!'
    
    'Te quiero mucho Madrid!'
    

Note: Again, the outputs are similar.


Next, I looked at some of the files that were labeled as Non-Deterministic.

examples/react/demo.pdl:

  • Replicate: the output is long so I will not paste it here, but the model correctly used the ReAct pattern and arrived at the right answer.
  • Ollama: with temperature: 0 set in the program, we get the output below. The result is incorrect, and the model does not make use of the tools provided.
How many years ago was the discoverer of the Hudson River born? Keep in mind we are in 2025.
The discoverer of the Hudson River was not born in 2025. The Hudson River was discovered in 1697 by a group of Dutch settlers led by John Smith. The river is located in present-day New York City and has been an important transportation route for centuries.
Action:
<tool_call>[{"name": "Finish", "arguments": {"topic": "1697"}}]

examples/code/code.pdl

  • Replicate
    This Java method, `deserializeOffsetMap`, is part of the `OffsetUtil` class in the `streamsets/datacollector` repository. It's designed to convert a JSON string, `lastSourceOffset`, into a `Map<String, String>`.
    
    Here's a breakdown of the code:
    
    1. The method is annotated with `@SuppressWarnings("unchecked")`, which suppresses a compiler warning about unchecked or unsafe operations. This is because the `readValue` method of `JSON_MAPPER` can return a `Map` of any type, but we're casting it to `Map<String, String>`.
    
    2. The method takes a single parameter, `lastSourceOffset`, which is expected to be a JSON string representing a map.
    
    3. It initializes a `Map<String, String>` variable named `offsetMap`.
    
    4. If `lastSourceOffset` is `null` or empty, it creates a new `HashMap` and assigns it to `offsetMap`.
    
    5. If `lastSourceOffset` is not `null` or empty, it uses `JSON_MAPPER.readValue(lastSourceOffset, Map.class)` to deserialize the JSON string into a `Map`. The `Map.class` argument tells the `readValue` method to deserialize the JSON into a `Map` of any key and value types.
    
    6. Finally, it returns the `offsetMap`.
    
    In summary, this method converts a JSON string representing a map into a `Map<String, String>`. If the input string is `null` or empty, it returns an empty map.
    
  • Ollama
    The code is a Java method that takes a string `lastSourceOffset` as input and returns a `Map<String, String>`. The method uses the Jackson library to deserialize the JSON-formatted string into a map. If the input string is empty or null, an empty HashMap is returned. Otherwise, the string is deserialized into a Map using the `JSON_MAPPER.readValue()` method.
    

Note: Replicate produced an elaborate explanation while Ollama was very concise.

examples/code/code-eval.pdl / examples/tutorial/data_block.pdl

Note: Replicate and Ollama yielded different similarity scores/metrics.

I was able to test this far before reaching the Replicate API limit. Do we care about correctness for test_examples_run or are we only concerned with ensuring that valid programs can be run? @vazirim @mandel
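
To make the two options concrete, here is a minimal sketch of what each check could look like. Everything in it is hypothetical (`ExampleResult`, `check`, and the `strict` flag are not taken from the existing test); it only illustrates the difference between "the program ran" and "the program produced the expected output".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExampleResult:
    name: str                    # path of the example, e.g. a .pdl file
    output: Optional[str]        # None if the program did not finish
    error: Optional[str] = None  # error message if the run failed


def check(result: ExampleResult, expected: str, strict: bool) -> bool:
    """strict=False: pass if the example ran; strict=True: also compare output."""
    if result.error is not None or result.output is None:
        return False
    if not strict:
        return True
    return result.output == expected


ran = ExampleResult("examples/hello/hello-model-chaining.pdl",
                    "Hello\nHello! How can I help you today?")
print(check(ran, expected="Hello\nHello", strict=False))  # True
print(check(ran, expected="Hello\nHello", strict=True))   # False
```

With the lenient mode, switching between Replicate and Ollama would not break the test; with the strict mode, the expected outputs would have to be regenerated per backend.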

@jgchn
Collaborator

jgchn commented Feb 18, 2025

Also linking this blog post that @starpit shared on integrating Ollama with GitHub Actions.
