WebVoyager Eval for Raccoon AI

This repo is a fork of WebVoyager, tweaked to evaluate Raccoon AI on the WebVoyager dataset. Check out the original repo and paper.

A Few Things

  • The WebVoyager dataset had quite a few time-sensitive tasks (like booking flights or finding hotels). We updated these with current dates and references. You can see exactly what changed in the commit.

  • The original WebVoyager evaluation is basic and binary, just success or failure. It doesn’t reward agents for clever navigation or penalize them for getting lost or taking too much time. The tasks aren’t super complex, so this kind of scoring makes sense for this dataset.

  • That’s why we’re building something way better: actbench.
    It's a framework for systematically evaluating web automation agents and LAMs, providing a much more detailed report card.

    We’re working on a dataset of real-world tasks at all difficulty levels, across all sorts of websites. This will let us assess agents from multiple angles, such as speed, accuracy, efficiency, and cost. We’re also designing a much more nuanced scoring system.

    It’s still a work in progress, and if you’d like to collaborate, check it out here.

  • WebVoyager's original evaluation used GPT-4V to analyze screenshots. We ran a small experiment with 20 tasks and found that the results were pretty much the same whether we included screenshots or not (using GPT-4o). Since our system rarely lands on completely wrong websites (especially when they're pre-specified), sending screenshots just slowed things down and added unnecessary costs.

    We have yet to run the full benchmark, so some of this might change; we’ll publish those sample results and the full benchmark data soon.

Running this Benchmark

  1. Copy the example env and fill in the required secrets:
     cp .env.example .env
  2. Create and activate a virtual environment:
     python -m venv .venv
     source .venv/bin/activate
  3. Run a test:
     python run.py -n 5
  4. Run the full benchmark:
     python run.py

Parallelism can be controlled with -p (default: 10). -n controls the number of tasks (default: all). -t defines the dataset file (default: data/WebVoyager_data.jsonl).
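
For example, assuming the flags can be combined as described above (the numbers here are purely illustrative), a partial run of the first 50 tasks with 5 parallel workers against the default dataset would look like:

python run.py -n 50 -p 5 -t data/WebVoyager_data.jsonl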
