WebVoyager Eval for Raccoon AI

This repo is a fork of WebVoyager, tweaked to evaluate Raccoon AI on the WebVoyager dataset. Check out the original repo and paper.

A Few Things

  • The WebVoyager dataset had quite a few time-sensitive tasks (like booking flights or finding hotels). We updated these with current dates and references. You can see exactly what changed in the commit.

  • The original WebVoyager evaluation is basic and binary, just success or failure. It doesn’t reward agents for clever navigation or penalize them for getting lost or taking too much time. The tasks aren’t super complex, so this kind of scoring makes sense for this dataset.

  • That’s why we’re building something way better: actbench.
    It's a framework for systematically evaluating web automation agents and LAMs, providing a much more detailed report card.

    We’re working on a dataset of real-world tasks at all difficulty levels, across all sorts of websites. This will let us assess agents from multiple angles, such as speed, accuracy, efficiency, and cost. We’re also designing a much more nuanced scoring system.

    It’s still a work in progress, and if you’d like to collaborate, check it out here.

  • WebVoyager's original evaluation used GPT-4V to analyze screenshots. We ran a small experiment with 20 tasks and found that the results were pretty much the same whether we included screenshots or not (using GPT-4o). Since our system rarely lands on completely wrong websites (especially when they're pre-specified), sending screenshots just slowed things down and added unnecessary costs.

    We have yet to run the full benchmark, so some of this might change; we’ll publish those sample results and the full benchmark data soon.

Running this Benchmark

  1. Copy the example env and fill in the required secrets:
     cp .env.example .env
  2. Create and activate a virtual environment:
     python -m venv .venv
     source .venv/bin/activate
  3. Run a test:
     python run.py -n 5
  4. Run the full benchmark:
     python run.py

Parallelism can be controlled with -p (default: 10). -n controls the number of tasks (default: all). -t defines the dataset file (default: data/WebVoyager_data.jsonl).
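
For example, assuming the flags can be combined as described above (the numbers here are purely illustrative), a partial run of the first 50 tasks with 5 parallel workers against the default dataset would look like:

python run.py -n 50 -p 5 -t data/WebVoyager_data.jsonl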
