This repo is a fork of WebVoyager, tweaked to evaluate RaccoonAI on the WebVoyager dataset. Check out their repo and paper.
-
The WebVoyager dataset had quite a few time-sensitive tasks (like booking flights or finding hotels). We updated these with current dates and references. You can see exactly what changed in the commit.
-
The original WebVoyager evaluation is basic and binary, just success or failure. It doesn’t reward agents for clever navigation or penalize them for getting lost or taking too much time. The tasks aren’t super complex, so this kind of scoring makes sense for this dataset.
-
That’s why we’re building something way better: actbench.
It's a framework for systematically evaluating web automation agents and LAMs, providing a much more detailed report card. We’re working on a dataset with real-world tasks of all difficulty levels, across all sorts of websites. This will help us assess agents from multiple angles, like speed, accuracy, efficiency, cost, and more. We’re also designing a much more nuanced scoring system.
It’s still a work in progress, and if you’d like to collaborate, check it out here.
-
WebVoyager's original evaluation used GPT-4V to analyze screenshots. We ran a small experiment with 20 tasks and found that the results were pretty much the same whether we included screenshots or not (using GPT-4o). Since our system rarely lands on completely wrong websites (especially when they're pre-specified), sending screenshots just slowed things down and added unnecessary cost, so we evaluate without them.
We have yet to run the full benchmark, so some details might change. We’ll publish those sample results and the full benchmark data soon.
- Copy the example env and fill in the required secrets:
cp .env.example .env
- Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate
- Run a test:
python run.py -n 5
- Run the full benchmark:
python run.py
Parallelism can be controlled with -p (default: 10). -n controls the number of tasks (default: all). -t defines the dataset file (default: data/WebVoyager_data.jsonl).