PhantomWiki generates on-demand datasets to evaluate reasoning and retrieval capabilities of LLMs.
First install Prolog on your machine, then install PhantomWiki with pip:
pip install phantom-wiki
Note
This package has been tested with Python 3.12. We require Python 3.10+ to support match statements.
To build from source, you can clone this repository and run pip install . from the repo root.
Generate PhantomWiki datasets with random generation seed 1:
- In Python:
import phantom_wiki as pw
pw.generate_dataset(
    output_dir="/path/to/output",
    seed=1,
    use_multithreading=True,
)
- In a terminal:
phantom-wiki-generate -od "/path/to/output" --seed 1 --use-multithreading
(You can also use the shorthand alias pw-generate.)
Note
We do not support --use-multithreading on macOS yet, so you should skip this flag (or set it to False).
The following generation script creates datasets of various sizes with random generation seed 1:
./data/generate-v1.sh /path/to/output/ 1 --use-multithreading
- Universe sizes 25, 50, 500, ..., 5K, 500K, 1M (number of documents)
- Question template depth 20 (proportional to difficulty)
For example, it executes the following command to generate a size 5K universe (5000 = --max-family-tree-size * --num-family-trees):
pw-generate \
-od /path/to/output/depth_20_size_5000_seed_1 \
--seed 1 \
--question-depth 20 \
--num-family-trees 100 \
--max-family-tree-size 50 \
--max-family-tree-depth 20 \
--article-format json \
--question-format json \
--use-multithreading
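Because --article-format json and --question-format json are set, the generated articles and question-answer pairs are written as JSON files in the output directory. A minimal sketch for loading them with the standard library (the file names articles.json and questions.json are assumptions; check your output directory for the exact names your version writes):

import json

# Hypothetical file names -- list the output directory to confirm what your version produces
output_dir = "/path/to/output/depth_20_size_5000_seed_1"
with open(f"{output_dir}/articles.json") as f:
    articles = json.load(f)
with open(f"{output_dir}/questions.json") as f:
    questions = json.load(f)
print(len(articles), "articles,", len(questions), "questions")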
For ease of development, we provide pre-generated PhantomWiki datasets on HuggingFace (sizes 50, 500, and 5000 with seeds 1, 2, and 3).
from datasets import load_dataset
# Download the document corpus
ds_corpus = load_dataset("kilian-group/phantom-wiki-v1", "text-corpus")
# Download the question-answer pairs
ds_qa = load_dataset("kilian-group/phantom-wiki-v1", "question-answer")
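Both configs are organized into splits by universe size and seed. A quick sketch of peeking at one split (the split name below follows the depth_20_size_<SIZE>_seed_<SEED> pattern used elsewhere in this README and is an assumption; inspect ds_corpus to see the exact split names):

# Assumed split naming convention: depth_20_size_<SIZE>_seed_<SEED>
split = "depth_20_size_50_seed_1"
print(ds_corpus[split][0])  # one article from the corpus
print(ds_qa[split][0])      # one question-answer pair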
PhantomWiki uses the Prolog logic programming language, available on all operating systems through SWI-Prolog. We recommend installing SWI-Prolog through your distribution or through conda, for example:
# On macOS: with homebrew
brew install swi-prolog
# On Linux: with apt
sudo add-apt-repository ppa:swi-prolog/stable
sudo apt-get update
sudo apt-get install swi-prolog
# On Linux: with conda
conda install conda-forge::swi-prolog
# On Windows: download and install binary from https://www.swi-prolog.org/download/stable
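After installing, you can quickly confirm that SWI-Prolog is on your PATH:

# Print the installed SWI-Prolog version
swipl --version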
There are two options:
- (Recommended) Install the package in editable mode using pip:
  pip install -e .
- If you use VSCode, you can add to the Python path without installing the package:
  - Create a file in the repo root called .env
  - Add PYTHONPATH=src
  - Restart VSCode
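For example, from the repo root (assuming a POSIX shell):

# Create the .env file that VSCode's Python extension reads for the workspace
echo "PYTHONPATH=src" > .env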
First, install dependencies and vLLM to match your hardware (GPU, CPU, etc.):
pip install phantom-wiki[eval]
pip install "vllm>=0.6.6"
If you're installing from source, use pip install -e ".[eval]" instead.
Anthropic
- Create an API key at https://console.anthropic.com/settings/keys
- Set your Anthropic API key as an environment variable, or store it in your conda environment:
export ANTHROPIC_API_KEY=xxxxx
# or
conda env config vars set ANTHROPIC_API_KEY=xxxxx
Rate limits: https://docs.anthropic.com/en/api/rate-limits#updated-rate-limits
🚨 The Anthropic API has particularly low rate limits, so it takes longer to get predictions.
Google Gemini
- Create an API key at https://aistudio.google.com/app/apikey
- Set your Gemini API key as an environment variable, or store it in your conda environment:
export GEMINI_API_KEY=xxxxx
# or
conda env config vars set GEMINI_API_KEY=xxxxx
OpenAI
- Create an API key at https://platform.openai.com/settings/organization/api-keys
- Set your OpenAI API key as an environment variable, or store it in your conda environment:
export OPENAI_API_KEY=xxxxx
# or
conda env config vars set OPENAI_API_KEY=xxxxx
TogetherAI
- Register for an account at https://api.together.ai
- Set your TogetherAI API key as an environment variable, or store it in your conda environment:
export TOGETHER_API_KEY=xxxxx
# or
conda env config vars set TOGETHER_API_KEY=xxxxx
vLLM
Original setup instructions: https://docs.vllm.ai/en/stable/getting_started/installation.html#install-the-latest-code
Additional notes:
- It's recommended to download the model manually:
huggingface-cli download MODEL_REPO_ID
The models and their configs are downloaded directly from HuggingFace and almost all models on HF are fair game (see also: https://docs.vllm.ai/en/stable/models/supported_models.html#supported-models)
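For example, to pre-download one of the models evaluated later in this README (any supported HF repo ID works the same way):

# Download the model weights and config into the local HuggingFace cache
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-32B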
Note
For vLLM inference, make sure to request access for Gemma, Llama 3.1, 3.2, and 3.3 models on HuggingFace before proceeding.
🧪 To generate the predictions from an LLM with a prompting METHOD, run the following command:
python -m phantom_eval --method METHOD --server SERVER --model_name MODEL_NAME_OR_PATH --split_list SPLIT_LIST -od OUTPUT_DIRECTORY
We implement lightweight interfaces to the Anthropic, OpenAI, Gemini, and Together APIs, which you can select by setting SERVER to anthropic, openai, gemini, or together, respectively. We also implement an interface to a vllm server to evaluate local LLMs.
Example usages:
- METHOD can be zeroshot, fewshot, cot, react, zeroshot-rag, etc.
- Evaluate GPT-4o through checkpoint names --server openai --model_name gpt-4o-2024-11-20 or with name aliases --server openai --model_name gpt-4o. We pass the model name on to the API, so any LLM name supported by the API is supported by our interface. The same holds for Anthropic, Gemini, and Together.
- Evaluate HuggingFace LLMs through their Model Card name --server vllm --model_name deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, or through a local weights path --server vllm --model_name /absolute/path/to/weights/.
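Putting the pieces together, a sketch of a full invocation (the split name is an assumption; use whichever splits you generated or downloaded):

# Chain-of-thought prompting with GPT-4o on a small split
python -m phantom_eval \
    --method cot \
    --server openai \
    --model_name gpt-4o \
    --split_list depth_20_size_50_seed_1 \
    -od out/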
Tip
To generate a slurm script for Cornell clusters (g2, empire, aida) with the appropriate GPU allocation, run bash eval/create_eval.sh and follow the prompted steps.
📊 To generate the tables and figures, run the following command from the root directory, replacing METHODS with a space-separated list of prompting techniques, e.g. "zeroshot cot zeroshot-rag cot-rag react".
./eval/evaluate.sh OUTPUT_DIRECTORY MODEL_NAME_OR_PATH METHODS
# For local datasets, specify the dataset path and add the --from_local flag
DATASET="/path/to/dataset/" ./eval/evaluate.sh OUTPUT_DIRECTORY MODEL_NAME_OR_PATH METHODS --from_local
Here, OUTPUT_DIRECTORY is the same as when generating the predictions. This script will create the following subdirectories in OUTPUT_DIRECTORY: scores/ and figures/.
@article{gong2025phantomwiki,
title={{PhantomWiki}: On-Demand Datasets for Reasoning and Retrieval Evaluation},
author={Gong, Albert and Stankevi{\v{c}}i{\=u}t{\.e}, Kamil{\.e} and Wan, Chao and Kabra, Anmol and Thesmar, Raphael and Lee, Johann and Klenke, Julius and Gomes, Carla P and Weinberger, Kilian Q},
journal={arXiv preprint arXiv:2502.20377},
year={2025}
}