The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It
This repository contains the code for the paper: Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank and Raffaella Bernardi (2025). The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It.
Abstract: The ability of large language models (LLMs) to validate their output and identify potential errors is crucial for ensuring robustness and reliability. However, current research indicates that LLMs struggle with self-correction, encountering significant challenges in detecting errors. While studies have explored methods to enhance self-correction in LLMs, relatively little attention has been given to understanding the models' internal mechanisms underlying error detection. In this paper, we present a mechanistic analysis of error detection in LLMs, focusing on simple arithmetic problems. Through circuit analysis, we identify the computational subgraphs responsible for detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal that all models heavily rely on consistency heads--attention heads that assess surface-level alignment of numerical values in arithmetic solutions. Moreover, we observe that the models' internal arithmetic computation primarily occurs in higher layers, whereas validation takes place in middle layers, before the final arithmetic results are fully encoded. This structural dissociation between arithmetic computation and validation seems to explain why current LLMs struggle to detect even simple arithmetic errors.
The repository is organized as follows:
- llm_error_detection/: Core library containing all source code for experiments
- scripts/: Python scripts for running individual experiments
- bash_scripts/: Shell scripts that wrap Python scripts for easier execution
- data/: Generated datasets (created during experiments)
- results/: Experimental results and outputs (created during experiments)
  - discovered-circuits/: Circuit analysis results
  - attention_analysis/: Attention pattern analysis results
  - probing/: Probing experiment results
  - And other experiment-specific subdirectories
Important Note: Most bash scripts require setting the CACHE_DIR variable at the beginning of the script. This directory is used for storing downloaded model weights. Example:
# Set at the beginning of bash scripts
CACHE_DIR="/path/to/your/cache/directory"
This project uses the TransformerLens library and an adapted version of the Auto-Circuit library.
All code was developed and tested on Ubuntu 22.04 with Python 3.11.6.
To run the code, we recommend using Poetry:
poetry install # Install dependencies
poetry shell # Activate virtual environment
# Work for a while
deactivate
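Once the environment is installed, a quick way to check that model loading works is to instantiate one of the models used in the paper as a TransformerLens HookedTransformer. This is only a minimal sanity-check sketch, not part of the repository's scripts:

from transformer_lens import HookedTransformer

# Load one of the supported models; weights are downloaded to the Hugging Face
# cache (the bash scripts point this to CACHE_DIR).
model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
print(model.cfg.n_layers, model.cfg.n_heads)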
To generate the mathematical reasoning datasets for different models, use:
# Run directly through bash
./bash_scripts/data_generation.sh
You can modify the following variables in the script to control data generation:
# Available templates
TEMPLATES=("0" "1" "2" "3" "4" "5" "6" "7")
# Supported models
MODELS=("meta-llama/Llama-3.2-3B-Instruct"
"microsoft/Phi-3-mini-4k-instruct"
"Qwen/Qwen2.5-1.5B-Instruct"
"Qwen/Qwen2.5-Math-1.5B-Instruct")
Note: This data generation step is necessary for all further experiments.
To run the experiments, execute the following scripts in the specified order.
To discover the circuits, together with the corresponding plots, for detecting errors at the level of arithmetic results and numeric answers, as well as for the computation task, run:
# Run directly through bash
./bash_scripts/circuit_discovery.sh
The script will identify circuits for each model and template and save them in results/discovered-circuits/tokenwise/.
After obtaining individual circuits for each template, you can generate soft intersection circuits and evaluate their faithfulness scores for different overlap thresholds by running:
# Run directly through bash
./bash_scripts/eval_soft_intersection.sh
The script will save plots for each soft intersection circuit and their faithfulness scores in results/discovered-circuits/tokenwise/template_intersection.
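Conceptually, a soft intersection keeps an edge if it appears in at least a given fraction of the per-template circuits. The sketch below only illustrates this idea; it is not the repository's implementation, and the edge representation is a placeholder:

from collections import Counter

def soft_intersection(template_circuits, threshold):
    """template_circuits: list of edge sets, one per template; threshold in [0, 1]."""
    counts = Counter(edge for circuit in template_circuits for edge in circuit)
    # Keep edges present in at least `threshold` of the per-template circuits
    return {edge for edge, n in counts.items() if n >= threshold * len(template_circuits)}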
To evaluate the base models on the error detection and computation tasks, run:
# Run directly through bash
./bash_scripts/baseline_accuracy.sh
To evaluate the IoU (Intersection over Union) and IoM (Intersection over Minimum) of the identified circuits and produce the corresponding plots, run:
# Run directly through bash
./bash_scripts/overlap_edges.sh
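Both metrics compare two circuits viewed as sets of edges; for reference, the definitions amount to:

def iou(a, b):
    # Intersection over Union: |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

def iom(a, b):
    # Intersection over Minimum: |A ∩ B| / min(|A|, |B|)
    return len(a & b) / min(len(a), len(b))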
The attention patterns on prompts with different types of errors can be visualized by running:
# Run directly through bash
./bash_scripts/interpret_attn.sh
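If you want to inspect attention patterns interactively rather than through the script, TransformerLens exposes them via the activation cache. A minimal sketch follows; the prompt, layer, and head indices are illustrative placeholders, not the consistency heads identified in the paper:

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokens = model.to_tokens("7 + 5 = 13")  # illustrative erroneous solution
_, cache = model.run_with_cache(tokens)

layer, head = 10, 4  # placeholder indices
pattern = cache["pattern", layer][0, head]  # [query_pos, key_pos] attention weights
print(pattern)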
To replicate the consistency head patching experiment, run the bash script:
# Run directly through bash
./bash_scripts/intervene_attn.sh
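The intervention in this experiment is a form of activation patching on attention heads. The following is a conceptual sketch of patching a single head's output from one run into another with TransformerLens hooks; the prompts and the layer/head indices are placeholders (see intervene_attn.sh for the actual experiment):

import transformer_lens.utils as utils
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
clean_tokens = model.to_tokens("7 + 5 = 12")      # consistent solution (illustrative)
corrupted_tokens = model.to_tokens("7 + 5 = 13")  # inconsistent solution (illustrative)

layer, head = 10, 4  # placeholder indices
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_head_output(z, hook):
    # z has shape [batch, pos, head_index, d_head]; overwrite one head's output
    z[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
    return z

patched_logits = model.run_with_hooks(
    corrupted_tokens,
    fwd_hooks=[(utils.get_act_name("z", layer), patch_head_output)],
)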
To replicate the probing experiment, run the script:
# Run directly through bash
./bash_scripts/probing.sh
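For orientation, a probing experiment of this kind fits a linear classifier on internal activations. The sketch below uses random placeholder data and scikit-learn; it only illustrates the shape of such an experiment, not the repository's setup:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: rows stand in for residual-stream activations at a fixed
# layer/position; labels indicate e.g. whether the arithmetic result is correct.
X = np.random.randn(200, 2048)
y = np.random.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))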
To replicate the "bridge" the validation gap experiment, run:
# Run directly through bash
./bash_scripts/intervene_residual.sh
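This intervention acts on the residual stream rather than on individual attention heads. The snippet below is a conceptual sketch of patching a residual-stream activation from one run into another at a chosen layer and token position; the layer, position, and prompts are placeholders, and the actual procedure is implemented in intervene_residual.sh:

import transformer_lens.utils as utils
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
source_tokens = model.to_tokens("7 + 5 = 12")  # illustrative source prompt
target_tokens = model.to_tokens("7 + 5 = 13")  # illustrative target prompt

layer, pos = 20, -1  # placeholder layer and token position
_, source_cache = model.run_with_cache(source_tokens)

def patch_resid(resid, hook):
    # resid has shape [batch, pos, d_model]; overwrite one position's residual stream
    resid[:, pos, :] = source_cache[hook.name][:, pos, :]
    return resid

patched_logits = model.run_with_hooks(
    target_tokens,
    fwd_hooks=[(utils.get_act_name("resid_post", layer), patch_resid)],
)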
This work is licensed under a CC BY-SA 4.0 license.
If you find our work helpful, cite this paper as:
@misc{bertolazzi2025validationgapmechanisticanalysis,
title={The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It},
author={Leonardo Bertolazzi and Philipp Mondorf and Barbara Plank and Raffaella Bernardi},
year={2025},
eprint={2502.11771},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.11771},
}