The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It
This repository contains the code for the paper: Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank and Raffaella Bernardi (2025). The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It.
Abstract: The ability of large language models (LLMs) to validate their output and identify potential errors is crucial for ensuring robustness and reliability. However, current research indicates that LLMs struggle with self-correction, encountering significant challenges in detecting errors. While studies have explored methods to enhance self-correction in LLMs, relatively little attention has been given to understanding the models' internal mechanisms underlying error detection. In this paper, we present a mechanistic analysis of error detection in LLMs, focusing on simple arithmetic problems. Through circuit analysis, we identify the computational subgraphs responsible for detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal that all models heavily rely on consistency heads--attention heads that assess surface-level alignment of numerical values in arithmetic solutions. Moreover, we observe that the models' internal arithmetic computation primarily occurs in higher layers, whereas validation takes place in middle layers, before the final arithmetic results are fully encoded. This structural dissociation between arithmetic computation and validation seems to explain why current LLMs struggle to detect even simple arithmetic errors.
The repository is organized as follows:
- llm_error_detection/: Core library containing all source code for experiments
- scripts/: Python scripts for running individual experiments
- bash_scripts/: Shell scripts that wrap Python scripts for easier execution
- data/: Generated datasets (created during experiments)
- results/: Experimental results and outputs (created during experiments)
  - discovered-circuits/: Circuit analysis results
  - attention_analysis/: Attention pattern analysis results
  - probing/: Probing experiment results
  - And other experiment-specific subdirectories
Important Note: Most bash scripts require setting the CACHE_DIR variable at the beginning of the script. This directory is used for storing downloaded model weights. Example:
# Set at the beginning of bash scripts
CACHE_DIR="/path/to/your/cache/directory"
This project uses the TransformerLens library and an adapted version of the Auto-Circuit library.
All code was developed and tested on Ubuntu 22.04 with Python 3.11.6.
To run the code, we recommend using Poetry:
poetry install # Install dependencies
poetry shell # Activate virtual environment
# Work for a while
deactivate
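Once the environment is installed, a quick way to check that model loading works is to instantiate one of the models used in the paper as a TransformerLens HookedTransformer. This is only a minimal sanity-check sketch, not part of the repository's scripts:

from transformer_lens import HookedTransformer

# Load one of the supported models; weights are downloaded to the Hugging Face
# cache (the bash scripts point this to CACHE_DIR).
model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
print(model.cfg.n_layers, model.cfg.n_heads)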
To generate the mathematical reasoning datasets for different models, use:
# Run directly through bash
./bash_scripts/data_generation.sh
You can modify the following variables in the script to control data generation:
# Available templates
TEMPLATES=("0" "1" "2" "3" "4" "5" "6" "7")
# Supported models
MODELS=("meta-llama/Llama-3.2-3B-Instruct"
"microsoft/Phi-3-mini-4k-instruct"
"Qwen/Qwen2.5-1.5B-Instruct"
"Qwen/Qwen2.5-Math-1.5B-Instruct")
Note: This data generation step is necessary for all further experiments.
To run the experiments, execute the following scripts in the specified order.
To discover the circuits, together with the corresponding plots, for detecting errors at the level of arithmetic results and numeric answers, as well as for the computation task, run:
# Run directly through bash
./bash_scripts/circuit_discovery.sh
The script will identify circuits for each model and template and save them in results/discovered-circuits/tokenwise/.
After obtaining individual circuits for each template, you can generate soft intersection circuits and evaluate their faithfulness scores for different overlap thresholds by running:
# Run directly through bash
./bash_scripts/eval_soft_intersection.sh
The script will save plots for each soft intersection circuit and their faithfulness scores in results/discovered-circuits/tokenwise/template_intersection.
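Conceptually, a soft intersection keeps an edge if it appears in at least a given fraction of the per-template circuits. The sketch below only illustrates this idea; it is not the repository's implementation, and the edge representation is a placeholder:

from collections import Counter

def soft_intersection(template_circuits, threshold):
    """template_circuits: list of edge sets, one per template; threshold in [0, 1]."""
    counts = Counter(edge for circuit in template_circuits for edge in circuit)
    # Keep edges present in at least `threshold` of the per-template circuits
    return {edge for edge, n in counts.items() if n >= threshold * len(template_circuits)}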
To evaluate the base models on the error detection and computation tasks, run:
# Run directly through bash
./bash_scripts/baseline_accuracy.sh
To evaluate the IoU (Intersection over Union) and IoM (Intersection over Minimum) of the identified circuits and produce the corresponding plots, run:
# Run directly through bash
./bash_scripts/overlap_edges.sh
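Both metrics compare two circuits viewed as sets of edges; for reference, the definitions amount to:

def iou(a, b):
    # Intersection over Union: |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

def iom(a, b):
    # Intersection over Minimum: |A ∩ B| / min(|A|, |B|)
    return len(a & b) / min(len(a), len(b))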
The attention patterns on prompts with different types of errors can be visualized by running:
# Run directly through bash
./bash_scripts/interpret_attn.sh
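If you want to inspect attention patterns interactively rather than through the script, TransformerLens exposes them via the activation cache. A minimal sketch follows; the prompt, layer, and head indices are illustrative placeholders, not the consistency heads identified in the paper:

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
tokens = model.to_tokens("7 + 5 = 13")  # illustrative erroneous solution
_, cache = model.run_with_cache(tokens)

layer, head = 10, 4  # placeholder indices
pattern = cache["pattern", layer][0, head]  # [query_pos, key_pos] attention weights
print(pattern)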
To replicate the consistency head patching experiment, run the bash script:
# Run directly through bash
./bash_scripts/intervene_attn.sh
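The intervention in this experiment is a form of activation patching on attention heads. The following is a conceptual sketch of patching a single head's output from one run into another with TransformerLens hooks; the prompts and the layer/head indices are placeholders (see intervene_attn.sh for the actual experiment):

import transformer_lens.utils as utils
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
clean_tokens = model.to_tokens("7 + 5 = 12")      # consistent solution (illustrative)
corrupted_tokens = model.to_tokens("7 + 5 = 13")  # inconsistent solution (illustrative)

layer, head = 10, 4  # placeholder indices
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_head_output(z, hook):
    # z has shape [batch, pos, head_index, d_head]; overwrite one head's output
    z[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
    return z

patched_logits = model.run_with_hooks(
    corrupted_tokens,
    fwd_hooks=[(utils.get_act_name("z", layer), patch_head_output)],
)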
To replicate the probing experiment, run the script:
# Run directly through bash
./bash_scripts/probing.sh
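For orientation, a probing experiment of this kind fits a linear classifier on internal activations. The sketch below uses random placeholder data and scikit-learn; it only illustrates the shape of such an experiment, not the repository's setup:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: rows stand in for residual-stream activations at a fixed
# layer/position; labels indicate e.g. whether the arithmetic result is correct.
X = np.random.randn(200, 2048)
y = np.random.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))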
To replicate the "bridge" the validation gap experiment, run:
# Run directly through bash
./bash_scripts/intervene_residual.sh
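This intervention acts on the residual stream rather than on individual attention heads. The snippet below is a conceptual sketch of patching a residual-stream activation from one run into another at a chosen layer and token position; the layer, position, and prompts are placeholders, and the actual procedure is implemented in intervene_residual.sh:

import transformer_lens.utils as utils
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
source_tokens = model.to_tokens("7 + 5 = 12")  # illustrative source prompt
target_tokens = model.to_tokens("7 + 5 = 13")  # illustrative target prompt

layer, pos = 20, -1  # placeholder layer and token position
_, source_cache = model.run_with_cache(source_tokens)

def patch_resid(resid, hook):
    # resid has shape [batch, pos, d_model]; overwrite one position's residual stream
    resid[:, pos, :] = source_cache[hook.name][:, pos, :]
    return resid

patched_logits = model.run_with_hooks(
    target_tokens,
    fwd_hooks=[(utils.get_act_name("resid_post", layer), patch_resid)],
)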
This work is licensed under a CC BY-SA 4.0 license.
If you find our work helpful, cite this paper as:
@misc{bertolazzi2025validationgapmechanisticanalysis,
title={The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It},
author={Leonardo Bertolazzi and Philipp Mondorf and Barbara Plank and Raffaella Bernardi},
year={2025},
eprint={2502.11771},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.11771},
}