Add new paper: #51

Open
wyzh0912 opened this issue Feb 23, 2025 · 0 comments

Title

The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It

Published Date

2025-02-17

Source

arXiv

Head Name

Consistency head

Summary

  • Innovation: The paper investigates how LLMs detect arithmetic errors by identifying the computational subgraphs, or circuits, responsible for that detection. It highlights a structural dissociation between arithmetic computation and validation within these models, suggesting that this separation contributes to the models' difficulties in error detection.

  • Tasks: The study applies edge attribution patching to identify the circuits in LLMs that detect arithmetic errors. It generates controlled arithmetic prompts, some correct and some containing intentional errors, to examine how different parts of the model contribute to error detection.

  • Significant Result: The research finds that error detection circuits are structurally similar across different models and are primarily governed by attention heads termed consistency heads, which focus on surface-level alignment of numerical values. The study also shows that integrating latent activations from higher layers into lower layers can enhance models' error detection capabilities, effectively closing the validation gap.
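
The controlled prompt generation described under Tasks could be sketched roughly as follows. This is only an illustration, not the paper's actual data pipeline: the function name `make_prompt_pair`, the prompt template, the operand range, and the error offsets are all assumptions made for this example.

```python
import random


def make_prompt_pair(rng, low=1, high=9):
    """Return (correct_prompt, erroneous_prompt) for one addition problem.

    Hypothetical helper: the template and the offsets are illustrative
    assumptions, not taken from the paper.
    """
    a, b = rng.randint(low, high), rng.randint(low, high)
    correct = a + b
    # A nonzero offset guarantees the second prompt states a wrong result.
    wrong = correct + rng.choice([-2, -1, 1, 2])
    template = "{a} + {b} = {r}"
    return (
        template.format(a=a, b=b, r=correct),
        template.format(a=a, b=b, r=wrong),
    )


rng = random.Random(0)  # fixed seed so the generated set is reproducible
for clean, corrupted in (make_prompt_pair(rng) for _ in range(3)):
    print(clean, "|", corrupted)
```

Each pair differs only in the stated result, which is the kind of minimal contrast that patching-based circuit analysis typically relies on.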
