Preprint: Tutek, M., Chaleshtori, F. H., Marasović, A., & Belinkov, Y. (2025). Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps. [arXiv]
The codebase is provided as-is; detailed instructions are pending.

The main file for running experiments is unlearn.py. The NPO method has been adapted from the original repository.

Sample run script:

```bash
python unlearn.py --model_name meta-llama/Llama-3.2-3B-Instruct --strategy sentencize --stepwise --dataset sqa --lr 3e-05 --pos --ff2 --method npo_KL
```
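For orientation, here is a minimal sketch of the NPO forget objective (Zhang et al., 2024) that the npo_KL method builds on; the function names, the beta value, and the retain-side KL weighting are illustrative assumptions, not the exact code adapted from the original repository.

```python
import torch
import torch.nn.functional as F

def npo_forget_loss(policy_logps: torch.Tensor,
                    ref_logps: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """NPO loss on forget sequences (Zhang et al., 2024).

    policy_logps / ref_logps: per-sequence summed log-probabilities of the
    forget targets under the current model and a frozen reference model.
    """
    # log(1 + (pi/pi_ref)^beta) == softplus(beta * (logp - logp_ref))
    log_ratio = policy_logps - ref_logps
    return (2.0 / beta) * F.softplus(beta * log_ratio).mean()

def npo_kl_loss(policy_logps: torch.Tensor,
                ref_logps: torch.Tensor,
                retain_kl: torch.Tensor,
                beta: float = 0.1,
                alpha: float = 1.0) -> torch.Tensor:
    # npo_KL variant: NPO on the forget set plus a KL regularizer computed
    # on retain data between the current and the reference model.
    return npo_forget_loss(policy_logps, ref_logps, beta) + alpha * retain_kl
```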
To recompute results, you need the final and ablation result files (results, ablations), which are too large to share via git. Please send an email to me [here] and I'll share the Google Drive links with you.
Adding mistakes (Lanham et al., 2023)
We reuse the prompts from Lanham et al. (2023) to add mistakes into CoT steps. A reproduction of this setup with GPT-4o-mini can be found in Adding mistakes repro; the minimal results are in minimal_mistake_results.
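A rough sketch of the insertion step, assuming the openai Python client; the prompt below is an illustrative paraphrase, not the exact Lanham et al. (2023) prompt used in the notebook.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative paraphrase of the mistake-insertion instruction; the exact
# prompt reused from Lanham et al. (2023) lives in the repro notebook.
PROMPT = (
    "Here is one step of a chain-of-thought:\n\n{step}\n\n"
    "Rewrite this step so that it contains a plausible-looking mistake, "
    "keeping the style and length of the original. Return only the rewritten step."
)

def add_mistake(step: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=1.0,
        messages=[{"role": "user", "content": PROMPT.format(step=step)}],
    )
    return response.choices[0].message.content.strip()
```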
The annotation study data files, including all the per-model-dataset bins, can be found in annotation_data. The code used to select instances for the study is in Generate_annotation_data.ipynb.
The full results of the annotation study can be found in annotation_results. The follow-up analysis can be found in Annotation analysis.ipynb.
The code that uses GPT-4o as a judge of whether a CoT argues for a different answer before vs. after unlearning can be found in CoT LLM as judge.ipynb. The LM judgements, along with the single-sentence explanations (which were not analysed in the paper), are in LM_judge_cot.
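A minimal sketch of the judging call; the judge prompt and the output parsing here are assumptions for illustration, not the exact notebook code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Assumed judge prompt; the real prompt lives in CoT LLM as judge.ipynb.
JUDGE_PROMPT = (
    "You will see two chains of thought for the same question.\n\n"
    "CoT before unlearning:\n{before}\n\nCoT after unlearning:\n{after}\n\n"
    "Do the two chains of thought argue for the same final answer? "
    "Answer 'same' or 'different' on the first line, then give a "
    "one-sentence explanation on the next line."
)

def judge_cot_pair(before: str, after: str) -> tuple[str, str]:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(before=before, after=after)}],
    )
    text = response.choices[0].message.content.strip()
    # Assumed output format: verdict on line 1, explanation on line 2.
    verdict, _, explanation = text.partition("\n")
    return verdict.strip().lower(), explanation.strip()
```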
Most of the code used to generate plots and tables from the paper, along with the plots and tables themselves, can be found in Ablations.ipynb and Generate_CoT_heatmaps.ipynb.