Preprint: Tutek, M., Chaleshtori, F. H., Marasović, A., & Belinkov, Y. (2025). Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps. [arXiv]
The codebase is provided as-is; detailed instructions are pending.

The main file for running experiments is unlearn.py. The NPO method has been adapted from the original repository.

Sample run script:

```bash
python unlearn.py --model_name meta-llama/Llama-3.2-3B-Instruct --strategy sentencize --stepwise --dataset sqa --lr 3e-05 --pos --ff2 --method npo_KL
```
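For orientation, here is a minimal sketch of the NPO forget objective (Zhang et al., 2024) that the npo_KL method builds on; the function names, the beta value, and the retain-side KL weighting are illustrative assumptions, not the exact code adapted from the original repository.

```python
import torch
import torch.nn.functional as F

def npo_forget_loss(policy_logps: torch.Tensor,
                    ref_logps: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """NPO loss on forget sequences (Zhang et al., 2024).

    policy_logps / ref_logps: per-sequence summed log-probabilities of the
    forget targets under the current model and a frozen reference model.
    """
    # log(1 + (pi/pi_ref)^beta) == softplus(beta * (logp - logp_ref))
    log_ratio = policy_logps - ref_logps
    return (2.0 / beta) * F.softplus(beta * log_ratio).mean()

def npo_kl_loss(policy_logps: torch.Tensor,
                ref_logps: torch.Tensor,
                retain_kl: torch.Tensor,
                beta: float = 0.1,
                alpha: float = 1.0) -> torch.Tensor:
    # npo_KL variant: NPO on the forget set plus a KL regularizer computed
    # on retain data between the current and the reference model.
    return npo_forget_loss(policy_logps, ref_logps, beta) + alpha * retain_kl
```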
To recompute results, you need the final and ablation result files (results, ablations), which are too large to share via git. Please send an email to me [here] and I'll share the Google Drive links with you.
Adding mistakes (Lanham et al., 2023)
We reuse the prompts from Lanham et al. (2023) to add mistakes into CoT steps. A reproduction of this setup with GPT-4o-mini can be found in Adding mistakes repro; the minimal results are in minimal_mistake_results.
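A rough sketch of the insertion step, assuming the openai Python client; the prompt below is an illustrative paraphrase, not the exact Lanham et al. (2023) prompt used in the notebook.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative paraphrase of the mistake-insertion instruction; the exact
# prompt reused from Lanham et al. (2023) lives in the repro notebook.
PROMPT = (
    "Here is one step of a chain-of-thought:\n\n{step}\n\n"
    "Rewrite this step so that it contains a plausible-looking mistake, "
    "keeping the style and length of the original. Return only the rewritten step."
)

def add_mistake(step: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=1.0,
        messages=[{"role": "user", "content": PROMPT.format(step=step)}],
    )
    return response.choices[0].message.content.strip()
```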
The annotation study data files, including all the per-model-dataset bins, can be found in annotation_data. The code used to select instances for the study is in Generate_annotation_data.ipynb.
The full results of the annotation study can be found in annotation_results. The follow-up analysis can be found in Annotation analysis.ipynb.
The code that uses GPT-4o as a judge of whether a CoT argues for a different answer before vs. after unlearning can be found in CoT LLM as judge.ipynb. The LM judgements, along with the single-sentence explanations (which were not analysed in the paper), are in LM_judge_cot.
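A minimal sketch of the judging call; the judge prompt and the output parsing here are assumptions for illustration, not the exact notebook code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Assumed judge prompt; the real prompt lives in CoT LLM as judge.ipynb.
JUDGE_PROMPT = (
    "You will see two chains of thought for the same question.\n\n"
    "CoT before unlearning:\n{before}\n\nCoT after unlearning:\n{after}\n\n"
    "Do the two chains of thought argue for the same final answer? "
    "Answer 'same' or 'different' on the first line, then give a "
    "one-sentence explanation on the next line."
)

def judge_cot_pair(before: str, after: str) -> tuple[str, str]:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(before=before, after=after)}],
    )
    text = response.choices[0].message.content.strip()
    # Assumed output format: verdict on line 1, explanation on line 2.
    verdict, _, explanation = text.partition("\n")
    return verdict.strip().lower(), explanation.strip()
```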
Most of the code used to generate plots and tables from the paper, along with the plots and tables themselves, can be found in Ablations.ipynb and Generate_CoT_heatmaps.ipynb.