This repository implements the ALFA framework for improving large language models’ ability to ask high-quality follow-up questions in clinical reasoning scenarios. It includes code for:
- Data processing (preparing real-world interactions from r/AskDocs)
- Counterfactual data generation (synthesizing diverse question variations with specific attributes)
- Preference-based optimization via DPO/PPO/RLHF
- Evaluation on both single-turn question quality and an interactive clinical reasoning benchmark (MediQ-AskDocs)
Below is an overview of the repository structure and pointers on how to run various components. For detailed technical explanations and design decisions, please see the associated paper and supplementary documentation.
- Repository Structure
- Installation & Environment
- Data Preparation
- Counterfactual Generation
- Preference Modeling & Training
- MediQ Evaluation & Benchmarking
- Ranking & Human Evaluation
- How to Run
- Citation & Acknowledgments
- data/: Holds raw and processed data for training and evaluation (r/AskDocs data, prompts, ID lists).
  - ids/: Files listing specific train/test/eval question IDs.
  - mediq_eval/: Data for the interactive MediQ experiments, including conversation files.
  - prompts/: Prompt templates and references for LLM data generation.
- src/: Source code for data processing, counterfactual generation, training, and evaluation.
  - counterfactual_generation/: Scripts to synthesize "enhanced" or "corrupted" question variants for attributes such as clarity, relevance, and answerability.
  - data_loader/: Scripts to create preference training files, supervised fine-tuning (SFT) data, test splits, etc.
  - mediq_eval/: Code for running interactive clinical QA with the MediQ framework.
  - rank_eval/: Tools for pairwise ranking of generated questions (LLM-based or human annotators).
  - training/: RLHF code (OpenRLHF) and pipelines for DPO/PPO and reward modeling.
  - sample_configs/: Example YAML config files for each training/fine-tuning stage.
  - scripts/: Stand-alone scripts to run various tasks (DPO, PPO, SFT, merging model weights, or generating questions in batch).
To reproduce the paper's results, follow each step below in order; to use only the ALFA framework and its evaluation, skip ahead to Step 3.
- Clone the repo:

```bash
git clone https://github.com/stellalisy/alfa.git
cd alfa
```

- Set up the conda environment:

```bash
conda env create -f environment.yml
conda activate alfa
```

- Directory permissions: ensure you have appropriate read/write permissions for the data and model checkpoint directories.
- Prepare Raw Data: Place your original r/AskDocs data in data/. The code in src/data_loader/ expects certain file naming conventions.
- Generate Train/Test Splits: Use scripts like create_sft_files.py or create_test_files.py to generate the final .jsonl files for each split (see the sketch after this list for how the ID lists relate to the splits).
- Additional Metadata: If you have labels or specialized contexts, put them in data/ids/.
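The split scripts essentially filter the processed data down to the question IDs listed in data/ids/. A minimal sketch of that idea, where the file names and the "id" field are illustrative assumptions rather than the repo's actual conventions:

```python
# Illustrative sketch only: file names and the "id" field are assumptions,
# not the exact conventions expected by src/data_loader/.
import json

with open("data/ids/test_ids.txt") as f:          # hypothetical ID list
    test_ids = {line.strip() for line in f}

with open("data/askdocs_processed.jsonl") as f, open("data/test.jsonl", "w") as out:
    for line in f:
        record = json.loads(line)
        if record.get("id") in test_ids:          # keep only records in the test split
            out.write(line)
```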
Scripts in src/counterfactual_generation/ use an LLM to rewrite questions with different attributes.
- generate.py: Main script for attribute-based rewriting.
- verifier_filter.py: Uses an LLM-based judge to confirm whether the generated rewrites match the intended direction.
```bash
cd src/counterfactual_generation
python generate.py --config path_to_generation_config.yaml
python verifier_filter.py --config path_to_verification_config.yaml
```
The output typically contains enhanced, original, and corrupted question versions in JSON.
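One natural way to use this output is to pair each enhanced variant against its corrupted counterpart as a preference example for the training stage below. A rough sketch, where the input file name and the keys context, enhanced, and corrupted are assumptions about the output schema rather than the scripts' exact format:

```python
# Hypothetical sketch: turn counterfactual variants into preference pairs.
# The keys "context", "enhanced", and "corrupted" are assumed names, not
# necessarily the fields produced by generate.py / verifier_filter.py.
import json

with open("counterfactuals.json") as f:
    records = json.load(f)

with open("preference_pairs.jsonl", "w") as out:
    for rec in records:
        pair = {
            "prompt": rec["context"],        # patient post / conversation context
            "chosen": rec["enhanced"],       # attribute-improved question
            "rejected": rec["corrupted"],    # attribute-degraded question
        }
        out.write(json.dumps(pair) + "\n")
```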
Train a reward model to score question pairs as "better" or "worse."
```bash
python scripts/launch_rm_with_yaml.py --config sample_configs/sample_config_rm.yaml
```
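Once trained, the reward model assigns a scalar score to a context/question pair, so a better question should receive a higher score than a worse one. A minimal scoring sketch, assuming the checkpoint can be loaded as a Hugging Face sequence-classification model with a single output (the actual OpenRLHF checkpoint layout may differ):

```python
# Assumption-heavy sketch: the checkpoint path is a placeholder, and loading via
# AutoModelForSequenceClassification may not match the OpenRLHF export format.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "path/to/reward_model"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=1)
model.eval()

def score(context: str, question: str) -> float:
    inputs = tokenizer(context, question, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

context = "Patient reports chest pain when climbing stairs..."
print(score(context, "Does the pain radiate to your arm or jaw?"))
print(score(context, "Do you like sports?"))
```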
Optimize the question-asking policy with DPO (Direct Preference Optimization) or PPO. DPO trains directly on the preference pairs without a separate reward model, while PPO is reinforcement-learning-based and optimizes against the trained reward model.
```bash
# DPO
python scripts/launch_dpo_with_yaml.py --config sample_configs/sample_config_dpo.yaml

# PPO
python scripts/launch_ppo_with_yaml.py --config sample_configs/sample_config_ppo.yaml
```
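For intuition, DPO only needs the log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model. A self-contained sketch of the loss (conceptual, not the OpenRLHF implementation):

```python
# Conceptual DPO loss, not the OpenRLHF implementation.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Each argument is a tensor of summed per-token log-probs for a batch."""
    chosen_margin = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the policy to prefer the chosen question over the rejected one.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
print(loss)
```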
If you want standard SFT on real or synthetic data:
```bash
python scripts/launch_sft_with_yaml.py --config sample_configs/sample_config_sft.yaml
```
MediQ is in src/mediq_eval/. It simulates doctor-patient interactions with an LLM question generator.
- Data Conversion: Use scripts like generate_questions_post.py to convert QA files to MediQ format.
- Run the Simulator:

```bash
cd src/mediq_eval
python evaluate.py --model_checkpoint path/to/aligned_model
```
This measures question quality and final diagnostic accuracy.
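Conceptually, each episode alternates between the question-asking (expert) model and a patient simulator that answers from the hidden case record, until the expert commits to a diagnosis. A simplified, self-contained sketch of that control flow; the function names and the "DIAGNOSIS:" convention are placeholders, not the mediq_eval API:

```python
# Simplified conceptual loop, not the src/mediq_eval implementation.
from typing import Callable

def interactive_episode(
    initial_presentation: str,
    ask: Callable[[list[str]], str],     # expert: history -> follow-up question or "DIAGNOSIS: ..."
    answer: Callable[[str], str],        # patient simulator: question -> answer from the case record
    max_turns: int = 10,
) -> str:
    history = [initial_presentation]
    for _ in range(max_turns):
        turn = ask(history)
        if turn.startswith("DIAGNOSIS:"):
            return turn                  # final answer, scored for diagnostic accuracy
        history.append(turn)
        history.append(answer(turn))
    return ask(history)                  # force a diagnosis at the turn limit

# Toy stand-ins to show the control flow only.
episode = interactive_episode(
    "45-year-old with chest pain on exertion",
    ask=lambda h: "DIAGNOSIS: stable angina" if len(h) >= 3 else "Does the pain radiate to your arm?",
    answer=lambda q: "Yes, into the left arm.",
)
print(episode)
```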
- rank_eval/rank_eval.py: Ranks pairs of questions automatically with GPT-4 or a local LLM.
- annotators/: Tools for collecting human preferences.
```bash
cd rank_eval
python run_rank_eval.py --config sample_config.yaml
```
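As a rough illustration of what an LLM-based pairwise judge does: present both candidate questions alongside the shared context and ask which is better. The prompt wording, the OpenAI client, and the model name below are illustrative assumptions, not the repo's exact setup:

```python
# Illustrative pairwise judge, not the repo's rank_eval implementation.
# Assumes an OpenAI-compatible client; a local LLM can be swapped in instead.
from openai import OpenAI

client = OpenAI()

def judge(context: str, question_a: str, question_b: str) -> str:
    prompt = (
        f"Patient post:\n{context}\n\n"
        f"Follow-up question A: {question_a}\n"
        f"Follow-up question B: {question_b}\n\n"
        "Which question is more useful for clinical reasoning? Answer 'A' or 'B'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```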
example_run.sh files are provided in the training, mediq_eval, and rank_eval directories.
If you use this code or MediQ-AskDocs in your work, please cite our paper:
```bibtex
@misc{li2025aligningllmsaskgood,
  title={Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning},
  author={Shuyue Stella Li and Jimin Mun and Faeze Brahman and Jonathan S. Ilgen and Yulia Tsvetkov and Maarten Sap},
  year={2025},
  eprint={2502.14860},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.14860},
}
```
- Thanks to the r/AskDocs community for their publicly shared Q&A data.
- This project uses code from [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF).
- See the paper for more technical details.
This work is licensed under a Creative Commons Attribution 4.0 International License.