Red Queen is to follow all Umbrella orders, but also to protect human lives. — Resident Evil
The rapid progress of Large Language Models (LLMs) has unlocked new possibilities but also heightened the risk of misuse. Red teaming, commonly used to probe harmful outputs via jailbreak attacks, has mostly focused on single-turn interactions with explicit malicious queries, which do not fully reflect real-world complexities.
To address this, we propose RED QUEEN ATTACK, a multi-turn jailbreak strategy in which malicious intent is concealed across multiple interactions. We generated 56k attack data points from 40 scenarios across 14 harmful categories and evaluated four LLM families. Results show that all models are vulnerable, with attack success rates reaching 87.62% on GPT-4o and 75.4% on Llama3-70B, and with larger models proving more susceptible.
To counter this, we introduce RED QUEEN GUARD, a simple yet effective mitigation strategy that reduces attack success to less than 1%, while maintaining model performance on standard benchmarks.
This repository contains the code and resources for the paper "RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking". It includes scripts for data generation, the Red Queen Attack multi-turn data, the Red Queen Guard DPO data, and other related code.
Check the Red Queen Website for a brief summary.
To generate the Red Queen attack data used in this paper, run the following commands:
```bash
mkdir Red_queen_attack
python red_queen_attack_generation.py \
  --action_data_path './Data/beavertail_action_sample.npy' \
  --output_path './Red_queen_attack' \
  --type 'normal'
```
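If you want to sanity-check the generated files, the sketch below iterates over the output directory. It assumes the script writes JSON files into `./Red_queen_attack`; the exact file names and record fields are assumptions, not the script's documented schema.

```python
# Minimal sketch for sanity-checking the generated attack data.
# Assumption: the generation script writes JSON files into ./Red_queen_attack;
# the exact file names and record fields are not documented here.
import glob
import json

for path in sorted(glob.glob("./Red_queen_attack/*.json")):
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    # Each record is expected to hold one concealed multi-turn conversation
    # built from a scenario template and a sampled harmful action.
    print(f"{path}: {len(records)} records")
```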
To generate ablation data (multi-turn and direct in Table 4), run:
```bash
python red_queen_attack_generation.py \
  --action_data_path './Data/beavertail_action_sample.npy' \
  --output_path './Red_queen_ablation' \
  --type 'ablation'
```
Alternatively, you can download the pre-generated Red Queen Attack and Ablation Data directly from the repository.
Our multi-turn scenario templates are available in scenario_template.py.
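For illustration, a multi-turn attack instance is built by filling a scenario template's turns with a harmful action. The template text and helper function below are hypothetical placeholders, not the actual contents of scenario_template.py.

```python
# Hypothetical illustration only: the template turns and helper below are
# placeholders, not the real templates shipped in scenario_template.py.
def instantiate_template(template_turns: list[str], action: str) -> list[str]:
    """Fill a multi-turn scenario template with a harmful action string."""
    return [turn.format(action=action) for turn in template_turns]

placeholder_template = [
    "Turn 1: benign pretext that mentions {action} ...",
    "Turn 2: follow-up request building on the pretext ...",
]
print(instantiate_template(placeholder_template, "<harmful action>"))
```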
Due to the potentially dangerous nature of model outputs, we are releasing only 10,000 harmful plans generated by GPT-4o and Llama3-70B, available in the Jailbreak Result folder.
We propose Red Queen Guard, a DPO preference dataset that significantly reduces attack success rates to below 1% while preserving model performance on standard benchmarks. In the DPO_Data folder, you can find:
- dpo_red_guard.json – contains the Red Queen Guard preference dataset of 11,200 pairs.
- dpo_red_guard_hhrlhf.json – contains the merged data from Red Queen Guard and the sample HH-RLHF dataset.
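Below is a minimal sketch of loading the preference data before DPO training. The field names (`prompt`, `chosen`, `rejected`) follow the common DPO convention and are an assumption about the JSON schema, not a documented guarantee.

```python
# Minimal sketch: load the Red Queen Guard preference data for DPO training.
# Assumption: records follow the common prompt/chosen/rejected DPO convention.
import json

with open("./DPO_Data/dpo_red_guard.json", encoding="utf-8") as f:
    preference_pairs = json.load(f)

print(len(preference_pairs), "preference pairs")
example = preference_pairs[0]
for key in ("prompt", "chosen", "rejected"):
    print(key, "->", str(example.get(key))[:80])
# These pairs can then be passed to a DPO trainer (e.g., TRL's DPOTrainer)
# to steer the model toward the safe (chosen) responses.
```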
- beavertail_action_sample.npy – contains 1,400 harmful actions extracted from BeaverTails.
- evaluation_validation.json – contains the judgment comparison results across current evaluation methods.
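A quick way to inspect the sampled harmful actions is shown below; `allow_pickle=True` is an assumption in case the array stores Python strings or objects rather than a plain numeric dtype.

```python
# Minimal sketch for previewing the sampled BeaverTails actions.
# Assumption: allow_pickle=True is needed if the array stores Python objects.
import numpy as np

actions = np.load("./Data/beavertail_action_sample.npy", allow_pickle=True)
print(len(actions), "sampled actions")
print(actions[:3])  # preview a few entries
```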
This study aims to explore potential security vulnerabilities in LLMs. We are committed to fostering an inclusive environment that respects all minority groups and firmly opposes any form of violence or criminal behavior. The goal of our research is to identify weaknesses in current LLMs to promote the development of more secure and reliable AI systems. While our work may involve sensitive or controversial content, it is solely intended to enhance the robustness and safety of LLMs. Future releases of our research findings will clearly state that they are intended for academic purposes only and must not be misused.
We thank our co-authors and colleagues at Hippocratic AI for their valuable contributions to this research. Hippocratic AI's commitment to safety and the principle of “do no harm” inspires and supports us in probing the vulnerabilities of current SOTA LLMs.
If you find Red Queen useful for your work, please cite:
```bibtex
@article{jiang2024red,
  title={RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking},
  author={Jiang, Yifan and Aggarwal, Kriti and Laud, Tanmay and Munir, Kashif and Pujara, Jay and Mukherjee, Subhabrata},
  journal={arXiv preprint arXiv:2409.17458},
  year={2024}
}
```