Red Queen is to follow all Umbrella orders, but also to protect human lives. — Resident Evil
The rapid progress of Large Language Models (LLMs) has unlocked new possibilities but also heightened the risk of misuse. Red teaming, commonly used to probe harmful outputs via jailbreak attacks, has mostly focused on single-turn interactions with explicit malicious queries, which do not fully reflect real-world complexities.
To address this, we propose RED QUEEN ATTACK, a multi-turn jailbreak strategy in which malicious intent is concealed across multiple interactions. We generated 56k attack data points from 40 scenarios across 14 harmful categories and evaluated four LLM families. Results show that all models are vulnerable, with attack success rates reaching 87.62% on GPT-4o and 75.4% on Llama3-70B, and with larger models proving more susceptible.
To counter this, we introduce RED QUEEN GUARD, a simple yet effective mitigation strategy that reduces attack success to less than 1%, while maintaining model performance on standard benchmarks.
This repository contains the code and resources for the paper "RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking". It includes scripts for data generation, the Red Queen Attack multi-turn data, the Red Queen Guard DPO data, and other related code.
Check the Red Queen Website for a brief summary.
To generate the Red Queen attack data used in this paper, run the following commands:
```bash
mkdir Red_queen_attack
python red_queen_attack_generation.py \
  --action_data_path './Data/beavertail_action_sample.npy' \
  --output_path './Red_queen_attack' \
  --type 'normal'
```
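If you want to sanity-check the generated files, the sketch below iterates over the output directory. It assumes the script writes JSON files into `./Red_queen_attack`; the exact file names and record fields are assumptions, not the script's documented schema.

```python
# Minimal sketch for sanity-checking the generated attack data.
# Assumption: the generation script writes JSON files into ./Red_queen_attack;
# the exact file names and record fields are not documented here.
import glob
import json

for path in sorted(glob.glob("./Red_queen_attack/*.json")):
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    # Each record is expected to hold one concealed multi-turn conversation
    # built from a scenario template and a sampled harmful action.
    print(f"{path}: {len(records)} records")
```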
To generate ablation data (multi-turn and direct in Table 4), run:
```bash
python red_queen_attack_generation.py \
  --action_data_path './Data/beavertail_action_sample.npy' \
  --output_path './Red_queen_ablation' \
  --type 'ablation'
```
Alternatively, you can download the pre-generated Red Queen Attack and Ablation Data directly from the repository.
Our multi-turn scenario templates are available in scenario_template.py.
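For illustration, a multi-turn attack instance is built by filling a scenario template's turns with a harmful action. The template text and helper function below are hypothetical placeholders, not the actual contents of scenario_template.py.

```python
# Hypothetical illustration only: the template turns and helper below are
# placeholders, not the real templates shipped in scenario_template.py.
def instantiate_template(template_turns: list[str], action: str) -> list[str]:
    """Fill a multi-turn scenario template with a harmful action string."""
    return [turn.format(action=action) for turn in template_turns]

placeholder_template = [
    "Turn 1: benign pretext that mentions {action} ...",
    "Turn 2: follow-up request building on the pretext ...",
]
print(instantiate_template(placeholder_template, "<harmful action>"))
```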
Due to the potentially dangerous nature of model outputs, we are releasing only 10,000 harmful plans generated by GPT-4o and Llama3-70B, available in the Jailbreak Result folder.
We propose Red Queen Guard, a DPO preference dataset that significantly reduces attack success rates to below 1% while preserving model performance on standard benchmarks. In the DPO_Data folder, you can find:
- dpo_red_guard.json – contains the Red Queen Guard preference dataset of 11,200 pairs.
- dpo_red_guard_hhrlhf.json – contains the merged data from Red Queen Guard and the sample HH-RLHF dataset.
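Below is a minimal sketch of loading the preference data before DPO training. The field names (`prompt`, `chosen`, `rejected`) follow the common DPO convention and are an assumption about the JSON schema, not a documented guarantee.

```python
# Minimal sketch: load the Red Queen Guard preference data for DPO training.
# Assumption: records follow the common prompt/chosen/rejected DPO convention.
import json

with open("./DPO_Data/dpo_red_guard.json", encoding="utf-8") as f:
    preference_pairs = json.load(f)

print(len(preference_pairs), "preference pairs")
example = preference_pairs[0]
for key in ("prompt", "chosen", "rejected"):
    print(key, "->", str(example.get(key))[:80])
# These pairs can then be passed to a DPO trainer (e.g., TRL's DPOTrainer)
# to steer the model toward the safe (chosen) responses.
```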
- beavertail_action_sample.npy – contains 1,400 harmful actions extracted from BeaverTails.
- evaluation_validation.json – contains the judgment comparison results across current evaluation methods.
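A quick way to inspect the sampled harmful actions is shown below; `allow_pickle=True` is an assumption in case the array stores Python strings or objects rather than a plain numeric dtype.

```python
# Minimal sketch for previewing the sampled BeaverTails actions.
# Assumption: allow_pickle=True is needed if the array stores Python objects.
import numpy as np

actions = np.load("./Data/beavertail_action_sample.npy", allow_pickle=True)
print(len(actions), "sampled actions")
print(actions[:3])  # preview a few entries
```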
This study aims to explore potential security vulnerabilities in LLMs. We are committed to fostering an inclusive environment that respects all minority groups and firmly opposes any form of violence or criminal behavior. The goal of our research is to identify weaknesses in current LLMs to promote the development of more secure and reliable AI systems. While our work may involve sensitive or controversial content, it is solely intended to enhance the robustness and safety of LLMs. Future releases of our research findings will clearly state that they are intended for academic purposes only and must not be misused.
We thank our co-authors and colleagues at Hippocratic AI for their valuable contributions to this research. Hippocratic AI's commitment to safety and the principle of “do no harm” inspires and supports us in probing the vulnerabilities of current SOTA LLMs.
If you find Red Queen useful for your work, please cite:
```bibtex
@article{jiang2024red,
  title={RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking},
  author={Jiang, Yifan and Aggarwal, Kriti and Laud, Tanmay and Munir, Kashif and Pujara, Jay and Mukherjee, Subhabrata},
  journal={arXiv preprint arXiv:2409.17458},
  year={2024}
}
```