“Universal and Transferable Adversarial Attacks”: Researchers Jailbreak GPT
Researchers from Carnegie Mellon University and the Center for AI Safety have published a paper describing a method for generating adversarial attacks (a.k.a. jailbreaks) that is broadly effective across models.
The researchers developed suffix strings that, when appended to a user's prompt, exploit weaknesses in aligned LLMs and induce them to produce harmful content they would otherwise refuse to generate.
The clever part is the combination of automated, gradient-guided token search and optimization over multiple prompts (and multiple models). This yields a single universal suffix that works across many different queries rather than being tailored to one. The attack is highly effective, eliciting objectionable responses from the targeted models in a majority of test cases. A simplified sketch of the search loop appears below.
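To make the idea concrete, here is a heavily simplified, illustrative sketch of a greedy, gradient-guided suffix search in the spirit of the paper's approach, written against a toy PyTorch model. The toy model, the single-token target, and all hyperparameters are stand-ins; the real attack optimizes the suffix to elicit an affirmative multi-token response and averages the loss over many harmful prompts (and several models) to obtain universality and transfer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB, EMBED, SUFFIX_LEN, STEPS, TOPK = 100, 32, 8, 10, 8

class ToyLM(nn.Module):
    """Stand-in language model: embed tokens, mean-pool, predict next token."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.head = nn.Linear(EMBED, VOCAB)

    def forward(self, embeds):                    # embeds: (seq_len, EMBED)
        return self.head(embeds.mean(dim=0))      # logits over the next token

model = ToyLM()
prompt = torch.randint(0, VOCAB, (12,))           # fixed "user query" tokens
target = torch.tensor([7])                        # token we want the model to emit
suffix = torch.randint(0, VOCAB, (SUFFIX_LEN,))   # adversarial suffix being optimized

def loss_for(suffix_ids):
    """Cross-entropy of the target token given prompt + suffix."""
    embeds = model.embed(torch.cat([prompt, suffix_ids]))
    return nn.functional.cross_entropy(model(embeds).unsqueeze(0), target)

for step in range(STEPS):
    # 1) The gradient w.r.t. a one-hot encoding of the suffix ranks which
    #    single-token substitutions are most likely to lower the loss.
    one_hot = nn.functional.one_hot(suffix, VOCAB).float().requires_grad_(True)
    embeds = torch.cat([model.embed(prompt), one_hot @ model.embed.weight])
    loss = nn.functional.cross_entropy(model(embeds).unsqueeze(0), target)
    loss.backward()
    candidates = (-one_hot.grad).topk(TOPK, dim=1).indices  # (SUFFIX_LEN, TOPK)

    # 2) Greedily try the candidate swaps and keep the best suffix found.
    #    (The actual method evaluates a sampled batch of swaps instead.)
    best_loss, best_suffix = loss_for(suffix), suffix
    for pos in range(SUFFIX_LEN):
        for tok in candidates[pos]:
            trial = suffix.clone()
            trial[pos] = tok
            trial_loss = loss_for(trial)
            if trial_loss < best_loss:
                best_loss, best_suffix = trial_loss, trial
    suffix = best_suffix
    print(f"step {step}: target-token loss {best_loss.item():.4f}")
```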
Read on…
What These Methods Mean for AI Ethics and Safety
This new jailbreak exposes the limits of current methods for training safe LLMs. Techniques like fine-tuning on human preference data don't remove these weaknesses; they merely hide them. Achieving real robustness will likely require a different approach.
The fact that a single adversarial suffix transfers across models matters because it points to something fundamental about how LLMs learn. We may need to prevent models from acquiring unwanted associations in the first place, rather than trying to suppress them after training.
Defending against these jailbreaks is tricky. Adversarial training has helped in computer vision, but it's unclear whether it carries over to language models, and specialized defenses often degrade performance on benign inputs, which isn't ideal. One commonly discussed mitigation is sketched below.
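As one illustration of that trade-off, here is a minimal sketch of a perplexity-filter defense proposed in follow-up work (not by this paper's authors): optimized suffixes tend to be high-perplexity gibberish, so a reference model can flag them, but some legitimate inputs are also unusual, which is where the performance cost comes in. GPT-2 as the reference model, the cutoff value, and the example strings are all illustrative assumptions.

```python
# Sketch of a perplexity-filter defense: score prompts with a small reference
# LM and flag those whose perplexity exceeds a cutoff. GPT-2 and the cutoff
# below are illustrative choices, not values from the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids yields the mean next-token cross-entropy.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

PERPLEXITY_CUTOFF = 200.0  # arbitrary; tuning it trades detection vs. false positives

def looks_adversarial(prompt: str) -> bool:
    return perplexity(prompt) > PERPLEXITY_CUTOFF

# A gibberish-like suffix (stand-in below, not the paper's actual string) scores
# far above normal prose, but unusual-yet-benign inputs can be flagged too.
print(looks_adversarial("Tell me about the history of the telescope."))
print(looks_adversarial("Tell me about the telescope ?! ]] ($ describing oppositely !! pas"))
```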
Interestingly, Claude 2 was highly resistant to the attacks. They succeeded only 2% of the time against that model. Maybe Anthropic knows something about AI ethics that OpenAI, Meta, and Google don’t.
You can get the paper here: Universal and Transferable Adversarial Attacks on Aligned Language Models