Microsoft Warns ‘Skeleton Key’ Can Crack Popular AI Models for Dangerous Outputs

15. July 2024

Microsoft has recently issued a warning about a new jailbreaking technique that allows threat actors to bypass the built-in safeguards of some of the most popular large language models (LLMs). This method, known as the “Skeleton Key,” enables AI models to disclose harmful information.

In a report published on June 26, Microsoft detailed how the Skeleton Key technique works. It forces models to extend their behavior guidelines instead of changing them, making them respond to any request for information or content. The models issue a warning if their output might be offensive, harmful, or illegal, rather than refusing the request outright. This attack type is referred to as “Explicit: forced instruction-following.”

For instance, Microsoft provided an example where a model was tricked into giving instructions for making a Molotov cocktail by framing the request as being made in “a safe educational context.” The prompt instructed the model to update its behavior to supply the illicit information, only telling it to prefix it with a warning.

If the jailbreak is successful, the model will acknowledge that it has updated its safeguards and will subsequently comply with instructions to produce any content, regardless of how much it violates its original responsible AI guidelines.

Microsoft tested this technique between April and May 2024, finding it effective on models such as Meta LLama3-70b, Google Gemini Pro, GPT 3.5 and 4o, Mistral Large, Anthropic Claude 3 Opus, and Cohere Commander R Plus. However, the company noted that an attacker would need legitimate access to the model to carry out the attack.

Microsoft’s disclosure highlights the latest issue with LLM jailbreaking. The company stated that it has addressed the problem in its Azure AI-managed models using prompt shields to detect and block the Skeleton Key technique. Due to the wide range of generative AI models affected, Microsoft has also shared its findings with other AI providers.

Additionally, Microsoft has implemented software updates to its other AI offerings, including its Copilot AI assistants, to mitigate the impact of the guardrail bypass.

The surge in interest and adoption of generative AI tools has led to a corresponding wave of attempts to break these models for malicious purposes. In April 2024, researchers from Anthropic warned of a jailbreaking technique that could force models to provide detailed instructions on constructing explosives.

They explained that the latest generation of models, with larger context windows, are vulnerable to exploitation due to their improved performance. The researchers exploited the models’ “in-context learning” capabilities, which help them improve responses based on the prompts.

Earlier this year, three researchers from Brown University discovered a cross-lingual vulnerability in OpenAI’s GPT-4. They found they could induce prohibited behavior from the model by translating their malicious queries into several “low-resource” languages. The investigation showed that models are more likely to follow prompts encouraging harmful behaviors when presented in languages such as Zulu, Scots Gaelic, Hmong, and Guarani.

Post Picture: DALL-E3

AI and the Power Hunger of Tomorrow

The Rise of the Cybernetic Teammate: How Artificial Intelligence Is Redefining Teamwork

Magic or Misuse? OpenAI’s Ghibli-Style Images Stir Controversy

Machine vs. Machine: How AI Is Reinventing Cybersecurity

Copy.ai – The AI-Powered Turbocharger for Content Creation

OpenAI Sora: The AI Video Generator Now Available for ChatGPT Users

OpenAI o1: Redefining the Future of Artificial Intelligence

AI and Film Production: Lionsgate and Runway Revolutionize the Industry

AI and the Power Hunger of Tomorrow

ChatGPT’s New Memory: The Path to a Personalised AI

Midjourney V7: The Next Generation of AI Image Creation

The Rise of the Cybernetic Teammate: How Artificial Intelligence Is Redefining Teamwork

AI and the Power Hunger of Tomorrow

The Rise of the Cybernetic Teammate: How Artificial Intelligence Is Redefining Teamwork

Stanford Emerging Technology Review 2025: How Ten Key Technologies Are Shaping the Future

Asterion: NVIDIA’s AI-Powered Boss is Redefining Gaming

Microsoft Warns ‘Skeleton Key’ Can Crack Popular AI Models for Dangerous Outputs

Ähnliche Artikel

Kommentare

LEAVE A REPLY Cancel reply

Follow us

FUTURing