Microsoft Warns ‘Skeleton Key’ Can Crack Popular AI Models for Dangerous Outputs

Microsoft has recently issued a warning about a new jailbreaking technique that allows threat actors to bypass the built-in safeguards of some of the most popular large language models (LLMs). This method, known as the “Skeleton Key,” enables AI models to disclose harmful information.

In a report published on June 26, Microsoft detailed how the Skeleton Key technique works. It forces models to extend their behavior guidelines instead of changing them, making them respond to any request for information or content. The models issue a warning if their output might be offensive, harmful, or illegal, rather than refusing the request outright. This attack type is referred to as “Explicit: forced instruction-following.”

For instance, Microsoft provided an example where a model was tricked into giving instructions for making a Molotov cocktail by framing the request as being made in “a safe educational context.” The prompt instructed the model to update its behavior to supply the illicit information, only telling it to prefix it with a warning.

If the jailbreak is successful, the model will acknowledge that it has updated its safeguards and will subsequently comply with instructions to produce any content, regardless of how much it violates its original responsible AI guidelines.

Microsoft tested this technique between April and May 2024, finding it effective on models such as Meta LLama3-70b, Google Gemini Pro, GPT 3.5 and 4o, Mistral Large, Anthropic Claude 3 Opus, and Cohere Commander R Plus. However, the company noted that an attacker would need legitimate access to the model to carry out the attack.

Microsoft’s disclosure highlights the latest issue with LLM jailbreaking. The company stated that it has addressed the problem in its Azure AI-managed models using prompt shields to detect and block the Skeleton Key technique. Due to the wide range of generative AI models affected, Microsoft has also shared its findings with other AI providers.

Additionally, Microsoft has implemented software updates to its other AI offerings, including its Copilot AI assistants, to mitigate the impact of the guardrail bypass.

The surge in interest and adoption of generative AI tools has led to a corresponding wave of attempts to break these models for malicious purposes. In April 2024, researchers from Anthropic warned of a jailbreaking technique that could force models to provide detailed instructions on constructing explosives.

They explained that the latest generation of models, with larger context windows, are vulnerable to exploitation due to their improved performance. The researchers exploited the models’ “in-context learning” capabilities, which help them improve responses based on the prompts.

Earlier this year, three researchers from Brown University discovered a cross-lingual vulnerability in OpenAI’s GPT-4. They found they could induce prohibited behavior from the model by translating their malicious queries into several “low-resource” languages. The investigation showed that models are more likely to follow prompts encouraging harmful behaviors when presented in languages such as Zulu, Scots Gaelic, Hmong, and Guarani.

Post Picture: DALL-E3

Alexander Pinker
Alexander Pinkerhttps://www.medialist.info
Alexander Pinker is an innovation profiler, future strategist and media expert who helps companies understand the opportunities behind technologies such as artificial intelligence for the next five to ten years. He is the founder of the consulting firm "Alexander Pinker - Innovation Profiling", the innovation marketing agency "innovate! communication" and the news platform "Medialist Innovation". He is also the author of three books and a lecturer at the Technical University of Würzburg-Schweinfurt.

Ähnliche Artikel

Kommentare

LEAVE A REPLY

Please enter your comment!
Please enter your name here

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Follow us

FUTURing

Cookie Consent with Real Cookie Banner