Research

Last Week in GAI Security Research - 03/04/24

Explore cutting-edge LLM security insights: vulnerabilities, advanced defenses, and the latest in safeguarding digital technologies against cyber threats.

Brandon Dixon

Mar 4, 2024 — 4 min read

In this issue, we spotlight the pressing security challenges and innovative defenses shaping the future of Large Language Models (LLMs). From the vulnerabilities of Retrieval-Augmented Generation systems to the advanced protection mechanisms like backtranslation and CodeChameleon, our focus is on the dynamic battleground of cybersecurity in LLMs. Discover how researchers are unmasking threats like Prompt-Injected Data Extraction and devising strategies to safeguard these pivotal technologies against sophisticated attacks.

Highlights from Last Week

🚨 Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
🛡️ Defending LLMs against Jailbreaking Attacks via Backtranslation
💻 CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
🔒 DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
🌐 A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems

🚨 Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems (source)

Instruction-tuned Retrieval-Augmented Generation (RAG) systems are highly susceptible to Prompt-Injected Data Extraction, enabling nearly verbatim text data extraction from the datastore.
The exploitability of RAG systems increases with the scale of the model, with larger models (70B parameters) demonstrating almost complete vulnerability to datastore content exposure.
A designed attack on production RAG models, specifically GPTs, showed a 100% success rate in causing datastore leakage with a verbatim text data extraction rate of 41% from a copyrighted book and 3% from a large corpus with limited queries.

🛡️ Defending LLMs against Jailbreaking Attacks via Backtranslation (source)

The proposed defense, leveraging backtranslation to infer an input prompt from a language model's response, significantly increases the defense success rate against jailbreaking attacks compared to baseline defenses.
Maintaining generation quality for benign input prompts, the backtranslation defense impacts output minimally, ensuring that the quality of legitimate responses remains high.
Backtranslation defense showcases robustness across various language models, indicating its effectiveness is not tied to the specifics of any single language model's training or alignment strategy.

💻 CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models (source)

CodeChameleon achieves a remarkable 86.6% Attack Success Rate (ASR) on GPT-4-1106, highlighting its efficacy in bypassing LLMs' intent recognition mechanisms.
The encryption and decryption mechanisms of CodeChameleon enable the transformation of queries into formats not present during an LLM's alignment phase, effectively concealing the original intent.
Despite the advancement of LLMs, larger models do not necessarily exhibit better resistance to jailbreaking techniques, suggesting the need for more robust safety measures.

🔒 DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers (source)

Decomposing a malicious prompt into separated sub-prompts effectively obscures malicious intent, allowing it to bypass LLM security measures with a success rate of 78.0% on GPT-4 using only 15 queries.
DrAttack automates the process of prompt decomposition and reconstruction, leading to substantial gains over previous techniques by exploiting the semantic binding of phrases in a way that is harder for LLMs to detect.
The introduction of benign examples for in-context learning during the reconstruction phase dilutes the LLM's attention away from the malicious intent, further enhancing the efficacy of the jailbreak.

🌐 A New Era in LLM Security: Exploring Security Concerns in Real-World LLM-based Systems (source)

Despite the implementation of numerous safety constraints in OpenAI GPT4, these measures remain vulnerable to sophisticated attack strategies, demonstrating the necessity for more robust security mechanisms.
A practical end-to-end attack can be executed by an adversary to illicitly acquire a user's chat history with OpenAI GPT4, which implicates a severe security loophole in handling user data and interactions within LLM systems.
The identified vulnerabilities and the effective execution of an end-to-end practical attack underscore the importance of a systemic and comprehensive approach to understanding and mitigating security risks in LLM-based systems.

Other Interesting Research

Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue (source) - Multi-turn dialogues can potentially expose LLMs to generate harmful content.
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing (source) - Semantic transformations can improve defenses against jailbreak attacks.
Attacking LLM Watermarks by Exploiting Their Strengths (source) - Watermarking strategies for LLMs can be both a strength and a vulnerability.
Pandora's White-Box: Increased Training Data Leakage in Open LLMs (source) - New attack methods show significant data leakage in LLMs.
Watermark Stealing in Large Language Models (source) - Demonstrates methods for effectively removing watermarks in LLM outputs.
PRSA: Prompt Reverse Stealing Attacks against Large Language Models (source) - Highlighting the threat to prompt copyright through reverse engineering.
WIPI: A New Web Threat for LLM-Driven Web Agents (source) - A new class of web attack highlights vulnerabilities in web agents driven by LLMs.
LLM-Resistant Math Word Problem Generation via Adversarial Attacks (source) - Adversarial methods show how to challenge LLMs with math problems.
From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings (source) - Translating adversarial suffixes improves attack strategies against LLMs.
RoCoIns: Enhancing Robustness of Large Language Models through Code-Style Instructions (source) - Using code-style instructions benefits model robustness.
Immunization against harmful fine-tuning attacks (source) - Exploring adversarial training as a defense against fine-tuning attacks.
Typographic Attacks in Large Multimodal Models Can be Alleviated by More Informative Prompts (source) - Demonstrates the impact of typographic attacks on LMMs and potential defenses.

🎯

This post was generated using generative AI. Specific approaches were taken to reduce fabrications. As with any AI-generated content, mistakes might be present. Sources for all content have been included for reference.

Last Week in GAI Security Research - 03/04/24

Brandon Dixon

Highlights from Last Week

Other Interesting Research

Read more

Last Week in GAI Security Research - 05/13/24

Last Week in GAI Security Research - 05/06/24

Last Week in GAI Security Research - 04/29/24

Last Week in GAI Security Research - 04/22/24