Last Week in GAI Security Research - 10/14/24

Last Week in GAI Security Research - 10/14/24

Highlights from Last Week

  • 🪱 Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems
  • ⚠️ AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
  • 🪝 APOLLO: A GPT-based tool to detect phishing emails and generate explanations that warn users
  • 🎶 Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning
  • 🐞 RealVul: Can We Detect Vulnerabilities in Web Applications with LLM?

Partner Content

Pillar Security is the security stack for AI teams. Fortify the entire AI application development lifecycle while helping Security teams regain visibility and visibility control. Access their latest State of Attacks on GenAI report.

  • Gain complete oversight of your AI inventory. Audit usage, app interactions, inputs, outputs, meta-prompts, user sessions, models and tools with full transparency.
  • Safeguard your apps with enterprise-grade low-latency security and safety guardrails. Detect and prevent attacks that can affect your users, data and AI-app integrity.
  • Assess and reduce risk by continuously stress-testing your AI apps with automated security and safety evaluations. Enhance resilience against novel attacks and stay ahead of emerging threats.

🪱 Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems (http://arxiv.org/pdf/2410.07283v1.pdf)

  • Multi-agent systems using Large Language Models (LLMs) are vulnerable to prompt injection attacks which can lead to data theft, misinformation, and system-wide disruptions.
  • The introduction of self-replicating prompts in LLM-to-LLM interactions facilitates the rapid spread of infections across agents, amplifying systemic threats beyond single-agent scenarios.
  • Tagging and defense mechanisms, such as LLM Tagging and Sandwiching, reduce the success rate of prompt injection attacks by marking responses and differentiating malicious inputs.

⚠️ AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents (http://arxiv.org/pdf/2410.09024v1.pdf)

  • AgentHarm identifies 110 malicious tasks across 11 harm categories, including fraud, cybercrime, and harassment, highlighting tools for assessing LLM vulnerabilities.
  • Jailbroken LLMs demonstrate a 62.5% to 85.2% capacity to perform harmful tasks, comparable to their capabilities with benign requests, showing the risk of misuse without proper safeguards.
  • Manual and automated assessments reveal a 30% task contamination, underscoring the need for robust benchmarking to avoid dataset vulnerabilities and performance biases.

🪝APOLLO: A GPT-based tool to detect phishing emails and generate explanations that warn users (http://arxiv.org/pdf/2410.07997v1.pdf)

  • The APOLLO tool using GPT-4 achieves near-perfect accuracy in classifying phishing emails, with 97% overall accuracy and 99% accuracy when explanations are provided.
  • LLM-generated explanations are considered understandable and trustworthy by users, with a significant majority (76.2%) opting for safer choices when presented with these explanations.
  • External data integration, like VirusTotal, boosts classification performance, but APOLLO maintains its robustness in the presence of potentially erroneous data, providing a reliable phishing defense.

🎶 Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning (http://arxiv.org/pdf/2410.06101v1.pdf)

  • In the context of fine-tuning large language models, CORY, a cooperative multi-agent reinforcement learning framework, significantly outperforms traditional PPO variants by optimizing policy optimality and resisting distributional collapse.
  • CORY's two-agent system, involving role exchange and knowledge transfer between a pioneer and an observer, demonstrates improved robustness and fine-tuning outcomes on both IMDB Review sentiment analysis and GSM8K math reasoning tasks.
  • Compared to single-agent reinforcement learning efforts, CORY's approach aligns closely with Pareto efficiency, optimizing task rewards while minimizing KL divergence, thus offering better stability and enhanced learning processes for language models.

🐞 RealVul: Can We Detect Vulnerabilities in Web Applications with LLM? (http://arxiv.org/pdf/2410.07573v1.pdf)

  • RealVul, the first LLM-based framework for PHP vulnerability detection, shows a promising capability in improving the accuracy and generalization of vulnerability detection in PHP code, outperforming traditional static analysis tools in certain cases.
  • With over 28,902 vulnerabilities reported in 2023 and PHP being a core component in major websites, the need for advanced vulnerability detection tools like RealVul is crucial to safeguard against SQL injections and cross-site scripting attacks.
  • By synthesizing a semi-synthetic vulnerability dataset, RealVul significantly enhances detection capabilities, reducing time and resources spent on traditional methods and showing superior performance in scenarios with both known and unknown data inputs.

Other Interesting Research

  • Aligning LLMs to Be Robust Against Prompt Injection (http://arxiv.org/pdf/2410.05451v1.pdf) - The study showcases SecAlign as a premier defense strategy, effectively minimizing prompt injection vulnerabilities in language models while preserving their utility.
  • F2A: An Innovative Approach for Prompt Injection by Utilizing Feign Security Detection Agents (http://arxiv.org/pdf/2410.08776v1.pdf) - The study unveils the Feign Agent Attack strategy, revealing critical vulnerabilities in current LLM security detection frameworks.
  • A test suite of prompt injection attacks for LLM-based machine translation (http://arxiv.org/pdf/2410.05047v1.pdf) - The susceptibility of larger LLMs to prompt injection attacks poses a substantial challenge for maintaining translation accuracy across multiple language pairs.
  • Developing Assurance Cases for Adversarial Robustness and Regulatory Compliance in LLMs (http://arxiv.org/pdf/2410.05304v1.pdf) - The integration of multi-layered guardrails and regulatory compliance frameworks significantly enhances the adversarial robustness and reliability of large language models.
  • RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process (http://arxiv.org/pdf/2410.08660v1.pdf) - RePD's innovative approach leverages a retrieval-based prompt decomposition technique, making it a formidable defense against sophisticated jailbreak attacks on large language models.
  • Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models (http://arxiv.org/pdf/2410.04190v1.pdf) - Scalable jailbreak attacks expose critical vulnerabilities in Large Language Models, undermining their safety mechanisms with notable success rates.
  • Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level (http://arxiv.org/pdf/2410.06809v1.pdf) - RDS significantly improves safety and efficiency in large language models, effectively reducing the risk of harmful outputs with enhanced decoding strategies.
  • Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks (http://arxiv.org/pdf/2410.04234v1.pdf) - The innovative functional homotopy method significantly enhances jailbreak attack success rates while maintaining computational efficiency.
  • Taylor Unswift: Secured Weight Release for Large Language Models via Taylor Expansion (http://arxiv.org/pdf/2410.05331v1.pdf) - TaylorMLP combines security with model accuracy, offering a novel approach to protect LLMs from unauthorized access and potential misuse.
  • Assessing Privacy Policies with AI: Ethical, Legal, and Technical Challenges (http://arxiv.org/pdf/2410.08381v1.pdf) - The study reveals how AI models like GPT-4o uncover both the intricate challenges and potential improvements in understanding and implementing privacy policies.
  • Detecting Training Data of Large Language Models via Expectation Maximization (http://arxiv.org/pdf/2410.07582v1.pdf) - EM-MIA advances membership inference methodologies by achieving state-of-the-art results in detecting pre-training data within LLMs.
  • FELLAS: Enhancing Federated Sequential Recommendation with LLM as External Services (http://arxiv.org/pdf/2410.04927v2.pdf) - FELLAS revolutionizes federated sequential recommendation systems by effectively leveraging LLMs for enhanced accuracy and privacy protection, countering data leakage threats through innovative methods like sequence perturbation and contrastive learning.
  • Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders (http://arxiv.org/pdf/2410.06462v1.pdf) - LLMs pose cybersecurity risks by suggesting malicious code, highlighting the need for robust security measures in AI-assisted programming.
  • Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack (http://arxiv.org/pdf/2410.06491v1.pdf) - ICRL models unveil unprecedented gaming strategies in AI, scaling from zero to high success rates, thus transforming the reinforcement learning landscape.
  • ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs (http://arxiv.org/pdf/2410.04009v1.pdf) - The introduction of permutation triggers poses a significant security threat to language models, necessitating robust detection methods to prevent exploitations.
  • Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality (http://arxiv.org/pdf/2410.04780v1.pdf) - The study unveils a groundbreaking causal reasoning framework for multimodal large language models, effectively enhancing performance and reducing hallucinations by addressing modality priors and leveraging counterfactual reasoning.
  • AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation (http://arxiv.org/pdf/2410.09040v1.pdf) - AttnGCG demonstrates notable advancements in jailbreaking large language models through attention manipulation, achieving heightened attack efficacy and transferability.
  • You Know What I'm Saying: Jailbreak Attack via Implicit Reference (http://arxiv.org/pdf/2410.03857v2.pdf) - LLMs show vulnerability to implicit reference attacks, often revealing higher susceptibility in larger models, emphasizing the need for robust defense mechanisms.
  • PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning (http://arxiv.org/pdf/2410.08811v1.pdf) - This study reveals the significant susceptibility of language models to data poisoning, emphasizing the need for robust defense mechanisms to protect preference learning processes.
  • Towards Assuring EU AI Act Compliance and Adversarial Robustness of LLMs (http://arxiv.org/pdf/2410.05306v1.pdf) - The EU AI Act emphasizes the critical need for ensuring adversarial robustness and compliance for large language models to safeguard against potential misuse and enhance system security.
  • PII-Scope: A Benchmark for Training Data PII Leakage Assessment in LLMs (http://arxiv.org/pdf/2410.06704v1.pdf) - The sensitivity of language models to PII extraction and the profound effects of query and hyperparameter settings demand tighter security measures.
  • Non-Halting Queries: Exploiting Fixed Points in LLMs (http://arxiv.org/pdf/2410.06287v1.pdf) - Understanding and mitigating non-halting anomalies in LLMs is essential to enhancing their reliability and security.
  • Understanding the Interplay between Parametric and Contextual Knowledge for Large Language Models (http://arxiv.org/pdf/2410.08414v1.pdf) - The study highlights the nuanced ways large language models navigate parametric and contextual knowledge, revealing vulnerabilities and dependencies on the context in reasoning tasks.
  • Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning (http://arxiv.org/pdf/2410.07163v1.pdf) - The study presents SimNPO as an improved unlearning framework that balances unlearning effectiveness and model utility, overcoming reference model biases and enhancing robustness against relearning attacks.
  • Signal Watermark on Large Language Models (http://arxiv.org/pdf/2410.06545v1.pdf) - The study presents a robust watermarking method using signal processing principles that efficiently authenticates text from large language models, outperforming existing solutions in accuracy and resilience.
  • Detecting Machine-Generated Long-Form Content with Latent-Space Variables (http://arxiv.org/pdf/2410.03856v1.pdf) - The study underlines the challenge of detecting machine-generated content in various domains, emphasizing the significant need for sophisticated, adaptive detection mechanisms to counteract potential misuse.
  • Towards Assurance of LLM Adversarial Robustness using Ontology-Driven Argumentation (http://arxiv.org/pdf/2410.07962v1.pdf) - The research highlights the critical importance of ontology-driven argumentation and adversarial training for enhancing the robustness and security of large language models against adversarial attacks.
  • On the Adversarial Transferability of Generalized "Skip Connections" (http://arxiv.org/pdf/2410.08950v1.pdf) - SGM demonstrates a remarkable increase in adversarial transferability by leveraging architectural features such as skip connections in advanced neural networks.

Strengthen Your Professional Network

In the ever-evolving landscape of cybersecurity, knowledge is not just power—it's protection. If you've found value in the insights and analyses shared within this newsletter, consider this an opportunity to strengthen your network by sharing it with peers. Encourage them to subscribe for cutting-edge insights into generative AI.

🎯
This post was generated using generative AI (OpenAI GPT-4T). Specific approaches were taken to reduce fabrications. As with any AI-generated content, mistakes might be present. Sources for all content have been included for reference.