Research

Last Week in GAI Security Research - 10/07/24

Highlights from Last Week

🫥 The Perfect Blend: Redefining RLHF with Mixture of Judges
🔏 Confidential Prompting: Protecting User Prompts from Cloud LLM Providers
🦺 Overriding Safety protections of Open-source Models
🤖 Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
🧠 Undesirable Memorization in Large Language Models: A Survey
🚧 The potential of LLM-generated reports in DevSecOps
🏋 AutoPenBench: Benchmarking Generative Agents for Penetration Testing

Partner Content

Codemod is the end-to-end platform for code automation at scale. Save days of work by running recipes to automate framework upgrades

Leverage the AI-powered Codemod Studio for quick and efficient codemod creation, coupled with the opportunity to engage in a vibrant community for sharing and discovering code automations.
Streamline project migrations with seamless one-click dry-runs and easy application of changes, all without the need for deep automation engine knowledge.
Boost large team productivity with advanced enterprise features, including task automation and CI/CD integration, facilitating smooth, large-scale code deployments.

🫥 The Perfect Blend: Redefining RLHF with Mixture of Judges (http://arxiv.org/pdf/2409.20370v1.pdf)

The Constrained Generative Policy Optimization (CGPO) framework developed significantly mitigates reward hacking and optimizes multiple objectives for Large Language Models (LLMs) post-training.
CGPO outperformed existing Reinforcement Learning from Human Feedback (RLHF) algorithms by 7.4% on benchmark tasks and various coding and math challenges, demonstrating its superior optimization capabilities.
The Mixture Judges (MoJs) approach within CGPO offers a scalable and effective solution to handle multi-constraint, multi-objective optimizations, reducing false refusal ratios and enhancing model alignment across diverse tasks.

🔏 Confidential Prompting: Protecting User Prompts from Cloud LLM Providers (http://arxiv.org/pdf/2409.19134v1.pdf)

Prompt Obfuscation (PO) maintains LLM efficiency while protecting sensitive prompts by creating indistinguishable fake n-grams, preserving output confidentiality with minimal computational overhead.
Secure Multi-party Decoding (SMD) allows shared yet secure LLM inference among multiple users by dividing the weights into private and public states, significantly improving efficiency compared to tradition Confidential Computing (CC) methods.
Experimentation demonstrated that the combined SMD+PO approach achieves up to 100 times better throughput and maintains consistent output for various model sizes and user loads, making it scalable for large-scale deployments.

🦺 Overriding Safety protections of Open-source Models (http://arxiv.org/pdf/2409.19476v1.pdf)

Fine-tuning models with harmful data significantly increases the Attack Success Rate (ASR) by 35%, while fine-tuning with safe data decreases the ASR by 51.68%.
Safe fine-tuned models demonstrate higher trustworthiness with lower perplexity and entropy compared to harmful fine-tuned models, which exhibit higher uncertainty and knowledge drift.
Accuracy for safe fine-tuned models drops by 6% under false information prompts, compared to basemodels which see an 11% accuracy drop.

🤖 Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents (http://arxiv.org/pdf/2410.02644v1.pdf)

Memory retrieval attacks against LLM agents showed a success rate of 84.30%, indicating a major vulnerability.
Existing defenses were found largely ineffective in protecting LLM agents from adversarial techniques, highlighting the need for stronger security measures.
Prompt injection and memory poisoning attacks pose significant challenges, with evaluations showing a high average attack success rate of 72.68%.

🧠 Undesirable Memorization in Large Language Models: A Survey (http://arxiv.org/pdf/2410.02650v1.pdf)

Large language models (LLMs) exhibit memorization capabilities that can lead to privacy and security risks due to the unintended retention of sensitive information.
Models with larger parameter counts and duplicated training data amplify the degree of memorization, requiring systematic strategies to mitigate potential privacy breaches.
Differential privacy and unlearning methods are explored as techniques to safeguard against the exposure of sensitive data while maintaining LLM performance.

🚧 The potential of LLM-generated reports in DevSecOps (http://arxiv.org/pdf/2410.01899v1.pdf)

The misuse of DevSecOps tools leads to alert fatigue, where false positive rates exceed 50%, resulting in desensitization and oversight of critical vulnerabilities.
LLM-generated security reports show promise in motivating action among developers by clearly outlining security issues and potential financial impacts, although their motivational effectiveness varies by report style.
The financial impact of unaddressed vulnerabilities, exemplified by cases like the 2017 Equifax breach, highlights the severe consequences including financial losses, reputational damage, and legal repercussions.

🏋 AutoPenBench: Benchmarking Generative Agents for Penetration Testing (http://arxiv.org/pdf/2410.03225v1.pdf)

The fully autonomous generative agent achieved a 21% success rate in penetration tasks, indicating limitations in handling complex cybersecurity scenarios autonomously.
Agents using Large Language Models, such as GPT-4o, displayed a 64% success rate when operating with human assistance, revealing the advantage of semi-autonomous collaboration.
The AUTOPENBENCH framework offers 33 tasks with varying difficulty, serving as an open-source benchmark to evaluate the performance and extend the capabilities of generative agents in penetration testing.

Other Interesting Research

Code Vulnerability Repair with Large Language Model using Context-Aware Prompt Tuning (http://arxiv.org/pdf/2409.18395v1.pdf) - Context-aware prompt tuning amplifies the repair success rate of LLMs for buffer overflow vulnerabilities from 15% to 63%, underscoring the importance of domain knowledge in prompt design.
GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks (http://arxiv.org/pdf/2409.19521v1.pdf) - GenTel-Shield demonstrated superior performance in detecting diverse prompt injection attacks, setting new benchmarks for LLM safety and security.
System-Level Defense against Indirect Prompt Injection Attacks: An Information Flow Control Perspective (http://arxiv.org/pdf/2409.19091v1.pdf) - The f-secure LLM system significantly enhances defense against prompt injection attacks while maintaining high efficiency and functionality.
Robust LLM safeguarding via refusal feature adversarial training (http://arxiv.org/pdf/2409.20089v1.pdf) - Refusal Feature Adversarial Training (ReFAT) significantly boosts LLM robustness against adversarial attacks with minimal computational overhead.
Mitigating Backdoor Threats to Large Language Models: Advancement and Challenges (http://arxiv.org/pdf/2409.19993v1.pdf) - The integration of backdoors in large language models through hidden triggers poses significant risks, necessitating advanced detection and mitigation strategies to safeguard against malicious behaviors.
TrojVLM: Backdoor Attack Against Vision Language Models (http://arxiv.org/pdf/2409.19232v1.pdf) - TrojVLM reveals significant vulnerabilities in Vision Language Models, achieving high success rates in backdoor attacks while preserving text output quality and semantic coherence.
Federated Instruction Tuning of LLMs with Domain Coverage Augmentation (http://arxiv.org/pdf/2409.20135v1.pdf) - FedDCA substantially boosts domain-specific LLM performance and privacy preservation in federated learning using innovative domain coverage strategies.
The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems (http://arxiv.org/pdf/2409.20002v1.pdf) - LLM-based systems are susceptible to novel timing channel attacks that can compromise the confidentiality of prompts, necessitating immediate mitigation strategies.
Efficient Federated Intrusion Detection in 5G ecosystem using optimized BERT-based model (http://arxiv.org/pdf/2409.19390v1.pdf) - The integration of federated learning and large language models like BERT offers a scalable, privacy-preserving, and highly efficient solution for intrusion detection in 5G networks.
MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models (http://arxiv.org/pdf/2409.19492v1.pdf) - Harnessing expert feedback markedly enhances the ability of LLMs to detect healthcare-related hallucinations, bridging the gap between laypeople and expert performance in this critical area.
Wait, but Tylenol is Acetaminophen... Investigating and Improving Language Models' Ability to Resist Requests for Misinformation (http://arxiv.org/pdf/2409.20385v1.pdf) - Enhancing the alignment and fine-tuning processes of LLMs significantly improves their ability to handle misleading and high-stakes requests, reducing the risk of misinformation.
Ranking Over Scoring: Towards Reliable and Robust Automated Evaluation of LLM-Generated Medical Explanatory Arguments (http://arxiv.org/pdf/2409.20565v1.pdf) - The novel proxy task-based evaluation methodology for LLM-generated medical arguments aligns effectively with human judgments, offering a scalable and robust alternative to traditional evaluation methods.
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models (http://arxiv.org/pdf/2410.02298v1.pdf) - Jailbreak Antidote offers a scalable, adaptive solution for enhancing AI safety in real-time without sacrificing utility, showing high efficacy across diverse LLMs and jailbreak scenarios.
Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation (http://arxiv.org/pdf/2410.02220v1.pdf) - Enhancing language model safety through comprehensive multi-stage defense and data curation offers promising results in overcoming adversarial threats.
Adversarial Suffixes May Be Features Too! (http://arxiv.org/pdf/2410.00451v1.pdf) - Adversarial suffixes effectively exploit vulnerabilities in language models, challenging current safety alignment and fine-tuning strategies.
Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems (http://arxiv.org/pdf/2410.02506v1.pdf) - AgentPrune optimizes multi-agent communication by reducing redundant messages, lowering token costs, and enhancing robustness against attacks.
Domain-Specific Retrieval-Augmented Generation Using Vector Stores, Knowledge Graphs, and Tensor Factorization (http://arxiv.org/pdf/2410.02721v1.pdf) - Leveraging domain-specific knowledge graphs and augmentation techniques substantially boosts the performance and reliability of language models in specialized contexts.
Backdooring Vision-Language Models with Out-Of-Distribution Data (http://arxiv.org/pdf/2410.01264v1.pdf) - Discover how attackers can exploit vision-language models using clever backdoors and OOD data techniques to infiltrate high-stakes AI tasks.
Endless Jailbreaks with Bijection Learning (http://arxiv.org/pdf/2410.01294v1.pdf) - Universal jailbreaks exploiting bijection learning highlight critical vulnerabilities in large language models, with significant implications for model safety and security.
HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models (http://arxiv.org/pdf/2410.01524v1.pdf) - HarmAug surpasses the performance of larger, more resource-intensive safety guard models in detecting harmful LLM-generated instructions while drastically lowering computational demands.
Controlled Generation of Natural Adversarial Documents for Stealthy Retrieval Poisoning (http://arxiv.org/pdf/2410.02163v1.pdf) - Low-perplexity adversarial document generation promotes stealthy information retrieval poisoning, evading detection filters while retaining semantic integrity.
Discovering Clues of Spoofed LM Watermarks (http://arxiv.org/pdf/2410.02693v1.pdf) - Spoofing attacks can compromise LLM watermarking, but statistical tests and advanced methods can effectively detect such manipulations.
DomainLynx: Leveraging Large Language Models for Enhanced Domain Squatting Detection (http://arxiv.org/pdf/2410.02095v1.pdf) - DomainLynx advances domain squatting detection, offering enhanced accuracy and cost-efficiency with AI-driven threat analysis.
PclGPT: A Large Language Model for Patronizing and Condescending Language Detection (http://arxiv.org/pdf/2410.00361v1.pdf) - Enhanced PclGPT model significantly advances the detection of subtle patronizing and condescending language, highlighting improved biases management and recognition capabilities across diverse multilingual contexts.
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester (http://arxiv.org/pdf/2410.01606v1.pdf) - The GOAT system for automated red teaming reveals critical vulnerabilities in large language models, achieving near-perfect attack success rates.
Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation (http://arxiv.org/pdf/2410.02220v2.pdf) - The study highlights a revolutionary approach to enhancing language model security through a data-driven, multi-stage defensive framework that shows unparalleled effectiveness in thwarting jailbreaking attempts.
FlipAttack: Jailbreak LLMs via Flipping (http://arxiv.org/pdf/2410.02832v1.pdf) - FlipAttack innovates LLM jailbreaks using left-side noise, demonstrating impressive success rates and limitations of current defenses.
Safeguard is a Double-edged Sword: Denial-of-service Attack on Large Language Models (http://arxiv.org/pdf/2410.02916v1.pdf) - This study reveals a significant gap in current LLM safeguards, highlighting the ongoing vulnerability to creative adversarial attacks despite advanced safety measures.
Demonstration Attack against In-Context Learning for Code Intelligence (http://arxiv.org/pdf/2410.02841v1.pdf) - DICE showcases the stark vulnerability of LLMs to subtle adversarial inputs, compromising code intelligence tasks while evading current defenses.
Deceptive Risks in LLM-enhanced Robots (http://arxiv.org/pdf/2410.00434v1.pdf) - The integration of LLMs into social robots highlights both the transformative potential and critical safety challenges in healthcare applications.
Universally Optimal Watermarking Schemes for LLMs: from Theory to Practice (http://arxiv.org/pdf/2410.02890v1.pdf) - This paper introduces a robust, model-agnostic token-level watermarking scheme that effectively identifies AI-generated content by optimizing detector performance and minimizing errors, while demonstrating resilience against various attacks.
Can Watermarked LLMs be Identified by Users via Crafted Prompts? (http://arxiv.org/pdf/2410.03168v1.pdf) - The research introduces innovative watermarking detection techniques that enhance imperceptibility without compromising detection accuracy in large language models.
Optimizing Adaptive Attacks against Content Watermarks for Language Models (http://arxiv.org/pdf/2410.02440v1.pdf) - The study challenges the security of watermarking methods against adaptive attacks, revealing significant vulnerabilities in watermark robustness.
Federated Instruction Tuning of LLMs with Domain Coverage Augmentation (http://arxiv.org/pdf/2409.20135v3.pdf) - FedDIT and FedDCA enhance domain-specific language model performance while preserving privacy through federated instruction tuning and domain coverage augmentation.
Position: LLM Unlearning Benchmarks are Weak Measures of Progress (http://arxiv.org/pdf/2410.02879v1.pdf) - The paper highlights significant issues with current LLM unlearning benchmarks and calls for improved, threat model-aware methodologies to ensure reliable privacy and performance outcomes in unlearning scenarios.
RAFT: Realistic Attacks to Fool Text Detectors (http://arxiv.org/pdf/2410.03658v1.pdf) - RAFT attacks highlight critical vulnerabilities in LLM text detectors, driving an urgent need for improved adversarial robustness and resilient detection strategies.
Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering (http://arxiv.org/pdf/2410.03466v1.pdf) - Exploring the balance between safety guardrails and argumentative strength reveals critical trade-offs in automated counterspeech generation.
Vulnerability Detection via Topological Analysis of Attention Maps (http://arxiv.org/pdf/2410.03470v1.pdf) - The study showcases the successful application of topological data analysis on BERT attention maps for improved software vulnerability detection, offering an innovative enhancement over traditional machine learning methods.

Strengthen Your Professional Network

In the ever-evolving landscape of cybersecurity, knowledge is not just power—it's protection. If you've found value in the insights and analyses shared within this newsletter, consider this an opportunity to strengthen your network by sharing it with peers. Encourage them to subscribe for cutting-edge insights into generative AI.

🎯

This post was generated using generative AI (OpenAI GPT-4o). Specific approaches were taken to reduce fabrications. As with any AI-generated content, mistakes might be present. Sources for all content have been included for reference.

Last Week in GAI Security Research - 10/07/24

Highlights from Last Week

Partner Content

Other Interesting Research

Strengthen Your Professional Network

Read next

Last Week in GAI Security Research - 11/18/24

Hire an AI Strategist in Security

Last Week in GAI Security Research - 09/23/24