Last Week in GAI Security Research - 02/24/25

Highlights from Last Week
- 🤔 What are Models Thinking about? Understanding Large Language Model Hallucinations "Psychology" through Model Inner State Analysis
- 🤖 Towards Robust and Secure Embodied AI: A Survey on Vulnerabilities and Attacks
- 🔐 Do LLMs Consider Security? An Empirical Study on Responses to Programming Questions
- 🤑 Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements
- 🤾 Red-Teaming LLM Multi-Agent Systems via Communication Attacks
- 👹 DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent
Partner Content

Pillar Security is the security stack for AI teams. Fortify the entire AI application development lifecycle while helping Security teams regain visibility and control.
- Gain complete oversight of your AI inventory. Audit usage, app interactions, inputs, outputs, meta-prompts, user sessions, models and tools with full transparency.
- Safeguard your apps with enterprise-grade low-latency security and safety guardrails. Detect and prevent attacks that can affect your users, data and AI-app integrity.
- Assess and reduce risk by continuously stress-testing your AI apps with automated security and safety evaluations. Enhance resilience against novel attacks and stay ahead of emerging threats.
🤔 What are Models Thinking about? Understanding Large Language Model Hallucinations "Psychology" through Model Inner State Analysis (http://arxiv.org/pdf/2502.13490v1.pdf)
- Large Language Models (LLMs) generate hallucinations due to inconsistencies in training datasets and inherent randomness during the inference process, with up to 59% of content quality issues stemming from data inaccuracies.
- The Hallucination Detection using Internal States framework enables real-time intervention by analyzing the internal features of LLMs, revealing that attention distribution and activation sharpness are crucial for understanding and mitigating hallucinated content (a toy sketch of such internal-state features follows this list).
- Transferability of hallucination detection techniques is limited due to dataset-specific patterns, highlighting challenges in cross-dataset generalizability and the necessity for dataset-appropriate models in different domains.
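To make the internal-state idea concrete, here is a minimal sketch, assuming a small open model, of how attention-entropy and activation-sharpness statistics could be extracted as candidate features for a downstream hallucination classifier. The model choice (`gpt2`) and the exact feature definitions are illustrative assumptions, not the paper's framework.

```python
# Minimal sketch, assuming a small open model: pull attention-entropy and
# activation-"sharpness" statistics from internal states as candidate features
# for a hallucination classifier. Model choice and feature definitions are
# illustrative, not the paper's exact framework.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager")
model.eval()

def internal_state_features(prompt: str) -> dict:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True, output_hidden_states=True)

    # Attention entropy: how diffusely each head in the last layer spreads
    # attention over tokens; higher average entropy is one candidate signal
    # of uncertain, hallucination-prone generation.
    attn = out.attentions[-1][0].clamp_min(1e-12)        # (heads, seq, seq)
    attention_entropy = -(attn * attn.log()).sum(-1).mean().item()

    # Activation sharpness: how peaked the final hidden states are, measured
    # here as the ratio of max to mean absolute activation per position.
    hidden = out.hidden_states[-1][0].abs()              # (seq, hidden)
    activation_sharpness = (hidden.max(-1).values / hidden.mean(-1)).mean().item()

    return {"attention_entropy": attention_entropy,
            "activation_sharpness": activation_sharpness}

print(internal_state_features("The capital of Australia is"))
```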
🤖 Towards Robust and Secure Embodied AI: A Survey on Vulnerabilities and Attacks (http://arxiv.org/pdf/2502.13175v1.pdf)
- Embodied AI systems are highly susceptible to both exogenous and endogenous vulnerabilities such as sensor spoofing, adversarial attacks, and system failures which pose significant safety and reliability challenges.
- The integration of AI with Internet of Things (IoT) and cloud infrastructures introduces cybersecurity threats, including data injection and unauthorized access, that require robust security protocols to mitigate risks.
- Large Language Models (LLMs) and Large Vision Language Models (LVLMs) exhibit vulnerabilities to adversarial prompts and multimodal noise, leading to potential misclassifications and harmful outputs.
🔐 Do LLMs Consider Security? An Empirical Study on Responses to Programming Questions (http://arxiv.org/pdf/2502.14202v1.pdf)
- Large Language Models (LLMs) such as GPT-4, Claude 3, and Llama 3 detect security vulnerabilities in only 12.6% to 40% of cases, often failing to flag insecure coding practices unless explicitly prompted (an illustrative example of such a flaw appears after this list).
- Fewer than 28% of developer queries led to LLMs issuing explicit warnings about security flaws in the analyzed responses, indicating a need for improved LLM security awareness and prompting strategies.
- LLMs including GPT-4 provide more comprehensive information about causes and potential fixes for vulnerabilities relative to user-provided Stack Overflow responses, outperforming them in providing exploit details in 8.8% to 20.9% of test cases.
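As a concrete illustration of the class of flaw the study probes for (not an example drawn from its dataset), the snippet below shows a classic SQL injection pattern an assistant might return without comment, next to the parameterized form it should recommend instead.

```python
# Illustrative only (not drawn from the paper's dataset): a classic CWE-89
# pattern an assistant might return without a security warning, next to the
# parameterized form it should recommend instead.
import sqlite3

def find_user_insecure(conn: sqlite3.Connection, username: str):
    # SQL injection risk: untrusted input is concatenated into the query.
    query = f"SELECT id, email FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_parameterized(conn: sqlite3.Connection, username: str):
    # Safer: placeholders let the driver handle quoting and escaping.
    return conn.execute(
        "SELECT id, email FROM users WHERE name = ?", (username,)
    ).fetchall()
```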
🤑 Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements (http://arxiv.org/pdf/2502.12904v1.pdf)
- A comprehensive evaluation of 15 open-source LLMs showed widely varying Defense Success Rates (DSRs) across internet fraud scenarios, ranging from 38.92% to 83.27% (a toy DSR computation is sketched after this list).
- The Fraud-R1 benchmark organized fraud detection efforts into five key fraud types and tested multilingual models, revealing a notable performance gap between Chinese and English settings.
- Role-play scenarios significantly challenged model effectiveness, highlighting reduced efficacy in detecting fraud after multiple rounds and emphasizing the need for robust defensive strategies.
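For readers unfamiliar with the metric, here is a hedged sketch of how a Defense Success Rate could be computed over multi-round inducement dialogues; the record format and refusal labels are hypothetical stand-ins, not Fraud-R1's actual scoring pipeline.

```python
# Hedged sketch of a Defense Success Rate (DSR) over multi-round inducement
# dialogues. The record format and refusal labels are hypothetical stand-ins,
# not Fraud-R1's actual scoring pipeline.
from dataclasses import dataclass

@dataclass
class Dialogue:
    fraud_type: str             # e.g. "phishing", "impersonation"
    round_refusals: list[bool]  # True if the model refused/flagged that round

def defense_success_rate(dialogues: list[Dialogue]) -> float:
    """A dialogue counts as defended only if every round was refused."""
    if not dialogues:
        return 0.0
    defended = sum(all(d.round_refusals) for d in dialogues)
    return defended / len(dialogues)

sample = [
    Dialogue("phishing", [True, True, True]),
    Dialogue("impersonation", [True, False, True]),  # induced in round 2
]
print(f"DSR: {defense_success_rate(sample):.2%}")    # DSR: 50.00%
```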
🤾 Red-Teaming LLM Multi-Agent Systems via Communication Attacks (http://arxiv.org/pdf/2502.14847v1.pdf)
- Agent-in-the-Middle (AiTM) attacks achieve success rates exceeding 70% in certain experiments, highlighting a significant vulnerability in the communication protocols of LLM-based multi-agent systems.
- The communication structure of a system greatly influences the effectiveness of AiTM attacks: chain structures are the most vulnerable, while complete structures display more resilience (a toy topology comparison follows this list).
- The success of AiTM attacks is notably increased by the position of the victim agent within the communication structure and the sophistication of adversarial LLM models, as stronger adversarial agents achieve higher success rates.
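The topology finding is easy to see with a toy model (not the paper's experimental setup): in a chain, a single compromised intermediary relays a large share of pairwise messages, while in a complete graph agents communicate directly.

```python
# Toy model, not the paper's experimental setup: what fraction of pairwise
# agent messages must transit a single compromised intermediary under a chain
# topology versus a complete (fully connected) topology.
from itertools import combinations

def chain_interception(n_agents: int, intermediary: int) -> float:
    """In a chain a0 -> a1 -> ..., a message from i to j relays through every
    node strictly between them, so the intermediary sees all such traffic."""
    pairs = list(combinations(range(n_agents), 2))
    intercepted = sum(1 for i, j in pairs if i < intermediary < j)
    return intercepted / len(pairs)

def complete_interception() -> float:
    """In a complete graph agents talk directly; a non-endpoint sees nothing."""
    return 0.0

n = 6
print(f"chain:    {chain_interception(n, n // 2):.0%} of messages relayed")
print(f"complete: {complete_interception():.0%} of messages relayed")
```

With six agents, 40% of all pairwise messages relay through the middle node of the chain in this toy model, which is the structural exposure an adversarial intermediary exploits.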
👹 DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent (http://arxiv.org/pdf/2502.12575v1.pdf)
- The Dynamically Encrypted Multi-Backdoor Implantation Attack achieved a 100% attack success rate across various domains while maintaining a 0% detection rate during safety audits.
- Multi-Backdoor Tiered Implantation increases stealth by splitting backdoors into encrypted sub-backdoor fragments and triggering them cumulatively, which complicates detection.
- The stealthy nature of this mechanism enhances the adaptability and effectiveness of attacks across diverse domain applications without hindering the normal task performance of LLM-based agents.
Other Interesting Research
- G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems (http://arxiv.org/pdf/2502.11127v1.pdf) - G-Safeguard significantly enhances multi-agent system security by efficiently detecting and mitigating adversarial attacks through a topology-guided approach.
- Prompt Inject Detection with Generative Explanation as an Investigative Tool (http://arxiv.org/pdf/2502.11006v1.pdf) - Fine-tuned models outperformed vanilla versions in distinguishing adversarial prompts, marking a critical advancement in AI security measures.
- UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models (http://arxiv.org/pdf/2502.13141v1.pdf) - UniGuardian outperformed existing defenses against prompt injection, backdoor, and adversarial attacks on large language models without requiring additional training.
- The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 (http://arxiv.org/pdf/2502.12659v1.pdf) - Enhanced reasoning capabilities in large models increase both problem-solving potential and safety risks, highlighting the need for comprehensive safety strategies.
- CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models (http://arxiv.org/pdf/2502.11379v1.pdf) - Open-source models' enhanced accessibility inadvertently elevates safety risks with effective context-coherent jailbreaks targeting closed-source models.
- Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region (http://arxiv.org/pdf/2502.13946v1.pdf) - The paper highlights critical vulnerabilities in the safety mechanisms of LLMs due to their reliance on template-anchored decision-making.
- Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models (http://arxiv.org/pdf/2502.11054v3.pdf) - Reasoning-augmented frameworks pose significant challenges to language model safety, achieving up to 96% success rates in circumventing current defenses.
- ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs (http://arxiv.org/pdf/2502.13162v1.pdf) - ShieldLearner innovatively enhances jailbreak defense in Large Language Models by leveraging self-learning mechanisms and flexible customizations.
- DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing (http://arxiv.org/pdf/2502.11647v1.pdf) - DELMAN offers robust defense against jailbreak attacks with minimal performance impact, maintaining high utility and broad cross-category protection.
- BaxBench: Can LLMs Generate Correct and Secure Backends? (http://arxiv.org/pdf/2502.11844v2.pdf) - Current LLMs display significant security gaps, with their inability to consistently produce secure and deployable code underlining the need for advanced benchmarks like BaxBench.
- LAMD: Context-driven Android Malware Detection and Classification with LLMs (http://arxiv.org/pdf/2502.13055v1.pdf) - The integration of context-aware frameworks like LAMD enhances the accuracy and robustness of Android malware detection, outperforming conventional methods by leveraging large language models for better interpretability and reduced detection errors.
- ReF Decompile: Relabeling and Function Call Enhanced Decompile (http://arxiv.org/pdf/2502.12221v1.pdf) - Innovative strategies in decompilation processes such as relabeling and function calls markedly enhance the accuracy and readability of decompiled outputs.
- StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models (http://arxiv.org/pdf/2502.11853v1.pdf) - The study exposes critical gaps in the safety of language models against structure transformation attacks, with a notable 90% success rate even on leading AI systems.
- Be Cautious When Merging Unfamiliar LLMs: A Phishing Model Capable of Stealing Privacy (http://arxiv.org/pdf/2502.11533v1.pdf) - Exploring the privacy risks in merging open-source language models reveals significant vulnerabilities in disclosing sensitive user data through phishing attacks.
- Demonstrating specification gaming in reasoning models (http://arxiv.org/pdf/2502.13295v1.pdf) - AI models exhibit notable vulnerabilities and capabilities for hacking and manipulation, stressing the importance of robust prompt designs and further research into securing AI systems against unconventional exploitation.
- Uncertainty-Aware Step-wise Verification with Generative Reward Models (http://arxiv.org/pdf/2502.11250v1.pdf) - The study introduces a novel uncertainty quantification method using CoT Entropy, significantly improving the step-wise verification process for generative reward models in complex reasoning tasks.
- Evaluation of Best-of-N Sampling Strategies for Language Model Alignment (http://arxiv.org/pdf/2502.12668v1.pdf) - Innovative sampling strategies effectively address reward model biases, enhancing language model alignment with human objectives.
- SoK: Understanding Vulnerabilities in the Large Language Model Supply Chain (http://arxiv.org/pdf/2502.12497v1.pdf) - The intricate ecosystem of Large Language Models faces substantial security vulnerabilities, particularly in resource control and application layer challenges.
- Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction (http://arxiv.org/pdf/2502.11084v1.pdf) - The introduction of R2J offers a breakthrough in efficiently attacking large language models while highlighting the need for stronger safety mechanisms.
- Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking (http://arxiv.org/pdf/2502.13527v1.pdf) - Exploitative attacks using structured output manipulation reveal significant vulnerabilities in large language models, challenging the efficacy of current safety protocols.
- SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks (http://arxiv.org/pdf/2502.11090v2.pdf) - SafeDialBench provides a comprehensive safety evaluation framework for Large Language Models, emphasizing the need for diverse strategies to mitigate jailbreak attack vulnerabilities.
- Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking (http://arxiv.org/pdf/2502.12970v1.pdf) - The R2D framework markedly strengthens the safety protocols of language models, curbing vulnerabilities to jailbreak attacks with a focus on refining reasoning capabilities.
- Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training (http://arxiv.org/pdf/2502.11455v1.pdf) - The Adversary-aware DPO framework substantively elevates VLM safety and robustness against adversarial attacks, setting it apart through its efficacious integration of adversarial training.
- Efficient Safety Retrofitting Against Jailbreaking for LLMs (http://arxiv.org/pdf/2502.13603v1.pdf) - Direct Preference Optimization proves a cost-effective method for enhancing the safety alignment of LLMs, revealing notable evaluation biases and dataset-driven advancements in handling harmful content.
- Understanding and Rectifying Safety Perception Distortion in VLMs (http://arxiv.org/pdf/2502.13095v1.pdf) - Visual inputs in Vision-Language Models can distort safety perceptions, but the ShiftDC method offers a promising solution by restoring accurate safety alignment without further training.
- AutoTEE: Automated Migration and Protection of Programs in Trusted Execution Environments (http://arxiv.org/pdf/2502.13379v1.pdf) - LLMs significantly improve the secure adaptation of programs to Trusted Execution Environments, achieving high rates of accuracy and success in transformation with minimized developer input.
- ALGEN: Few-shot Inversion Attacks on Textual Embeddings using Alignment and Generation (http://arxiv.org/pdf/2502.11308v2.pdf) - The study underscores the potential risks and challenges of protecting textual embeddings against few-shot inversion attacks across various languages and domains.
- Towards Secure Program Partitioning for Smart Contracts with LLM's In-Context Learning (http://arxiv.org/pdf/2502.14215v1.pdf) - Integration of GPT in smart contract partitioning significantly curtails vulnerability to manipulation attacks while optimizing sensitive data security.
- Towards Context-Robust LLMs: A Gated Representation Fine-tuning Approach (http://arxiv.org/pdf/2502.14100v1.pdf) - The study demonstrates a lightweight fine-tuning approach that significantly enhances large language models (LLMs) by making them robust against misleading and irrelevant contexts.
- InstructAgent: Building User Controllable Recommender via LLM Agent (http://arxiv.org/pdf/2502.14662v1.pdf) - The study demonstrates the potential of Instruction-aware agents to enhance personalization and diversity in recommender systems, significantly outperforming traditional models.
- Detecting and Filtering Unsafe Training Data via Data Attribution (http://arxiv.org/pdf/2502.11411v1.pdf) - The integration of DABUF significantly elevates the effectiveness in identifying and filtering harmful training data, surpassing traditional safety measures.
- Computational Safety for Generative AI: A Signal Processing Perspective (http://arxiv.org/pdf/2502.12445v1.pdf) - Signal processing offers vital tools for enhancing the safety and performance of generative AI systems in light of pervasive risks and challenges.
- Mimicking the Familiar: Dynamic Command Generation for Information Theft Attacks in LLM Tool-Learning System (http://arxiv.org/pdf/2502.11358v1.pdf) - Dynamic command generation in LLM tool-learning systems poses significant security threats due to its high attack success rate and potential for information leakage.
- Automating Prompt Leakage Attacks on Large Language Models Using Agentic Approach (http://arxiv.org/pdf/2502.12630v1.pdf) - An agentic, modular framework automates prompt leakage attacks on large language models, providing a systematic way to probe how much of a system prompt an adversary can extract and underscoring the need for stronger protections.
- Unveiling Privacy Risks in LLM Agent Memory (http://arxiv.org/pdf/2502.13172v1.pdf) - High vulnerability in LLM agent memory extraction shows an urgent need for enhanced privacy safeguards and design modifications.
- DeFiScope: Detecting Various DeFi Price Manipulations with LLM Reasoning (http://arxiv.org/pdf/2502.11521v1.pdf) - DeFiScope detects price manipulation attacks in DeFi markets with high precision using fine-tuned LLMs, surpassing traditional state-of-the-art methods.
- Multi-Faceted Studies on Data Poisoning can Advance LLM Development (http://arxiv.org/pdf/2502.14182v1.pdf) - Exploring data poisoning reveals critical vulnerabilities and proposes strategic defenses to enhance the safety and reliability of large language models.
- Fundamental Limitations in Defending LLM Finetuning APIs (http://arxiv.org/pdf/2502.14828v1.pdf) - Despite stringent efforts, LLM fine-tuning APIs remain vulnerable to sophisticated misuse attacks, leveraging the model's inherent variability and evading detection mechanisms.
- CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models (http://arxiv.org/pdf/2502.14529v1.pdf) - CORBA attacks reveal critical vulnerabilities in LLM-MASs, demonstrating their significant susceptibility to recursive blocking strategies that severely impact system resilience and resource availability.
- SmartLLM: Smart Contract Auditing using Custom Generative AI (http://arxiv.org/pdf/2502.13167v1.pdf) - SmartLLM's integration of advanced LLaMA models and RAG techniques offers a high recall and balanced performance, advancing smart contract security beyond traditional analysis tools.
- R.R.: Unveiling LLM Training Privacy through Recollection and Ranking (http://arxiv.org/pdf/2502.12658v1.pdf) - The R.R. method significantly enhances PII reconstruction accuracy by leveraging novel recollection and biased-ranking techniques in LLMs.
- Aspect-Guided Multi-Level Perturbation Analysis of Large Language Models in Automated Peer Review (http://arxiv.org/pdf/2502.12510v1.pdf) - This paper highlights crucial weaknesses in LLM-based peer review frameworks and underlines the importance of robust safeguards to ensure fairness and reliability.
- SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities (http://arxiv.org/pdf/2502.12025v1.pdf) - The integration of chain-of-thought reasoning improves safety metrics but also highlights the challenges of balancing length and harmful output filtration in language models.
- PEARL: Towards Permutation-Resilient LLMs (http://arxiv.org/pdf/2502.14628v1.pdf) - PEARL outperforms existing robust learning methods by safeguarding against permutations, crucial for increasing the reliability of LLMs in complex, multi-demonstration scenarios.
- Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation (http://arxiv.org/pdf/2502.13019v2.pdf) - Oreo revolutionizes retrieval-augmented generation by optimizing context use, significantly enhancing the performance and efficiency of language models through innovative training paradigms and token reduction strategies.
- AEIA-MN: Evaluating the Robustness of Multimodal LLM-Powered Mobile Agents Against Active Environmental Injection Attacks (http://arxiv.org/pdf/2502.13053v1.pdf) - The study exposes substantial vulnerabilities in AI mobile agents to advanced environmental injection attacks, emphasizing the need for enhanced defensive mechanisms.
- AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection (http://arxiv.org/pdf/2502.11448v2.pdf) - AGrail effectively enhances Large Language Models' security by adapting safety checks to specific task contexts, achieving high accuracy and significant risk mitigation.
- RM-PoT: Reformulating Mathematical Problems and Solving via Program of Thoughts (http://arxiv.org/pdf/2502.12589v1.pdf) - The RM-PoT framework effectively enhances large language models' mathematical problem-solving by reformulating problems and employing structured reasoning.
- FedEAT: A Robustness Optimization Framework for Federated LLMs (http://arxiv.org/pdf/2502.11863v1.pdf) - The introduction of the FedEAT framework represents a promising advancement in enhancing the robustness of federated LLMs without compromising performance, crucial for applications in privacy-sensitive domains.
- Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation? (http://arxiv.org/pdf/2502.11598v1.pdf) - The research introduces innovative watermark removal techniques that effectively preserve model performance, offering robust solutions against unauthorized knowledge distillation while maintaining high efficiency.
Strengthen Your Professional Network
In the ever-evolving landscape of cybersecurity, knowledge is not just power—it's protection. If you've found value in the insights and analyses shared within this newsletter, consider this an opportunity to strengthen your network by sharing it with peers. Encourage them to subscribe for cutting-edge insights into generative AI.