Towards Large Language Models Robust to Adversarial Attacks
Abstract
This proposal aims to address the critical vulnerability of large language models (LLMs) to adversarial attacks, which can compromise the reliability and security of these models, particularly when deployed in safety-critical systems. Despite their widespread application across various domains, including naval operations, LLMs remain susceptible to crafted input sequences that can manipulate their behavior. Traditional methods for enhancing robustness, such as adversarial training and certified robustness, are impractical for LLMs due to their computational inefficiency and scalability issues. This project proposes the development of a new class of adversarially robust LLM training and inference methods tailored to overcome these challenges. By exploring techniques like slow adversarial training and representation control, the project seeks to significantly advance the current state of robustness in LLMs. Success in this endeavor could have far-reaching implications, securing LLMs against malicious inputs and ensuring their safe and effective deployment in mission-critical environments, including naval operations where reliability and security are paramount.
APPROVED FOR PUBLIC RELEASE
Document Details
- Document Type: DoD Grant Award
- Publication Date: Nov 09, 2024
- Source ID: N000142412693
Entities
People
- J. Zico Kolter
Organizations
- Carnegie Mellon University
- Office of Naval Research
- United States Navy