Towards Large Language Models Robust to Adversarial Attacks

Abstract

This proposal addresses the critical vulnerability of large language models (LLMs) to adversarial attacks, which can compromise their reliability and security, particularly when they are deployed in safety-critical systems. Although LLMs are widely applied across many domains, including naval operations, they remain susceptible to specially crafted input sequences that can manipulate their behavior. Traditional methods for enhancing robustness, such as adversarial training and certified robustness, are impractical for LLMs because they are computationally expensive and scale poorly. This project will develop a new class of adversarially robust training and inference methods for LLMs designed to overcome these challenges. By exploring techniques such as slow adversarial training and representation control, the project seeks to substantially advance the state of the art in LLM robustness. If successful, this work could have far-reaching implications: it would secure LLMs against malicious inputs and enable their safe and effective deployment in mission-critical environments, including naval operations, where reliability and security are paramount.

APPROVED FOR PUBLIC RELEASE

Document Details

Document Type: DoD Grant Award
Publication Date: Nov 09, 2024
Source ID: N000142412693

Entities

People

J. Zico Kolter

Organizations

Carnegie Mellon University
Office of Naval Research
United States Navy
