Social Choice for AI Alignment

Abstract

Approved for public release.

Generative AI has made huge strides in the past two years, with large language models (LLMs) such as ChatGPT emerging as the technology's poster child. Key to this progress is a surprisingly simple innovation, reinforcement learning from human feedback (RLHF). In a typical implementation of RLHF, human feedback is provided in the form of pairwise comparisons, and a reward function is learned from these by fitting a statistical model.

Is this the "right" way of aggregating individual preferences towards a socially desirable reward function, however? To answer this question, one can draw on social choice theory, a field that studies collective decision making through a mathematical lens. A common approach in social choice analyzes the desirability of aggregation methods through their satisfaction of certain axioms that capture notions of consensus, fairness, and economic efficiency. When examined through this normative lens, current RLHF methods fall short.

The high-level goal of this proposal, therefore, is to design and evaluate novel RLHF methods that satisfy desirable axiomatic properties by drawing on social choice theory. The proposed research is organized into three thrusts, each of which is designed to address a critical gap in the current RLHF paradigm:

Randomized rules: The first thrust explores the design of RLHF methods that use randomized rules. The idea is to create distributions over reward functions that are (approximately) pairwise calibrated, in the sense that they reflect the preferences of the population over pairs of alternatives, while excluding outliers with extreme or repugnant views.

Personalization: The second thrust asks how to design personalized RLHF methods that cater to individual preferences while avoiding the risks associated with full personalization, such as the creation of echo chambers or the reinforcement of harmful views. By using a budgeted social choice framework, this thrust investigates methods that allow for a controlled degree of personalization, where multiple reward functions can cater to different groups or individuals, without sacrificing the collective welfare or ethical principles shared by society.

Independence of clones: The third thrust aims to alleviate the paradoxical behavior of current RLHF methods: when presented with very strong but similar alternatives, the method may prefer weaker alternatives. In social choice theory, immunity to this type of behavior is known as independence of clones. The proposed research seeks to design RLHF methods that are independent of clones by adapting voting rules that are known to possess this property, such as instant-runoff voting and ranked pairs. Such methods will safeguard the alignment process from distortions caused by redundant or equivalent options.

Future Naval Relevance. The current DoD Data, Analytics, and AI Adoption Strategy emphasizes the importance of responsible AI that is well aligned in the sense of, for example, reducing bias. The proposed research directly addresses this need through the development of novel alignment capabilities. It is worth noting that specialized AI systems need not be aligned with the general population. In particular, the US Navy could use the proposed methodology to align its AI systems with the preferences of Navy stakeholders who have domain expertise. Even a smaller group of specialists would exhibit diverse views, which should be reflected in the outcome of the alignment process.
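
The independence-of-clones property named in the third thrust can be illustrated with a small voting example. The sketch below is not the proposal's RLHF method: it uses plurality voting as a simple stand-in for a rule that is not clone-independent, and compares it with instant-runoff voting, which the abstract cites as clone-independent. The ballot profile, the candidate names (`A`, `A2`, `B`), and the helper functions are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Each ballot is (ranking from most to least preferred, number of voters).
# Hypothetical profile: 60 voters favor A-type answers, 40 favor B.
NO_CLONE = [(("A", "B"), 60), (("B", "A"), 40)]

# Same electorate after adding A2, a near-duplicate ("clone") of A.
# Clones stay adjacent in every ranking; only their internal order varies.
WITH_CLONE = [
    (("A", "A2", "B"), 35),
    (("A2", "A", "B"), 25),
    (("B", "A", "A2"), 40),
]


def first_choice_tallies(ballots, remaining):
    """Count each ballot for its highest-ranked candidate still in play."""
    tallies = Counter({c: 0 for c in remaining})
    for ranking, count in ballots:
        top = next(c for c in ranking if c in remaining)
        tallies[top] += count
    return tallies


def plurality_winner(ballots):
    """Winner is the candidate with the most first-place votes."""
    candidates = set(ballots[0][0])
    tallies = first_choice_tallies(ballots, candidates)
    # sorted() makes tie-breaking deterministic (alphabetical).
    return max(sorted(candidates), key=lambda c: tallies[c])


def irv_winner(ballots):
    """Instant-runoff: repeatedly eliminate the candidate with the fewest
    first-place votes until someone holds a strict majority."""
    remaining = set(ballots[0][0])
    while True:
        tallies = first_choice_tallies(ballots, remaining)
        total = sum(tallies.values())
        leader = max(sorted(remaining), key=lambda c: tallies[c])
        if 2 * tallies[leader] > total or len(remaining) == 1:
            return leader
        loser = min(sorted(remaining), key=lambda c: tallies[c])
        remaining.remove(loser)


if __name__ == "__main__":
    print("plurality, no clone:  ", plurality_winner(NO_CLONE))
    print("plurality, with clone:", plurality_winner(WITH_CLONE))
    print("IRV, no clone:        ", irv_winner(NO_CLONE))
    print("IRV, with clone:      ", irv_winner(WITH_CLONE))
```

In this profile, adding the clone `A2` splits the first-place votes of the stronger option, so plurality switches its winner from `A` to the weaker `B`. Instant-runoff eliminates `A2` first, transfers its votes to `A`, and still selects `A`.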

Document Details

Document Type: DoD Grant Award
Publication Date: Feb 24, 2025
Source ID: N000142512153

Entities

People

Ariel Procaccia

Organizations

Office of Naval Research
President and Fellows of Harvard College
United States Navy
