Multi-group machine learning: theory and algorithms

Abstract

Much concern about trust in data-driven systems stems from the use of aggregate evaluation metrics that obscure the structure of errors. These metrics are typically averages, taken over a broad population of instances, where an "instance" might represent a particular individual or a particular situation involving an individual. However, a finding of good average-case performance offers little assurance about what the system may do for any particular instance, and the lack of assurances about finer-grained behavior can thus lead to mistrust. The same is true beyond individual instances: the behavior of the system on smaller subpopulations of instances can be very different from the behavior on larger subpopulations, due to the disparity in influence on the overall average.

The proposed project is to advance the theory and algorithms for machine learning problems that require performance guarantees at subpopulation levels, a general class of problems that we call multi-group learning. We will leverage techniques from statistical machine learning to design new algorithms that achieve the multi-group agnostic learning guarantee with optimal sample complexity. We will build on preliminary work that uses decision lists and online learning to construct and analyze multi-group learning algorithms. We will also study the statistical properties of these methods in contexts beyond multi-group learning, such as transfer learning.

There are at least a few different use-cases for multi-group learning. The first is as a drop-in replacement for machine learning algorithms used in data-driven systems whose performance has high-stakes consequences, especially at the level of individual subgroups within the broader population.
A second use-case for multi-group learning is related to a particular form of the "distribution shift" problem, where training data is assumed to be drawn from a broad population (typically because training datasets are often "datasets of convenience") but the performance of a learned predictor is only really relevant over some (a priori unknown) subgroup. This is the problem of "hidden stratification" that was previously studied in the literature on machine learning applications to medical imaging.

Developing trustworthy machine learning algorithms is of great importance in naval applications, where users of data-driven systems must be able to reason about the outputs of these systems. Reasoning about average-case performance over very broad populations is unlikely to be sufficient for users to make critical decisions, and it is almost certainly not informative enough to guide (semi- or fully-) automated decision-making. The proposed project will therefore address a major limitation of existing machine learning algorithms and theory.

Approved for Public Release.

Document Details

Document Type
DoD Grant Award
Publication Date
Nov 09, 2024
Source ID
N000142412700

Entities

People

  • Daniel Hsu

Organizations

  • Office of Naval Research
  • Trustees of Columbia University in the City of New York
  • United States Navy

Tags

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments
  • Applied Combinatorial Optimization and Logic Circuit Design
  • Distributed Systems and Data Platform Development

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms
  • AI & ML - Neural Networks