Auditory Salience in Complex Natural Scenes

Abstract

Project abstract (approved for public release)

Advances in sensor technologies make available massive streams of data that can be fused into actionable intelligence. While much focus has been put on motion and optical sensors for image/video understanding, audio streams from sensor networks undergo only rudimentary processing that either detects simple events, such as transients from explosions or gunshots, or focuses solely on human speech and voice communications. In contrast, a human listener is capable of intelligent sound analytics that afford a rich inference of the scene, including interpretation of the human voice in the context of other sound events, as well as establishing causal relationships between sensor data and objects in the environment. While current technologies face the challenge of the manpower and computational power needed to make sense of this data, biological networks address the problem of sensory overload using mechanisms of selective attention and adaptive processing. Humans (and animals) are very accomplished at parsing their environment and constantly navigate a nonlinear space of possible behavioral strategies while adapting to different sensory information. Nonetheless, human performers remain limited by capacity, training/familiarity and speed. By better understanding these limitations, we can design systems that enable a partnership of automated and biological systems that is superior to either alone, all while augmenting optical analytics with intelligent audio capability.

The current project advances our understanding of how biological and artificial networks parse rich audio signals in order to detect and interpret salient sound events in complex environments. With recent advances in artificial intelligence and greater interest in audio tagging, interactive technologies and smart speakers, a number of systems based on deep neural networks have recently been developed for sound event detection. These models rely on large amounts of annotated data for training, and the data curation itself is heavily informed by video information and metadata as well as highly trained annotators. To date, the largest database available is Google's AudioSet, which annotates sound events based on YouTube videos. Biases introduced by using video and metadata information compound the problem of audio analytics, as they often diverge from natural perception, where human listeners react to and interpret sound events depending not only on the sensory signal itself, but also on the presence or absence of visual cues as well as the context, attentional state, and behavioral goals of the task at hand. These context- and behavior-dependent interpretations go far beyond what is currently feasible with conventional (albeit sophisticated) frameworks based on one-to-one mappings (signal-to-label).

The proposed work centers on the hypothesis that an interplay of attentional control and memory predictions acts as executive feedback that regulates responses in sensory cortex via mechanisms of rapid neural plasticity, shaping the representation of auditory objects in line with different behavioral goals. In other words, the nature of the sensory representation itself is heavily modulated by cognitive feedback, which closely ties sensory encoding to behavioral goals driven by the temporal dynamics of sounds as they unfold over time. The proposed effort evolves in three directions: (1) investigate behavioral and neural mechanisms engaged by this network in human listeners; (2) develop a computational infrastructure informed by experimental findings that integrates cortical processing, attentional selectivity, memory expectations and salience discrimination; (3) scale the model to applications of interest to ONR and DoD, and use it as a springboard for generating testable hypotheses and further developing large-scale infrastructures for intelligent audio analytics.

Document Details

Document Type
DoD Grant Award
Publication Date
Jan 12, 2023
Source ID
N000142312050

Entities

People

  • Mounya Elhilali

Organizations

  • Johns Hopkins University
  • Office of Naval Research
  • United States Navy

Tags

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments
  • Speech Processing/Speech Recognition
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • Space