The ICSI+ Multilingual Sentence Segmentation System

Abstract

The ICSI+ multilingual sentence segmentation system, evaluated here on English and Mandarin broadcast news transcriptions produced by automatic speech recognizers, is a joint effort of ICSI, SRI, and UT Dallas. Our approach uses hidden event language models to exploit lexical information, and maximum entropy and boosting classifiers to exploit lexical, prosodic, speaker change, and syntactic information. We demonstrate that the proposed methodology, which includes pitch- and energy-related prosodic features, performs significantly better than a baseline system that uses only words and simple pause features. Furthermore, the improvements are consistent across both languages, and no language-specific adaptation of the methodology is necessary. The best results were achieved by combining hidden event language models with a boosting-based classifier that, to our knowledge, has not previously been applied to this task.

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2006
Accession Number
ADA459046

Entities

People

  • D. Hakkani-Tür
  • E. Shriberg
  • Jason M. Fung
  • L. Gottlieb
  • M. Zimmermann
  • N. Mirghafori
  • Y. Liu

Organizations

  • International Computer Science Institute

Tags

Communities of Interest

  • Autonomy
  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Applied Computer Science
  • Automata Theory
  • Automated Speech Recognition
  • Boundaries
  • Computational Linguistics
  • Computational Science
  • Computer Languages
  • Computer Science
  • Computer Vision
  • Language
  • Linguistics
  • Machine Learning
  • Machine Translation
  • Natural Language Processing
  • Natural Languages
  • Recognition
  • Supervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Speech Processing/Speech Recognition