Comparing Evaluation Metrics for Sentence Boundary Detection

Abstract

In recent NIST evaluations on sentence boundary detection, a single error metric was used to describe performance. Additional metrics, however, are available for such tasks, in which a word stream is partitioned into subunits. This paper compares alternative evaluation metrics, including the NIST error rate, classification error rate per word boundary, precision and recall, ROC curves, DET curves, precision-recall curves, and the area under these curves, and discusses the advantages and disadvantages of each. Unlike many studies in machine learning, we use real data for a real task. We find benefit from using curves in addition to a single metric. Furthermore, we find that data skew has an impact on the metrics, and that differences among system outputs are more visible in precision-recall curves. The results are expected to improve our understanding of evaluation metrics and should generalize to similar language processing tasks.
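
As an illustration of how the single-number metrics named in the abstract relate to one another, the following is a minimal sketch and not the report's own scoring code. It assumes each word boundary carries a binary label (1 = sentence boundary, 0 = none), that the NIST error rate is the number of misses plus false alarms divided by the number of reference boundaries, and that the classification error rate divides the same error count by the total number of word boundaries. The function name and the toy label sequences are hypothetical.

```python
def boundary_metrics(reference, hypothesis):
    """Score per-word-boundary labels (1 = sentence boundary, 0 = none).

    Assumed definitions (not taken from the report itself):
      NIST error rate           = (misses + false alarms) / reference boundaries
      classification error rate = (misses + false alarms) / all word boundaries
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(reference, hypothesis))
    tp = sum(1 for r, h in pairs if r == 1 and h == 1)  # correctly detected
    fp = sum(1 for r, h in pairs if r == 0 and h == 1)  # false alarms
    fn = sum(1 for r, h in pairs if r == 1 and h == 0)  # misses

    n_ref = tp + fn      # boundaries in the reference
    n_hyp = tp + fp      # boundaries proposed by the system
    nan = float("nan")   # returned when a denominator is zero

    return {
        "nist_error_rate": (fp + fn) / n_ref if n_ref else nan,
        "classification_error_rate": (fp + fn) / len(pairs) if pairs else nan,
        "precision": tp / n_hyp if n_hyp else nan,
        "recall": tp / n_ref if n_ref else nan,
    }


# Hypothetical toy data: 10 word boundaries, 3 of them true sentence boundaries.
reference = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
hypothesis = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1]
metrics = boundary_metrics(reference, hypothesis)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the same two errors give a much larger NIST error rate than classification error rate, because only 3 of the 10 boundaries are positive. This is one way to see why skewed data affects the metrics differently.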

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2007
Accession Number
ADA463058

Entities

People

  • Elizabeth Shriberg
  • Yang Liu

Organizations

  • University of Texas at Dallas

Tags

Communities of Interest

  • Human Systems

DTIC Thesaurus Topics

  • Abstracts
  • Artificial Intelligence Software
  • Boundaries
  • Classification
  • Computational Science
  • Computer Languages
  • Computer Science
  • Detection
  • Hidden Markov Models
  • Language
  • Machine Learning
  • Machine Translation
  • Markov Models
  • Natural Language Processing
  • Precision
  • Probability
  • Recognition

Fields of Study

  • Computer science

Readers

  • Computational Modeling and Simulation
  • Regression Analysis
  • Speech Processing/Speech Recognition

Technology Areas

  • AI & ML
  • AI & ML - Machine Translation