Comparing Evaluation Metrics for Sentence Boundary Detection

Abstract

In recent NIST evaluations on sentence boundary detection, a single error metric was used to describe performance. Additional metrics, however, are available for such tasks, in which a word stream is partitioned into subunits. This paper compares alternative evaluation metrics, including the NIST error rate, classification error rate per word boundary, precision and recall, ROC curves, DET curves, precision-recall curves, and area under the curves, and discusses the advantages and disadvantages of each. Unlike many studies in machine learning, we use real data for a real task. We find benefit from using curves in addition to a single metric. Furthermore, we find that data skew has an impact on the metrics, and that differences among system outputs are more visible in precision-recall curves. The results are expected to help us better understand evaluation metrics and should generalize to similar language processing tasks.
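The single-number metrics named in the abstract can be illustrated with a minimal sketch. It assumes each inter-word boundary carries a binary label (1 for a sentence boundary, 0 otherwise), that the reference and system output are aligned to the same word stream, and that the NIST error rate is the number of insertions plus deletions divided by the number of reference boundaries. The function name `boundary_metrics` and the toy labels are hypothetical and are not taken from the report.

```python
import math


def boundary_metrics(reference, hypothesis):
    """Compute single-number metrics for sentence boundary detection.

    Both arguments are equal-length sequences of 0/1 labels, one per
    inter-word boundary (an assumption of this sketch).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must be the same length")

    pairs = list(zip(reference, hypothesis))
    tp = sum(1 for r, h in pairs if r == 1 and h == 1)  # correct boundaries
    fp = sum(1 for r, h in pairs if r == 0 and h == 1)  # insertions
    fn = sum(1 for r, h in pairs if r == 1 and h == 0)  # deletions

    def ratio(num, den):
        # Return NaN rather than raising when a denominator is empty.
        return num / den if den else math.nan

    return {
        # Errors normalized by the number of reference boundaries.
        "nist_error_rate": ratio(fp + fn, tp + fn),
        # Errors normalized by the number of word boundaries.
        "classification_error_rate": ratio(fp + fn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


# Toy example: 10 word boundaries, 3 of which are true sentence boundaries.
reference = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
hypothesis = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1]
scores = boundary_metrics(reference, hypothesis)
```

In this toy case the same two errors give a NIST error rate of 2/3 but a per-boundary classification error rate of 0.2, because most word boundaries are not sentence boundaries. This is one way the data skew mentioned in the abstract can affect the metrics.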

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 2007
Accession Number: ADA463058

Entities

People

Elizabeth Shriberg
Yang Liu

Organizations

University of Texas at Dallas
