Leveraging Machine Readable Dictionaries in Discriminative Sequence Models

Abstract

Many natural language processing tasks make use of a lexicon, typically the words collected from some annotated training data along with their associated properties. We demonstrate here the utility of corpus-independent lexicons derived from machine readable dictionaries. Lexical information is encoded in the form of features in a Conditional Random Field tagger, providing improved performance in cases where: i) limited training data is available, ii) the data is caseless, and iii) the genre or domain of the test data differs from that of the training data. We show substantial error reductions, especially on unknown words, for the tasks of part-of-speech tagging and shallow parsing, achieving up to a 20% error reduction on Penn TreeBank part-of-speech tagging and up to a 15.7% error reduction for shallow parsing using the CoNLL 2000 data. Our results point towards a simple but effective methodology for increasing the adaptability of text processing systems: training models with annotated data in one genre, augmented with general lexical information or lexical information pertinent to the target genre (or domain).

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 2006
Accession Number: AD1106871

Entities

People

Ben Wellner
Marc Vilain

Organizations

MITRE Corporation
