Leveraging Machine Readable Dictionaries in Discriminative Sequence Models

Abstract

Many natural language processing tasks make use of a lexicon typically the words collected from some annotated training data along with their associated properties. We demonstrate here the utility of corpora-independent lexicons derived from machine readable dictionaries. Lexical information is encoded in the form of features in a Conditional Random Field tagger providing improved performance in cases where: i) limited training data is made available ii) the data is case-less and iii) the test data genre or domain is different than that of the training data. We show substantial error reductions, especially on unknown words, for the tasks of part-of-speech tagging and shallow parsing, achieving up to 20% error reduction on Penn TreeBank part-of-speech tagging and up to a 15.7% error reduction for shallow parsing using the CoNLL 2000 data. Our results here point towards a simple, but effective methodology for increasing the adaptability of text processing systems by training models with annotated data in one genre augmented with general lexical information or lexical information pertinent to the target genre (or domain).

Open PDF

Document Details

Document Type
Technical Report
Publication Date
Jan 01, 2006
Accession Number
AD1106871

Entities

People

  • Ben Wellner
  • Marc Vilain

Organizations

  • MITRE Corporation

Tags

Communities of Interest

  • Biomedical

DTIC Thesaurus Topics

  • Accuracy
  • Coding
  • Computational Linguistics
  • Computational Science
  • Data Set
  • Data Sets
  • Decoding
  • Dictionaries
  • Digital Data
  • Hidden Markov Models
  • Language
  • Linguistics
  • Machine Learning
  • Markov Models
  • Models
  • Natural Language Computing
  • Natural Language Processing
  • Natural Languages
  • Probabilistic Models
  • Probability
  • Standards
  • Test Sets
  • Text Processing

Readers

  • Computational Linguistics

Technology Areas

  • AI & ML
  • AI & ML - Machine Translation
  • AI & ML - Neural Networks