Building a Large Annotated Corpus of English: The Penn Treebank

Abstract

As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, with over 3 million words of that material assigned skeletal grammatical structure. This material now includes a fully hand-parsed version of the classic Brown corpus. About one half of the papers at the ACL Workshop on Using Large Text Corpora this past summer were based on the materials generated by this grant.

Document Details

Document Type: Technical Report
Publication Date: Apr 30, 1993
Accession Number: ADA273556

Entities

People

  • Mitch Marcus

Organizations

  • Moore School of Electrical Engineering

Tags

Communities of Interest

  • Air Platforms
  • Energy and Power Technologies
  • Human Systems

DTIC Thesaurus Topics

  • Accuracy
  • Computational Linguistics
  • Computer Programming
  • Computers
  • Consistency
  • Consortiums
  • Databases
  • Engineering
  • Errors
  • Grammars
  • Information Science
  • Language
  • Linguistics
  • Materials
  • Natural Language Processing
  • Natural Languages
  • Workshops

Readers

  • Neural Network Machine Learning
  • Technical Research and Report Writing