A Comparison of Language Representation Models on Small Text Corpora of Scientific and Technical Documents

Abstract

Text mining for the identification of emerging technology is becoming increasingly important as the number of scientific and technical documents grows. However, algorithms for developing text mining models require a large amount of training data, which carries heavy costs associated with data annotation and model development. The need to avoid these costs has in part motivated recent work in text mining, which indicates value in leveraging language representation models (LRMs) on domain-specific text corpora for domain-specific tasks. However, these results are demonstrated predominantly on large text corpora, and so do not address concerns about the ability of LRMs to transfer to domains where training data may be scarce. We therefore benchmarked the performance of LRMs on identifying quantities and units of measure from text when the number of training samples is small.

Document Details

Document Type
Technical Report
Publication Date
Mar 01, 2023
Accession Number
AD1209395

Entities

People

  • Michael T. Gorczyca
  • Peter F. David
  • Tavish M. Mcdonald
  • Thadeous A. Goodwyn

Organizations

  • Air Force Research Laboratory

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Air Force
  • Air Force Research Laboratories
  • Algorithms
  • Artificial Intelligence
  • Artificial Intelligence Software
  • Computational Science
  • Deep Learning
  • Embedding
  • Emerging Technology
  • Fluid Mechanics
  • Language
  • Machine Learning
  • Materials
  • Measurement
  • Military Research
  • Neural Networks
  • Text Mining

Fields of Study

  • Computer science

Readers

  • Computational Linguistics
  • Economics
  • Polymer Science and Engineering