LARGE FILES IN LINGUISTIC COMPUTING

Abstract

A typical linguistic job will bring together a number of files, each very large by the standards of everyday computing: a body of text, a dictionary and a grammar. The grammar, if it is anything but a very simple one, will contain a large number of elementary items of information of different kinds, each related to others in a number of different ways. This is what it means to say that the file has a lot of structure. The dictionary may also contain grammatical codes, which may consist of characters from the alphabet of one or other of the languages or may be something altogether different. If the dictionary contains alternatives to which probabilities are assigned, then these will probably be in the form of floating-point numbers. This is what it is like for a file to contain many different kinds of information. Computer files (texts, dictionaries, grammars) have to be changed, corrected, searched, given new structures and manipulated in a host of other ways. This can only be done if we know exactly where everything is in the file, that is, if we have some means of addressing each item and each related set of items. Suppose we have a Japanese-English dictionary organized by words rather than morphemes. This would not necessarily be the best way to organize such a dictionary for machine use, but it will serve adequately as an illustration. Each entry in the dictionary will have two main parts, one for the Japanese word and one for the English. Each of these in turn may have a number of sections, one for each of the forms that the word may take when inflected. (Extracted)
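
The dictionary layout described in the abstract (entries with a Japanese part and an English part, each divided into sections for inflected forms, with every item addressable) can be pictured with a small modern sketch. This is an illustration only, not the report's own file format: the names Section, Entry, Address and resolve, the field layout, the grammatical codes and the sample words are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    """One inflected form of a word (hypothetical layout)."""
    form: str                 # the inflected form itself
    code: str = ""            # grammatical code, if any (made-up values)
    probability: float = 1.0  # weight of this alternative, a floating-point number


@dataclass
class Entry:
    """A dictionary entry with a Japanese part and an English part."""
    japanese: List[Section] = field(default_factory=list)
    english: List[Section] = field(default_factory=list)


# An address names one item: (entry index, part name, section index).
Address = Tuple[int, str, int]


def resolve(dictionary: List[Entry], address: Address) -> Section:
    """Follow an address down to the single section it names."""
    entry_index, part, section_index = address
    if part not in ("japanese", "english"):
        raise KeyError(f"unknown part: {part}")
    entry = dictionary[entry_index]
    return getattr(entry, part)[section_index]


# Sample data invented for illustration (romanized Japanese).
dictionary = [
    Entry(
        japanese=[Section("taberu", "V"), Section("tabeta", "V-PAST")],
        english=[Section("eat", "V"), Section("ate", "V-PAST")],
    )
]
```

With this sketch, resolve(dictionary, (0, "english", 1)) returns the section whose form is "ate". The address tuple stands in for the abstract's requirement that we know exactly where everything is in the file; a real system of the period would have had to map such addresses onto physical storage, which this sketch does not attempt.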

Document Details

Document Type
Technical Report
Publication Date
May 01, 1965
Accession Number
AD0615301

Entities

People

  • Martin Kay

Organizations

  • RAND Corporation

Tags

DTIC Thesaurus Topics

  • Addressing
  • Alphabets
  • Computers
  • Dictionaries
  • Digital Information
  • Grammars
  • Language
  • Linguistics
  • Morphology (Linguistics)
  • Personality
  • Probability
  • Social Sciences
  • Standards

Readers

  • Computer Science
  • Library and Information Science
  • Speech Processing/Speech Recognition