Lightweight Structure in Text

Abstract

Pattern matching is heavily used for searching, filtering, and transforming text, but existing pattern languages offer few opportunities for reuse. Lightweight structure is a new approach that solves the reuse problem. Lightweight structure has three parts: a model of text structure as contiguous segments of text, or regions; an extensible library of structure abstractions (e.g., HTML elements, Java expressions, or English sentences) that can be implemented by any kind of pattern or parser; and a region algebra for composing and reusing structure abstractions. Lightweight structure does for text pattern matching what procedure abstraction does for programming, enabling construction of a reusable library.

Lightweight structure has been implemented in LAPIS, a web browser/text editor that demonstrates several novel techniques:

  • Text constraints is a new pattern language for composing structure abstractions, based on the region algebra. Text constraint patterns are simple and high-level, and user studies have shown that users can generate and comprehend them.
  • Simultaneous editing uses multiple selections for repetitive text editing. Multiple selections are inferred from examples given by the user, drawing on the lightweight structure library to make fast, accurate, domain-specific inferences from very few examples. In user studies, simultaneous editing required only 1.26 examples per selection, approaching the 1-example ideal.
  • Outlier finding draws the user's attention to inconsistent selections or pattern matches, both possible false positives and possible false negatives. When integrated into simultaneous editing and tested in a user study, outlier finding reduced user errors.
  • Unix tools for structured text extend tools like grep and sort with lightweight structure, and the browser shell integrates a Unix command prompt into a web browser, offering new ways to build pipelines and automate web browsing.
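To make the region model in the abstract concrete, the following is a minimal Python sketch, not the LAPIS implementation. It assumes a region is a pair of character offsets, a structure abstraction is any function from text to a list of regions, and the algebra offers two containment operators. The names Region, from_regex, inside, and containing are hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, List, NamedTuple


class Region(NamedTuple):
    """A contiguous segment of text: [start, end) character offsets."""
    start: int
    end: int


# Assumption: a structure abstraction is any function mapping text to regions.
# It could wrap a regex, a parser, or hand-written code.
Abstraction = Callable[[str], List[Region]]


def from_regex(pattern: str) -> Abstraction:
    """Build an abstraction from a regular expression (hypothetical helper)."""
    compiled = re.compile(pattern)

    def find(text: str) -> List[Region]:
        return [Region(m.start(), m.end()) for m in compiled.finditer(text)]

    return find


def _contains(outer: Region, inner: Region) -> bool:
    return outer.start <= inner.start and inner.end <= outer.end


def inside(inner: Abstraction, outer: Abstraction) -> Abstraction:
    """Regions of `inner` that lie within some region of `outer`."""
    def find(text: str) -> List[Region]:
        outers = outer(text)
        return [r for r in inner(text) if any(_contains(o, r) for o in outers)]

    return find


def containing(outer: Abstraction, inner: Abstraction) -> Abstraction:
    """Regions of `outer` that enclose at least one region of `inner`."""
    def find(text: str) -> List[Region]:
        inners = inner(text)
        return [o for o in outer(text) if any(_contains(o, r) for r in inners)]

    return find


# Two small abstractions, then reuse by composition.
number = from_regex(r"\d{3}-\d{4}")
parenthesized = from_regex(r"\([^)]*\)")
number_in_parens = inside(number, parenthesized)

text = "call 555-1234 now (fax 555-9999)"
matches = [text[r.start:r.end] for r in number_in_parens(text)]
```

Because every operator takes abstractions and returns an abstraction, a composed pattern such as number_in_parens can itself be named and reused, which is the kind of reuse the abstract compares to procedure abstraction.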

Document Details

Document Type
Technical Report
Publication Date
May 01, 2002
Accession Number
ADA459021

Entities

People

  • Robert C. Miller

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Autonomy
  • C4I
  • Energy and Power Technologies
  • Ground and Sea Platforms

DTIC Thesaurus Topics

  • Artificial Intelligence
  • Automata Theory
  • Computational Science
  • Computer Languages
  • Computer Program Documentation
  • Computer Programming
  • Computer Programs
  • Computer Science
  • Computers
  • Information Science
  • Knowledge Management
  • Machine Learning
  • Operating Systems
  • Trees (Data Structures)
  • Two Dimensional
  • Web Browsers
  • Word Processors

Fields of Study

  • Computer science

Readers

  • Applied Combinatorial Optimization and Logic Circuit Design
  • Computational Linguistics
  • Database Systems and Applications

Technology Areas

  • AI & ML
  • AI & ML - Information Retrieval
  • AI & ML - Machine Learning Algorithms