A Novel System Level Approach to Fault Tolerance in Distributed Memory Multicomputers

Abstract

The objective of this research was to develop new, cost-effective techniques for fault tolerance in multicomputer architectures. The requirements for high performance and fault tolerance are seemingly contradictory: parallel architectures and algorithms developed for high performance attempt to achieve maximum utilization of each processor, while fault tolerance requires redundant computations and checks to ensure that results are correct, an overhead that is costly when applied to highly parallel multicomputer architectures. Our approach to fault tolerance in multicomputer parallel architectures is to use algorithm-based fault tolerance (ABFT), an on-line, system-level method for detecting faults, followed by a system-level approach to reconfiguration and recovery of the parallel processor system.
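
The abstract does not describe the report's specific ABFT algorithms. As a rough illustration of the general idea only, the sketch below uses checksum-encoded matrix multiplication, a commonly cited textbook example of ABFT, and is not taken from the report. All function names are hypothetical, and a single corrupted element is assumed.

```python
def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append to each row of b one extra entry holding that row's sum."""
    return [row + [sum(row)] for row in b]


def find_faults(c_full, tol=1e-9):
    """Compare data against checksums; return (bad_rows, bad_cols)."""
    n = len(c_full) - 1        # number of data rows
    m = len(c_full[0]) - 1     # number of data columns
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    return bad_rows, bad_cols


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# The product of the encoded operands carries its own row and column checksums.
c_full = matmul(add_column_checksum(a), add_row_checksum(b))

# Simulate a fault by corrupting one element of a copy of the result.
faulty = [row[:] for row in c_full]
faulty[1][0] += 5

print(find_faults(c_full))   # ([], [])
print(find_faults(faulty))   # ([1], [0])
```

The intersection of the inconsistent row and column locates the corrupted element, so detection costs one extra row and column rather than a full duplicate computation. Reconfiguration and recovery, which the abstract also mentions, are outside the scope of this sketch.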

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 1994
Accession Number
ADA284729

Entities

People

  • Prithviraj Banerjee

Organizations

  • University of Illinois Urbana–Champaign

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Algorithms
  • Application Software
  • Computational Science
  • Computations
  • Computer Programming
  • Computer Science
  • Computers
  • Detection
  • Differential Equations
  • Electronic Mail
  • Equations
  • Error Analysis
  • Fault Tolerance
  • Fault Tolerant Computing
  • High Performance Computing
  • Parallel Computing
  • Parallel Processing

Fields of Study

  • Computer science
  • Engineering

Readers

  • Parallel and Distributed Computing
  • Systems Analysis and Design