A Novel System Level Approach to Fault Tolerance in Distributed Memory Multicomputers

Abstract

The objective of this research was to develop new, cost-effective techniques for fault tolerance in multicomputer architectures. The requirements for high performance and fault tolerance are seemingly contradictory: parallel architectures and algorithms developed for high performance attempt to achieve maximum utilization of each processor, while fault tolerance requires redundant computations and checks to ensure that the results of the [...] applied to highly parallel multicomputer architectures. Our unique approach to achieving fault tolerance in multicomputer parallel architectures is to use an algorithm-based fault tolerance (ABFT) technique, an on-line system-level method for detecting faults, followed by a system-level approach to reconfiguration and recovery of the parallel processor system.
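The abstract does not say which ABFT encoding the report uses. As a minimal sketch of the general idea only, the example below assumes the classic checksum-encoded matrix multiplication: one input carries an extra row of column sums, the other an extra column of row sums, and the product is then checked against its own checksums. All function names are hypothetical and chosen for illustration; nothing here is taken from the report.

```python
# Illustrative ABFT sketch: checksum-encoded matrix multiplication.
# Assumption: this encoding is a textbook example, not the report's method.

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add_column_checksum_row(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum_column(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]

def find_inconsistencies(c_full):
    """Return (bad_rows, bad_cols): data rows and columns whose
    stored checksum disagrees with the recomputed sum."""
    n = len(c_full) - 1       # number of data rows
    m = len(c_full[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    return bad_rows, bad_cols

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# The product of the encoded inputs carries both checksums with it.
C_full = matmul(add_column_checksum_row(A), add_row_checksum_column(B))
clean = find_inconsistencies(C_full)

# Simulate a fault: corrupt one element of the result.
C_full[0][1] += 1
faulty = find_inconsistencies(C_full)
```

In this sketch a single corrupted element shows up as exactly one inconsistent row and one inconsistent column, which locates the error. The example uses integers so that exact comparison is safe; a floating-point version would need a tolerance when comparing checksums.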

Document Details

Document Type: Technical Report
Publication Date: Sep 01, 1994
Accession Number: ADA284729

Entities

People

Prithviraj Banerjee

Organizations

University of Illinois Urbana–Champaign
