Failure Analysis and Modeling of a Multicomputer System

Abstract

This thesis describes the results of an extensive measurement-based analysis of real error data collected from a 7-machine DEC VAXcluster multicomputer system. In addition to evaluating basic system error and failure characteristics, we develop reward models to analyze the impact of failures and errors on the system. The results show that, although 98% of errors in the shared resources are recovered from, these errors account for 48% of all system failures. The analysis of rewards shows that the expected reward rate for the VAXcluster decreases to 0.5 in 100 days for a 3-out-of-7 model, which is well over 100 times that for a 7-out-of-7 model. A comparison of the reward rates for a range of k-out-of-n models indicates that the maximum increase in reward rate (0.25) occurs in going from the 6-out-of-7 model to the 5-out-of-7 model. The analysis also shows that software errors have the lowest reward rate (0.2, versus 0.91 for network errors). The large loss in reward rate for software errors occurs because a large proportion (94%) of software errors lead to failure. In comparison, the high reward rate for network errors reflects fast recovery from a majority of these errors (the median recovery duration is 0 seconds).

Document Details

Document Type
Technical Report
Publication Date
Feb 01, 1990
Accession Number
ADA219653

Entities

People

  • Sujatha S. Subramani

Organizations

  • University of Illinois Urbana–Champaign

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Computer Science
  • Computers
  • Data Analysis
  • Failure Analysis
  • Failure Mode And Effect Analysis
  • Frequency
  • Information Science
  • Linear Accelerators
  • Markov Models
  • Markov Processes
  • Measurement
  • Probability
  • Recovery
  • Reliability
  • Statistical Analysis
  • Statistics
  • Steady State

Fields of Study

  • Computer science

Readers

  • Computational Modeling and Simulation
  • Mathematical Modeling and Probability Theory
  • Systems Analysis and Design