Failure Analysis and Modeling of a Multicomputer System
Abstract
This thesis describes the results of an extensive measurement-based analysis of real error data collected from a 7-machine DEC VaxCluster multicomputer system. In addition to evaluating basic system error and failure characteristics, we develop reward models to analyze the impact of failures and errors on the system. The results show that, although 98% of errors in the shared resources are recovered, these errors still result in 48% of all system failures. The analysis of rewards shows that the expected reward rate for the VaxCluster decreases to 0.5 in 100 days for a 3-out-of-7 model, which is well over 100 times that for a 7-out-of-7 model. A comparison of the reward rates for a range of k-out-of-n models indicates that the maximum increase in reward rate (0.25) occurs in going from the 6-out-of-7 model to the 5-out-of-7 model. The analysis also shows that software errors have the lowest reward (0.2, vs. 0.91 for network errors). The large loss in reward rate for software errors arises because a large proportion (94%) of software errors lead to failure. In comparison, the high reward rate for network errors is due to fast recovery from a majority of these errors (the median recovery duration is 0 seconds).
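As a rough illustration of the k-out-of-n idea referred to in the abstract, the sketch below computes the probability that at least k of n machines are up. It is not the thesis's reward model: it assumes, purely for illustration, that machines fail independently and that each has the same steady-state availability. The function name `k_out_of_n_availability` and the value `p_up = 0.95` are hypothetical and do not come from the measured data.

```python
from math import comb


def k_out_of_n_availability(n: int, k: int, p_up: float) -> float:
    """Probability that at least k of n machines are up.

    Assumption (not from the thesis): machines are independent and
    each is up with the same probability p_up.
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 <= p_up <= 1.0:
        raise ValueError("p_up must be a probability in [0, 1]")
    # Sum the binomial probabilities of exactly i machines being up,
    # for every i that meets the k-machine requirement.
    return sum(
        comb(n, i) * p_up**i * (1.0 - p_up) ** (n - i)
        for i in range(k, n + 1)
    )


if __name__ == "__main__":
    n = 7
    p_up = 0.95  # hypothetical per-machine availability
    # Relaxing the requirement from 7-out-of-7 down to 3-out-of-7
    # can only keep or raise the probability that the system is usable.
    for k in range(n, 2, -1):
        print(f"{k}-out-of-{n}: {k_out_of_n_availability(n, k, p_up):.6f}")
```

Under these assumptions the value never decreases as k is lowered, which matches the qualitative direction of the comparison in the abstract. The measured reward rates reported there depend on the observed error and recovery behavior and cannot be reproduced from this simplified calculation.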
Document Details
- Document Type: Technical Report
- Publication Date: Feb 01, 1990
- Accession Number: ADA219653
Entities
People
- Sujatha S. Subramani
Organizations
- University of Illinois Urbana–Champaign