Hard CPU Related Failures and System Activity: Measurement and Modelling.
Abstract
This paper describes the measurement and analysis of hard CPU and memory errors, and of system activity, at the Stanford Linear Accelerator Center (SLAC) computational facility. Nearly 25 percent of the errors were estimated to be permanent. The occurrence of a failure was found to be strongly correlated with the level and type of workload preceding it. For example, it is shown that the risk of a permanent error increases in a non-linear fashion with the amount of interactive processing. The observed tendency is present in three years of load data. This observation is significant because a load-failure relationship found at the CPU level must, in our view, be considered fundamental. In addition, the fact that most of the errors are permanent provides new information on these error types, viz. their load-dependent behavior. Our analysis procedure, used on the SLAC data, has been validated on an artificially created database seeded with failures. (Author)
Document Details
- Document Type: Technical Report
- Publication Date: May 01, 1983
- Accession Number: ADA130821
Entities
People
- David J. Rossetti
- Ravishankar K. Iyer
Organizations
- Stanford University