CPU Performance Counter-Based Problem Diagnosis for Software Systems
Abstract
Faults that occur in distributed software systems, such as e-commerce applications, are often costly in terms of lost revenue, but can be difficult to discover manually. Problem diagnosis tools attempt to detect, and often localize, such faults. The goal is to detect problems soon after they occur, and then to notify an operator or automatically fix the issue as quickly as possible. Designing such tools involves trade-offs. A good tool should be accurate and should have low overhead, to minimize adverse effects on the monitored system, but these two goals often conflict. Application-level data can lead to very accurate, fine-grained diagnoses, but at a high cost in reduced system performance. Metrics collected from the operating system are less expensive to collect, but are usually suitable only for coarse fault localization, typically to a specific machine. This thesis explores a data source that has seen only limited use in problem diagnosis tools: CPU performance counters. Instrumentation based on these performance counters can be collected with very low overhead, and provides information with expressive power similar to that of data collected from the operating system. This data source is evaluated experimentally, in conjunction with a variety of simple analysis algorithms, via synthetic fault-injection experiments against a realistic 3-tier auction web application.
Document Details
- Document Type: Technical Report
- Publication Date: Sep 01, 2009
- Accession Number: ADA507019
Entities
People
- Keith A. Bare
Organizations
- Carnegie Mellon University