CPU Performance Counter-Based Problem Diagnosis for Software Systems

Abstract

Faults that occur in distributed software systems, such as e-commerce applications, are often costly in terms of lost revenue, but can be difficult to discover manually. Problem diagnosis tools attempt to detect, and often localize, such faults. The goal is to detect problems soon after they occur, and to notify an operator or automatically fix the issue as quickly as possible. Designing such tools involves trade-offs. A good tool should be accurate and should have low overhead, to minimize adverse effects on the monitored system, and these two goals often conflict. Application-level data can often lead to very accurate, fine-grained diagnoses, but at a high cost in terms of reduced system performance. Metrics collected from the operating system are less expensive to collect, but are usually suitable only for coarse fault localization, typically to a specific machine. This thesis explores a data source that has had only limited use in problem diagnosis tools: CPU performance counters. Instrumentation based on these performance counters can be collected with very low overhead, and provides information with expressive power similar to data collected from the operating system. This data source is evaluated experimentally, in conjunction with a variety of simple analysis algorithms, via synthetic fault-injection experiments against a realistic 3-tier auction web application.

Document Details

Document Type
Technical Report
Publication Date
Sep 01, 2009
Accession Number
ADA507019

Entities

People

  • Keith A. Bare

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Autonomy
  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Artificial Intelligence Software
  • Bayesian Networks
  • Computer Programming
  • Computer Programs
  • Computers
  • Data Analysis
  • Databases
  • Detection
  • Dimensionality Reduction
  • Electronic Commerce
  • Information Science
  • Kernel Functions
  • Machine Learning
  • Network Science
  • Operating Systems
  • Relational Database Management Systems
  • Supervised Machine Learning

Fields of Study

  • Computer science
  • Engineering
