Multiple Outliers in Linear Regression: Advances in Detection Methods, Robust Estimation, and Variable Selection

Abstract

Empirical evidence suggests that unusual or outlying observations are much more prevalent in data sets than one might expect: 5 to 10% on average in many industries. This research addresses multiple outliers in the linear regression model. Standard diagnostic techniques based on an ordinary least squares (OLS) fit are reliable for a single outlier or a few outliers, but they can fail to identify multiple outliers. The parameter estimates, diagnostic quantities, and model inferences obtained from the contaminated data set can differ significantly from those obtained with the clean data. The researcher requires a dependable method to identify and accommodate these multiple outliers. This research tests both direct methods, based on algorithms, and indirect methods, based on robust regression estimators, for identifying multiple outliers. A comprehensive Monte Carlo simulation study evaluates the impact of outlier density and geometry, regressor variable dimension, and outlying distance on numerous published methods. The performance study uses a designed experiment approach and focuses on outlier configurations likely to be encountered in practice. The results for each scenario provide insight into the performance and limitations of each technique, and recommendations are given for each technique. OLS is the optimal regression estimator under a set of assumptions on the distribution of the error term and predictor variables. Compound robust regression estimators have been proposed as alternatives when some OLS assumptions fail. Compound estimators can accommodate multiple outliers and limit the influence of observations with remote levels of the predictor variables. This research proposes a new compound estimator that is more effective than currently published methods for extreme observations in X space and in high dimension. This research also addresses the variable selection problem for compound robust regression estimators.
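
The abstract's claim that OLS diagnostics can miss multiple outliers can be illustrated with a small sketch. It is not taken from the report. The synthetic data, the cutoff of 2.5, and the function names are assumptions chosen for illustration. The Theil-Sen fit with a MAD scale is a simple stand-in for a robust estimator, not one of the methods the report evaluates or proposes.

```python
import numpy as np

def ols_studentized(X, y):
    """Internally studentized residuals from an OLS fit.

    X must already contain the intercept column.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages h_ii are the squared row norms of Q from a reduced QR of X.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    s = np.sqrt(resid @ resid / (n - p))
    return resid / (s * np.sqrt(1.0 - h))

def theil_sen_scaled_residuals(x, y):
    """Residuals from a Theil-Sen line, divided by a MAD-based scale.

    Stand-in robust fit (an assumption of this sketch): the slope is the
    median of all pairwise slopes, the intercept is the median of y - slope*x.
    """
    i, j = np.triu_indices(len(x), k=1)
    dx = x[j] - x[i]
    keep = dx != 0
    slope = np.median((y[j] - y[i])[keep] / dx[keep])
    intercept = np.median(y - slope * x)
    resid = y - (intercept + slope * x)
    scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    return resid / scale

# Hypothetical data: 40 clean points on y = 2 + 3x plus small noise,
# and a cluster of 6 outliers at remote x values, far below the true line.
rng = np.random.default_rng(0)
x_clean = np.linspace(0.0, 10.0, 40)
y_clean = 2.0 + 3.0 * x_clean + rng.normal(0.0, 0.3, size=40)
x_out = np.linspace(19.5, 20.5, 6)
y_out = np.full(6, 10.0)

x = np.concatenate([x_clean, x_out])
y = np.concatenate([y_clean, y_out])
is_outlier = np.concatenate([np.zeros(40, dtype=bool), np.ones(6, dtype=bool)])

X = np.column_stack([np.ones_like(x), x])
cutoff = 2.5  # illustrative flagging threshold, an assumption of this sketch

ols_flags = np.abs(ols_studentized(X, y)) > cutoff
robust_flags = np.abs(theil_sen_scaled_residuals(x, y)) > cutoff

print("planted outliers flagged by OLS diagnostics:", int(ols_flags[is_outlier].sum()))
print("planted outliers flagged by robust residuals:", int(robust_flags[is_outlier].sum()))
```

In this configuration the outlier cluster pulls the OLS line toward itself and inflates the residual scale, so the planted points are masked. The robust fit follows the bulk of the data, which leaves the cluster with large scaled residuals. A robust fit can also flag some clean points, so the sketch compares only how each method treats the planted outliers.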

Document Details

Document Type
Technical Report
Publication Date
May 01, 1999
Accession Number
ADA367633

Entities

People

  • James Walter Wisnowski

Organizations

  • Arizona State University

Tags

Communities of Interest

  • Energy and Power Technologies
  • Human Systems

DTIC Thesaurus Topics

  • Air Force
  • Algorithms
  • Computational Complexity
  • Data Analysis
  • Data Mining
  • Data Science
  • Detection
  • Genetic Algorithms
  • Information Science
  • Knowledge Management
  • Monte Carlo Method
  • Regression Analysis
  • Statistical Algorithms
  • Statistical Analysis
  • Surveys
  • Test And Evaluation
  • Test Methods

Readers

  • Statistical Inference
  • Systems Analysis and Design

Technology Areas

  • AI & ML
  • AI & ML - Bayesian Inference
  • Space