Resolving the enigma of Factor Analysis

Abstract

Vintage Factor Analysis (VFA) was developed nearly a century ago and is still widely used in many of the social sciences. However, among statisticians trained in statistics departments, VFA is not nearly so popular. This discrepancy is caused by an enigma. The conventional wisdom among statisticians says that the factor rotation, a key ingredient in VFA, is unidentifiable. However, the factor rotation is popular among practitioners because it frequently makes the factors much more interpretable. How can something that aids interpretability be unidentifiable? Aim 1 of this proposal resolves the enigma. It studies a simple VFA approach, Varimax-rotated Principal Components Analysis, referred to herein as PCA + Varimax. Varimax, developed by Kaiser in 1958, is the most popular approach to computing a factor rotation. Although many statisticians have never heard of Varimax, it comes preloaded in the base R packages (like kmeans), and in some academic communities it is so widely used that it is sometimes not properly cited (also like kmeans). The proposed Theorem 1 explains why Varimax is so useful: PCA + Varimax can estimate a broad class of semiparametric factor models. Aim 1 of the proposal is to prove this theorem. Historically, statisticians have mistakenly equated VFA with the Gaussian Factor Model. However, many factor models are non-Gaussian, and VFA can estimate them. Semiparametric factor modeling provides a unifying approach to several popular models, including the Stochastic Blockmodel, its generalizations, and Latent Dirichlet Allocation (i.e., topic modeling). In these models, the factors are interpretable quantities (communities, clusters, topics, etc.). However, the principal components only estimate the column space of the factors; that is, each principal component is a linear combination of the interpretable quantities in the factors. Varimax applied to the principal components can isolate the factors. These technical results are within reach due to recent developments that demonstrate elementwise convergence for the eigenvectors of a random graph.
Aim 2 builds a modern framework for multivariate statistics from a key observation in Aim 1. Key observation: if the factors and the loadings are generated as independent, non-Gaussian random variables, then an additional matrix, which we call the middle B matrix, is identifiable. This matrix describes how the factors are related, yet it is missing from many models (the Stochastic Blockmodel is an exception). Moreover, the middle B matrix allows the factors and the loadings to be sparse, even when the principal components are not. This highlights a key limitation in current approaches to sparse PCA. Aim 2 will provide a suite of models, algorithms, and theory for (i) generalized factor models, (ii) big k factor analysis, and (iii) semiparametric tensor modeling, each of which is fundamentally motivated by the middle B matrix. These technical results will be within reach due to the developments in Aim 1.
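
To make the PCA + Varimax procedure described in the abstract concrete, here is a minimal sketch in Python. It is not the proposal's implementation. It assumes the common SVD-based iteration for the Varimax criterion, and the function names (`varimax`, `pca_varimax`), the simulated model of the form Z B Y^T plus noise, and all simulation settings are illustrative assumptions rather than details from the abstract.

```python
import numpy as np


def varimax(U, max_iter=500, tol=1e-10):
    """Return a k x k orthogonal R that (locally) maximizes the Varimax
    criterion of U @ R, using the common SVD-based iteration."""
    n, k = U.shape
    R = np.eye(k)
    prev = 0.0
    for _ in range(max_iter):
        L = U @ R
        # Proportional to the gradient of the Varimax criterion in R.
        G = U.T @ (L**3 - L * (L**2).sum(axis=0) / n)
        P, s, Qt = np.linalg.svd(G)
        R = P @ Qt  # orthogonal polar factor of G
        total = s.sum()
        if total < prev * (1 + tol):
            break
        prev = total
    return R


def pca_varimax(A, k):
    """Top-k principal components of A (columns centered), then Varimax."""
    A_centered = A - A.mean(axis=0)
    U, _, _ = np.linalg.svd(A_centered, full_matrices=False)
    U_k = U[:, :k]
    R = varimax(U_k)
    return U_k @ R, R


# Illustrative simulation (assumed settings): sparse, non-Gaussian factors Z
# and loadings Y, related through a "middle" matrix B.
rng = np.random.default_rng(0)
n, d, k = 300, 200, 3
Z = rng.exponential(size=(n, k)) * (rng.random((n, k)) < 0.3)
Y = rng.exponential(size=(d, k)) * (rng.random((d, k)) < 0.3)
B = np.eye(k) + 0.2
A = Z @ B @ Y.T + 0.1 * rng.standard_normal((n, d))

Z_hat, R = pca_varimax(A, k)

# Absolute correlations between true factors (rows) and estimates (columns).
corr = np.corrcoef(Z.T, Z_hat.T)[:k, k:]
print(np.round(np.abs(corr), 2))
```

If the rotation isolates the factors in this simulation, each row of the printed matrix should have one entry close to 1, since factors are only recoverable up to column order and sign. The unrotated principal components span the same column space but would generally mix the factors.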

Document Details

Document Type: DoD Grant Award
Publication Date: Jul 09, 2020
Source ID: W911NF2010051

Entities

People

  • Karl Rohe

Organizations

  • Army Contracting Command
  • United States Army
  • University of Wisconsin–Madison

Tags

Fields of Study

  • Computer science

Readers

  • Educational Psychology
  • Neural Network Machine Learning
  • Regression Analysis

Technology Areas

  • Space