Fast Anomaly Discovery Given Duplicates

Abstract

Given a large cloud of multi-dimensional points, and an off-the-shelf outlier detection method, why does it take a week to finish? After careful analysis, we discovered that duplicate points create subtle issues, that the literature has ignored: if dmax is the multiplicity of the most overplotted point, typical algorithms are quadratic on dmax. For graph-related outlier detection, all the satellites of a 'star' node will have identical features, and thus create over-plotting with dmax being the highest degree; due to power law degree distributions, this may be huge, for real graph data. We propose several ways to eliminate the problem; we report wall-clock times and our time savings and we show that our methods give either exact results, or highly accurate approximate ones.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
Dec 01, 2012
Accession Number
ADA580209

Entities

People

  • Christos Faloutsos
  • Danai Koutra
  • Jay-yoon Lee
  • U. Kang

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Biomedical

DTIC Thesaurus Topics

  • Algorithms
  • Anomaly Detection
  • Big Data
  • Change Detection
  • Computations
  • Computer Science
  • Computers
  • Data Mining
  • Data Sets
  • Detection
  • Health Care
  • Social Media
  • Social Networking Services
  • Triangles
  • Two Dimensional

Fields of Study

  • Computer science

Readers

  • Applied Combinatorial Optimization and Logic Circuit Design.
  • Distributed Systems and Data Platform Development

Technology Areas

  • Space