On Learning from Collective Data

Abstract

In many machine learning problems and application domains, the data are naturally organized in groups. For example, a video sequence is a group of images, an image is a group of patches, a document is a group of paragraphs or words, and a community is a group of people. We call such data collective data. In this thesis, we study how and what we can learn from collective data. Machine learning usually focuses on individual objects, each of which is described by a feature vector and studied as a point in some metric space. When approaching collective data, researchers often reduce the groups to vectors to which traditional methods can be applied. We, on the other hand, develop machine learning methods that respect the collective nature of the data and learn from the groups directly. We take several different approaches to this learning problem. When a group consists of unordered discrete data points, it can naturally be characterized by its sufficient statistic, the histogram. For this case we develop efficient methods, based on matrix and tensor factorization, to address outliers and temporal effects in the data. To learn from groups that contain multi-dimensional real-valued vectors, we develop both generative methods based on hierarchical probabilistic models and discriminative methods using group kernels based on new divergence estimators. With these tools, we can accomplish various tasks such as classification, regression, clustering, anomaly detection, and dimensionality reduction on collective data. We further consider the practical side of the divergence-based algorithms. To reduce their time and space requirements, we evaluate and identify methods that can effectively reduce the size of the groups with little impact on accuracy. We also propose the conditional divergence, along with an efficient estimator, to correct sampling biases that might be present in the data.
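
As a rough illustration of the histogram representation mentioned in the abstract, the sketch below turns groups of unordered discrete items into count vectors over a shared vocabulary. It is only a minimal example under stated assumptions: the function names (`build_vocabulary`, `group_histogram`, `histogram_matrix`) and the toy data are hypothetical and are not taken from the thesis, and the factorization methods themselves are not shown.

```python
from collections import Counter


def build_vocabulary(groups):
    """Return the sorted list of distinct values seen across all groups."""
    return sorted({item for group in groups for item in group})


def group_histogram(group, vocabulary):
    """Count how often each vocabulary entry occurs in one group.

    The order of items inside the group is ignored, so two groups with
    the same items in a different order get the same histogram.
    """
    counts = Counter(group)
    return [counts[value] for value in vocabulary]


def histogram_matrix(groups):
    """Stack one histogram per group into a groups-by-vocabulary count matrix."""
    vocabulary = build_vocabulary(groups)
    return vocabulary, [group_histogram(g, vocabulary) for g in groups]


# Toy data (made up for illustration): three "documents" as bags of words.
groups = [
    ["a", "b", "a", "c"],
    ["b", "b", "c"],
    ["c", "a", "b", "a"],  # same items as the first group, different order
]
vocabulary, matrix = histogram_matrix(groups)
```

A count matrix of this shape, with one row per group, is the kind of input that matrix factorization methods typically operate on; adding a time axis would give a tensor instead.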

Document Details

Document Type: Technical Report
Publication Date: Dec 01, 2013
Accession Number: ADA598234

Entities

People

  • Liang Xiong

Organizations

  • Carnegie Mellon University

Tags

Communities of Interest

  • Autonomy
  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Artificial Intelligence Software
  • Bayesian Networks
  • Computational Science
  • Computer Vision
  • Data Mining
  • Databases
  • Dimensionality Reduction
  • Information Processing
  • Information Science
  • Kernel Functions
  • Machine Learning
  • Mathematical Filters
  • Monte Carlo Method
  • Network Science
  • Parallel Computing
  • Probabilistic Models
  • Statistical Algorithms

Fields of Study

  • Computer science

Readers

  • Computer Vision
  • Neural Network Machine Learning
  • Team-Based Human-Centered Cognitive Task Decision Making and Information Performance

Technology Areas

  • AI & ML
  • Space