A Large-scale Distributed Indexed Learning Framework for Data that Cannot Fit into Memory

Abstract

This project addresses three major problems in distributed learning for big data. 1) Learning a classifier when the data contain many samples that do not improve model quality yet cost substantial I/O and memory to process. A Block Coordinate Descent method combined with Approximate Nearest Neighbor (ANN) search to select active samples in dual mode was shown to outperform the state of the art. 2) Complex query search, where sending the query to every local machine is very costly. Decomposing the reference patterns into multiple resolutions made distributed kNN/kFN pattern matching very efficient. 3) Distributed learning from unlimited unlabeled data streams that many clients would otherwise need to send to a server to learn a classifier. Integrating three learning techniques (online, semi-supervised, and active learning) with selective sampling that minimizes communication between the server and the clients solved this problem.
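The abstract does not give the algorithm for problem 3, so the following is only a minimal sketch of the general idea of selective sampling, not the report's method. It assumes a linear model, a margin-based uncertainty rule, and a perceptron-style update; the names `selective_stream`, `oracle`, `threshold`, and `lr` are hypothetical and chosen for illustration. A sample is forwarded (and its label requested) only when the current model is uncertain about it, which is what keeps client-server communication low.

```python
import random


def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def selective_stream(stream, oracle, dim, threshold=0.5, lr=0.5):
    """Process an unlabeled stream, forwarding only uncertain samples.

    Assumed setup (not from the report): the client holds a copy of the
    linear model w and sends a sample to the server only when |w.x| is
    below `threshold`. The server then asks `oracle` for the label and
    applies a perceptron-style update.
    Returns (weights, number of samples sent).
    """
    w = [0.0] * dim
    sent = 0
    for x in stream:
        if abs(dot(w, x)) >= threshold:
            continue  # confident prediction: nothing is communicated
        sent += 1
        y = oracle(x)  # label requested only for uncertain samples
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w, sent


if __name__ == "__main__":
    # Toy stream: the (hypothetical) true label is the sign of the first feature.
    rng = random.Random(0)
    stream = []
    for _ in range(200):
        y = rng.choice([-1, 1])
        stream.append([y * (1.0 + rng.random()), rng.uniform(-0.2, 0.2)])
    w, sent = selective_stream(stream, lambda x: 1 if x[0] > 0 else -1, dim=2)
    print(f"sent {sent} of {len(stream)} samples")
```

On well-separated toy data like this, only the first few samples fall inside the uncertainty band, so almost the whole stream stays on the client. Real streams would need a semi-supervised component for the confident samples, which this sketch omits.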

Document Details

Document Type
Technical Report
Publication Date
Mar 27, 2015
Accession Number
ADA616935

Entities

People

  • Shou-De Lin

Organizations

  • National Taiwan University

Tags

Communities of Interest

  • Autonomy

DTIC Thesaurus Topics

  • Algorithms
  • Big Data
  • Computer Science
  • Data Mining
  • Data Science
  • Data Sets
  • Distance Learning
  • Dual Mode
  • Information Science
  • Machine Learning
  • Mobile Devices
  • Network Science
  • Pattern Recognition
  • Sampling
  • Semi-Supervised Learning
  • Supervised Machine Learning

Fields of Study

  • Computer science

Readers

  • Neural Network Machine Learning
  • Operations Research
  • Parallel and Distributed Computing

Technology Areas

  • AI & ML
  • AI & ML - Machine Learning Algorithms
  • AI & ML - Neural Networks