Parallel Formulations of Tree-Projection Based Sequence Mining Algorithms

Abstract

Discovery of sequential patterns is becoming increasingly useful and essential in many scientific and commercial domains. Enormous sizes of available datasets and possibly large number of mined patterns demand efficient, scalable, and parallel algorithms. Even though a number of algorithms have been developed to efficiently parallelize frequent pattern discovery algorithms that are based on the candidate-generation-and-counting framework, the problem of parallelizing the more efficient projection-based algorithms has received relatively little attention and existing parallel formulations have been targeted only toward shared-memory architectures. The irregular and unstructured nature of the task-graph generated by these algorithms and the fact that these tasks operate on overlapping sub-databases makes it challenging to efficiently parallelize these algorithms on scalable distributed-memory parallel computing architectures. In this paper we present and study a variety of distributed-memory parallel algorithms for a tree-projection-based frequent sequence discovery algorithm that are able to minimize the various overheads associated with load imbalance, database overlap, and interprocessor communication. Our experimental evaluation on a 32 processor IBM SP show that these algorithms are capable of achieving good speedups, substantially reducing the amount of the required work to find sequential patterns in large databases.

Open PDF

Document Details

Document Type
Technical Report
Publication Date
Jan 20, 2003
Accession Number
ADA439587

Entities

People

  • George Karypis
  • Valerie Guralnik

Organizations

  • University of Minnesota

Tags

Communities of Interest

  • Materials and Manufacturing Processes

DTIC Thesaurus Topics

  • Algorithms
  • Computations
  • Computer Programming
  • Computer Science
  • Computers
  • Data Mining
  • Data Sets
  • Databases
  • Decomposition
  • Dynamic Loads
  • Engineering
  • Frequency
  • High Performance Computing
  • Parallel Computing
  • Parallel Processing
  • Sequences
  • Workload

Fields of Study

  • Computer science
  • Engineering

Readers

  • Distributed Systems and Data Platform Development
  • Parallel and Distributed Computing.