Supporting Social Data Observatory with Customizable Index Structures on HBase - Architecture and Performance

Abstract

The intensive research activities in social data analysis in recent years suggest the necessity and great potential of a public social data observatory. To effectively support a social data observatory, the storage platform must satisfy its special requirements for loading and storage of terabyte-level datasets, as well as efficient evaluation of queries involving analysis of the text of millions of social updates. Traditional inverted indexing techniques do not meet such requirements because they target text retrieval use cases. To address these problems, we propose a general indexing framework, IndexedHBase, to build specially customized index structures for facilitating efficient queries, and employ the HBase system for distributed data storage. IndexedHBase is used to support the Truthy system, which collects and analyzes data obtained through the Twitter streaming API. To handle the special queries in Truthy, we develop a parallel query evaluation strategy that can explore the customized index structures efficiently. We evaluate the performance of IndexedHBase on FutureGrid, and compare it with Riak, a widely adopted commercial NoSQL database system. The results show that IndexedHBase provides a data loading speed that is 6 times faster than Riak, and is significantly more efficient in evaluating queries involving large result sets.

Document Details

Document Type: Technical Report
Publication Date: Jan 01, 2013
Accession Number: ADA603195

Entities

People

  • Andrew Younge
  • Clayton Davis
  • Emilio Ferrara
  • Evan Roth
  • Filippo Menczer
  • Judy Qiu
  • Karissa McKelvey
  • Xiaoming Gao

Organizations

  • Indiana University

Tags

Communities of Interest

  • Energy and Power Technologies

DTIC Thesaurus Topics

  • Batch Processing
  • Big Data
  • Case Studies
  • Cloud Storage
  • Computations
  • Computer Programs
  • Data Analysis
  • Data Processing
  • Data Rate
  • Data Storage Systems
  • Databases
  • Information Science
  • Observatories
  • Online Communications
  • Parallel Computing
  • Parallel Processing
  • Social Media

Fields of Study

  • Computer science

Readers

  • Agent-Based Social Robotics and Mobile-Assisted Learning in Virtual Environments
  • Database Systems and Applications
  • Information Retrieval