De-bloating Managed Runtimes for Scalable Data-Intensive Systems
Abstract
One of the key challenges in Big Data is how to develop systems that can scale to massive amounts of data with relatively small amounts of resources. The mainstream approach is to enable distributed data processing, resulting in the development of distributed platforms that utilize large numbers of machines. Many of these platforms are written in managed languages such as Java, C#, or Scala. While these languages simplify development, their runtime systems have a high cost, which cannot be amortized by increasing the number of machines. Poor performance on each node reduces the scalability of the whole cluster: a large number of machines are needed to process a small dataset, leading to excessive use of resources and increased communication overhead. This project explores an alternative direction, that is, how to effectively optimize the managed runtime to improve the performance and scalability of data processing on each machine.
We propose a five-step research agenda that exploits language, compiler, and systems support to improve the performance and scalability of managed Big Data programs. Our first observation is that the number of data objects in the heap must not grow proportionally with the data cardinality. Based on this observation, the first step of the project will investigate a new execution model that can statically bound the number of data objects in the heap. In particular, we propose to allocate (arbitrarily many) data items in off-heap, native memory, and to create only a statically bounded pool of heap objects that are continually reused to represent data items. An iteration-based mechanism is used to reclaim data items while the GC scans only the regular heap, leading to reduced header/pointer overhead and memory management costs.
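The execution model described above can be illustrated with a minimal Java sketch. It is not the project's implementation: the class names (`OffHeapSketch`, `Region`, `PointView`) and the two-integer record layout are hypothetical, chosen only for illustration. Data items are stored in a direct `ByteBuffer` (native memory outside the regular heap), a single reusable heap object is re-pointed at each item in turn, and all items are reclaimed at once when an iteration ends.

```java
import java.nio.ByteBuffer;

final class OffHeapSketch {
    // Hypothetical record layout: two 4-byte ints (x, y) per data item.
    static final int RECORD_BYTES = 8;

    /** Native-memory region holding the data items of one iteration. */
    static final class Region {
        private final ByteBuffer buf;
        private int count = 0;

        Region(int capacityRecords) {
            // Direct buffers are allocated outside the regular Java heap.
            buf = ByteBuffer.allocateDirect(capacityRecords * RECORD_BYTES);
        }

        /** Stores one item; throws IndexOutOfBoundsException when the region is full. */
        int add(int x, int y) {
            int off = count * RECORD_BYTES;
            buf.putInt(off, x);
            buf.putInt(off + 4, y);
            return count++;
        }

        int size() { return count; }

        /** Iteration-based reclamation: every item is released in one step. */
        void endIteration() { count = 0; }
    }

    /** Reusable heap object that represents whichever item it is bound to. */
    static final class PointView {
        private Region region;
        private int index;

        PointView bind(Region r, int i) { region = r; index = i; return this; }
        int x() { return region.buf.getInt(index * RECORD_BYTES); }
        int y() { return region.buf.getInt(index * RECORD_BYTES + 4); }
    }

    /** Processes all items while keeping exactly one view object on the heap. */
    static long sumOfProducts(Region region, PointView view) {
        long total = 0;
        for (int i = 0; i < region.size(); i++) {
            view.bind(region, i);
            total += (long) view.x() * view.y();
        }
        return total;
    }

    public static void main(String[] args) {
        Region region = new Region(1024);
        PointView view = new PointView();
        region.add(2, 3);
        region.add(4, 5);
        System.out.println(sumOfProducts(region, view)); // prints 26
        region.endIteration();                            // items reclaimed together
        System.out.println(region.size());                // prints 0
    }
}
```

In this sketch the number of heap objects stays constant (one `Region`, one `PointView`) regardless of how many items are stored, which is the object-bounded property the paragraph describes. A fuller design would need a pool of views rather than a single one, plus support for variable-sized records.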
The second and third steps of the project will investigate compiler and runtime system support that can automatically enforce the object-bounded property by statically transforming and dynamically optimizing existing programs, respectively.
Another major performance issue is the overly parallel execution of data processing tasks, in which developers create a large number of threads to fully utilize the underlying parallelism. However, the excessive creation of objects and use of collections incurs tremendous runtime bloat that is duplicated in each thread created, making the system quickly hit the memory wall. The fourth step of the project will develop a novel language, ScaleJ, that uses dataflow semantics to support memory- and thread-oblivious development of data processing functions. ScaleJ allows developers to focus on high-level data processing logic without worrying about low-level memory and parallelism issues. The fifth step will develop an autotuning framework for ScaleJ that can safely and adaptively adjust the degree of parallelism by considering a variety of parameters, including data partitioning, runtime bloat, memory availability, GC effort, and parallelism.
Despite the large body of work on performance optimization for Big Data applications, none of the existing techniques have paid attention to performance issues that arise from the high cost of the managed runtime. As managed languages gain popularity in the development of Big Data systems, there is a pressing need to carefully study their performance problems and design novel techniques to address them. To the best of our knowledge, this project is the first attempt to improve Big Data scalability by optimizing managed runtime environments.
Impacts
Managed languages are widely used to implement Big Data systems. Inefficiencies in these languages have caused significant scalability problems.
This project will address this important problem and benefit the wide community that researches and uses Big Data systems.
The Navy is awash in data generated by sensors aboard ships, aircraft, and other platforms. In order to make full use of the data without much human involvement, it is critical to develop efficient Big Da
Document Details
- Document Type: DoD Grant Award
- Publication Date: Nov 09, 2018
- Source ID: N000141912009
Entities
People
- Harry Xu
Organizations
- Office of Naval Research
- United States Navy
- University of California, Los Angeles