Spark Memory Management

Apache Spark, a distributed data analytics framework, has been widely adopted in recent years to process huge amounts of data. Spark takes advantage of in-memory (RAM) computing, which speeds up data processing by reducing I/O overhead. However, because RAM is expensive and limited, Spark users and developers must manage memory space deliberately. We therefore research new memory allocation mechanisms that handle memory management without burdening users.
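
For context, the minimal sketch below (with a placeholder input path and a local master, both illustrative) shows the in-memory caching that Spark exposes to applications: an RDD is persisted once and reused by later actions, avoiding repeated disk reads.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-example")      // illustrative application name
      .master("local[*]")            // local mode, for demonstration only
      .getOrCreate()
    val sc = spark.sparkContext

    // "/data/events.txt" is a placeholder input path.
    val lines  = sc.textFile("/data/events.txt")
    val parsed = lines.map(_.split(",")).persist(StorageLevel.MEMORY_ONLY)

    // Both actions below reuse the in-memory copy of `parsed`
    // instead of re-reading and re-parsing the file from disk.
    val total  = parsed.count()
    val errors = parsed.filter(_.headOption.contains("ERROR")).count()

    println(s"total=$total, errors=$errors")
    spark.stop()
  }
}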


SparkCache: Intermediate Data Caching Optimization for Multi-Stage and Parallel Big Data Frameworks - [IEEE CLOUD'18]

In the era of big data and cloud computing, large amounts of data are generated from user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such data processing at scale. Specifically, Spark leverages distributed memory to cache the intermediate results, represented as Resilient Distributed Datasets (RDDs). This gives Spark an advantage over other parallel frameworks for implementations of iterative machine learning and data mining algorithms, by avoiding repeated computation or hard disk accesses to retrieve RDDs. By default, caching decisions are left at the programmer’s discretion, and the LRU policy is used for evicting RDDs when the cache is full. However, when the objective is to minimize total work, LRU is woefully inadequate, leading to arbitrarily suboptimal caching decisions. In this work, we design an algorithm for multi-stage big data processing platforms to adaptively determine and cache the most valuable intermediate datasets that can be reused in the future. Our solution automates the decision of which RDDs to cache: this amounts to identifying nodes in a directed acyclic graph (DAG) representing computations whose outputs should persist in memory.
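
The general idea can be illustrated with a toy greedy sketch, not the algorithm proposed in the paper: assume each DAG node carries an estimated output size, a recomputation cost, and a reuse count, and pick the nodes whose caching avoids the most recomputation per megabyte under a memory budget.

// Hypothetical DAG node model for illustration only: each intermediate RDD has an
// estimated output size, a recomputation cost, and a count of downstream consumers.
// This is a toy greedy heuristic, not the algorithm proposed in the paper.
case class RddNode(id: String, sizeMB: Double, computeCostSec: Double, reuseCount: Int)

object CachePlannerSketch {
  // Rank reused nodes by the recomputation work that caching would avoid,
  // normalized by memory footprint, then greedily fill the cache budget.
  def selectForCaching(dag: Seq[RddNode], budgetMB: Double): Set[String] = {
    val ranked = dag
      .filter(_.reuseCount > 1)  // outputs used only once gain nothing from caching
      .sortBy(n => -((n.reuseCount - 1) * n.computeCostSec) / n.sizeMB)

    ranked.foldLeft((Set.empty[String], 0.0)) { case ((chosen, usedMB), n) =>
      if (usedMB + n.sizeMB <= budgetMB) (chosen + n.id, usedMB + n.sizeMB)
      else (chosen, usedMB)
    }._1
  }

  def main(args: Array[String]): Unit = {
    val dag = Seq(
      RddNode("raw",      sizeMB = 800, computeCostSec = 5,  reuseCount = 1),
      RddNode("features", sizeMB = 300, computeCostSec = 40, reuseCount = 4),
      RddNode("labels",   sizeMB = 50,  computeCostSec = 10, reuseCount = 3)
    )
    // Prints the ids chosen under a 400 MB budget.
    println(selectForCaching(dag, budgetMB = 400))
  }
}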


ATuMm: Auto Tuning Memory Manager in Apache Spark

Apache Spark is an in-memory analytics framework that has been widely adopted in industry and research. Two memory managers, Static and Unified, are available in Spark to allocate memory for caching Resilient Distributed Datasets (RDDs) and executing tasks. However, we found that the Static memory manager lacks flexibility, while the Unified memory manager puts heavy pressure on the garbage collection of the JVM on which Spark resides. To address these issues, we design an auto-tuning memory manager (named ATuMm) to support dynamic memory allocation, taking into account both memory demands and the latency introduced by garbage collection. We implement our new memory manager in Spark 2.2.0 and evaluate it through experiments in a real Spark cluster. Our experimental results show that ATuMm reduces total garbage collection time and thus further improves the performance (i.e., reduces the latency) of Spark applications compared to the existing Spark memory management solutions.
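
For reference, the sketch below shows how the two built-in memory managers are selected and tuned through configuration in Spark 2.x. The application names are placeholders and the fraction values are the documented defaults; ATuMm's dynamic adjustment of the storage/execution boundary is not reproduced here.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object MemoryManagerConfigDemo {
  def main(args: Array[String]): Unit = {
    // Unified memory manager (the default since Spark 1.6): execution and storage
    // share one region sized by spark.memory.fraction, with spark.memory.storageFraction
    // of that region protected from eviction by execution.
    val unifiedConf = new SparkConf()
      .setAppName("unified-mm-demo")              // placeholder name
      .setMaster("local[*]")                      // local mode, for demonstration only
      .set("spark.memory.fraction", "0.6")
      .set("spark.memory.storageFraction", "0.5")

    // Static (legacy) memory manager: storage and shuffle/execution get fixed,
    // separate slices of the heap.
    val staticConf = new SparkConf()
      .setAppName("static-mm-demo")               // placeholder name
      .setMaster("local[*]")
      .set("spark.memory.useLegacyMode", "true")
      .set("spark.storage.memoryFraction", "0.6")
      .set("spark.shuffle.memoryFraction", "0.2")

    // Launch with one of the two configurations (Unified shown here).
    val spark = SparkSession.builder().config(unifiedConf).getOrCreate()
    println("memory fraction = " + spark.conf.get("spark.memory.fraction"))
    spark.stop()
  }
}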