Characterizing and diagnosing out of memory errors in MapReduce applications

https://doi.org/10.1016/j.jss.2017.03.013

Highlights

  • We performed an empirical study on 56 real-world MapReduce OOM errors.

  • We designed a memory profiler to profile MapReduce applications’ memory usage.

  • We designed two types of quantitative rules in the profiler to diagnose OOM errors.

  • We evaluated our memory profiler on 28 real-world OOM errors.

Abstract

Out of memory (OOM) errors are common and serious in MapReduce applications. Since MapReduce framework hides the details of distributed execution, it is challenging for users to pinpoint the OOM root causes. Current memory analyzers and memory leak detectors can only figure out what objects are (unnecessarily) persisted in memory but cannot figure out where the objects come from and why the objects become so large. Thus, they cannot identify the OOM root causes.

Our empirical study on 56 OOM errors in real-world MapReduce applications found that the OOM root causes are improper job configurations, data skew, and memory-consuming user code. To identify the root causes of OOM errors in MapReduce applications, we design a memory profiling tool Mprof. Mprof can automatically profile and quantify the correlation between a MapReduce application’s runtime memory usage and its static information (input data, configurations, user code). Mprof achieves this through modeling and profiling the application’s dataflow, the memory usage of user code, and performing correlation analysis on them. Based on this correlation, Mprof uses quantitative rules to trace OOM errors back to the problematic user code, data, and configurations.

We evaluated Mprof through diagnosing 28 real-world OOM errors in diverse MapReduce applications. Our evaluation shows that Mprof can accurately identify the root causes of 23 OOM errors, and partly identify the root causes of the other 5 OOM errors.

Introduction

As a representative big data framework, MapReduce (Dean and Ghemawat, 2004) provides users with a simple programming model and hides the parallel/distributed execution. This design helps users focus on the data processing logic, but burdens them when the applications running atop this framework generate runtime errors. Since MapReduce applications process large data in memory, out of memory (OOM) errors are common. For example, on StackOverflow.com, users have repeatedly asked why their MapReduce applications run out of memory (OOM, OOM, OOM). OOM errors are also serious, since they directly lead to application failures and cannot be tolerated by the MapReduce framework's fault-tolerance mechanisms (i.e., an OOM error will occur again if the failed map/reduce task is re-executed).

In general, a MapReduce application (job) can be represented as ⟨input data, configurations, user code⟩. The input data is usually stored as data blocks on the distributed file system. Before submitting an application to the MapReduce framework, users need to specify the application's configurations and user code. (1) Memory-related configurations, such as buffer size, define the size of framework buffers, which temporarily store intermediate data in memory. (2) Dataflow-related configurations, such as the partition function/number, affect the volume of data that flows into mappers or reducers. The partition function defines how to partition the output key/value records of mappers, while the partition number defines how many partitions will be generated. (3) User code refers to the user-defined functions, such as map(), reduce(), and the optional combine(). These user-defined functions generate in-memory computing results while processing the key/value records.
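To make the programming model concrete, the following is a minimal plain-Java simulation of the map/group/reduce pipeline described above (word counting). It deliberately avoids Hadoop dependencies; the class and helper names are our own, and only the shape of map() emitting ⟨k, v⟩ records and reduce() consuming a ⟨k, list(v)⟩ group follows the model.

```java
import java.util.*;

// A plain-Java sketch of the MapReduce programming model: map() emits
// <k, v> records, the "framework" groups them into <k, list(v)>, and
// reduce() aggregates each group. No Hadoop dependency; names are ours.
public class MiniMapReduce {

    // User-defined map(): split each input line into <word, 1> records.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // User-defined reduce(): sum the list(v) of one key group. A skewed
    // key with a huge list(v) is the "large <k, list(v)> group" pattern.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // The "framework": run map over every line, group by key, then reduce.
    static Map<String, Integer> run(List<String> input) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> kv : map(line))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b a"))); // prints {a=3, b=2}
    }
}
```

In a real job, map() and reduce() run in distributed task processes, and the grouping happens in framework buffers; this toy keeps everything in one process only to show where each piece of user code sits.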

While running, a MapReduce application goes through a map stage and a reduce stage (shown in Fig. 1). Each stage contains multiple map/reduce tasks (i.e., mappers and reducers), and each task runs as a process on a node. While running a MapReduce application, the framework buffers intermediate data in memory for better performance, and user code also stores intermediate computing results in memory. Once the required memory exceeds the memory limit of a map/reduce task, an OOM error will occur in the task. Fig. 1 shows two OOM errors that occur in a map task and a reduce task. When an OOM error occurs, users can only figure out what functions/methods are running from the OOM stack trace. However, this error message cannot directly reflect the OOM root causes.
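For reference, the memory-related and dataflow-related configurations mentioned above correspond, in Hadoop 1.x, to job properties such as the following fragment (the property names are from Hadoop 1.x; the values are illustrative only):

```xml
<!-- Illustrative Hadoop 1.x job configuration; values are examples. -->
<configuration>
  <!-- Memory-related: size of the map-side sort buffer (MB). -->
  <property>
    <name>io.sort.mb</name>
    <value>100</value>
  </property>
  <!-- Memory-related: heap limit of each map/reduce task JVM. -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <!-- Dataflow-related: the partition (reducer) number. -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
  </property>
</configuration>
```

Setting a buffer such as io.sort.mb too large, or a task heap (-Xmx) too small relative to the buffered data, is exactly the "improper job configuration" situation the paper studies.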

To understand the common root causes of OOM errors, we conducted a characteristic study on 56 real-world OOM errors in MapReduce applications collected from open forums, such as StackOverflow.com and the Hadoop mailing list. Our study found 3 types of common root causes covering 7 cause patterns. (1) 15% of the errors are caused by improper job configurations, which can lead to large buffered data or improper data partition. (2) 38% are caused by data skew, which can lead to unexpectedly large runtime data, including large ⟨k, list(v)⟩ groups and large single ⟨k, v⟩ records. (3) 75% are caused by memory-consuming user code, which loads large external data in memory or generates large intermediate/accumulated results (the categories overlap: 28% of the errors involve both memory-consuming user code and data skew).

Even with these summarized common causes, it is still challenging to automatically identify the root causes of OOM errors in a running MapReduce application. (1) The application's configurations do not directly affect the memory usage of the distributed map/reduce tasks. (2) User code can be written arbitrarily or automatically generated by high-level languages (e.g., the SQL-like Pig script (Apache Pig)), which forces us to treat user code as a black box. With memory analysis tools, such as Eclipse MAT (Eclipse Memory Analyzer), users can figure out what objects exist in memory but not where the objects come from or why they become so large. Static and dynamic memory leak detectors (Cherem et al., 2007; Xie and Aiken, 2005; Jump and McKinley, 2007; Xu and Rountev, 2008) can identify memory leaks (i.e., objects that are unnecessarily persisted in memory). However, OOM errors in big data applications are commonly caused by excessive memory usage, not memory leaks.

In this paper, we propose a memory profiling tool, Mprof. Mprof can automatically profile and quantify the correlation between a MapReduce application's runtime memory usage and its static information (input data, configurations, user code). Based on this correlation, Mprof uses quantitative rules to trace OOM errors back to the problematic user code, data, and configurations. Some limited manual effort is still required during cause identification, such as linking the identified cause patterns to the code semantics.

Mprof mainly solves three problems. (1) How can we figure out the correlation between an application's runtime memory usage and its static information? Mprof solves this by modeling and profiling the application's dataflow and the memory usage of user code, and then performing correlation analysis on them. (2) How can we figure out the correlation between the memory usage of black-box user code and its input data? We find that user objects (generated by user code) have different but fixed lifecycles, and that objects with different lifecycles are related to different parts of the input data. Based on this observation, we design a lifecycle-aware memory monitoring strategy that profiles and quantifies the correlation between different user objects and their related input data. (3) How can we identify the root causes based on this correlation? Two types of quantitative rules are designed: rules for user code identify the problematic user code and the error-related runtime data, while rules for dataflow identify the skewed data and improper configurations.
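The correlation-analysis idea in problem (1) can be pictured with a toy example. This is not Mprof's actual algorithm; it is a hypothetical sketch in which, given each task's input-record count and its peak heap usage, a Pearson coefficient quantifies how strongly memory usage tracks input size (a coefficient near 1 suggests memory grows with the processed data, e.g. data skew; near 0 suggests the cause lies elsewhere, e.g. loaded external data).

```java
// Toy sketch of correlation analysis between per-task dataflow and
// memory usage (NOT Mprof's actual implementation). The profile data
// below is hypothetical.
public class CorrelationSketch {

    // Pearson correlation coefficient of two equal-length samples.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx  += (x[i] - mx) * (x[i] - mx);
            vy  += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        // Hypothetical profile: records processed vs. peak heap (MB) per task.
        double[] records = {1e6, 2e6, 4e6, 8e6};
        double[] heapMb  = {120, 230, 460, 900};
        System.out.printf("r = %.3f%n", pearson(records, heapMb));
    }
}
```

A real profiler would of course correlate many more counters (buffered bytes, group sizes, user-object sizes per lifecycle) rather than a single pair of variables.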

We implemented Mprof in the latest Hadoop-1.2, and evaluated it on 28 real-world OOM errors in diverse Hadoop MapReduce applications, including raw MapReduce code, Apache Pig, Apache Hive, Apache Mahout, and Cloud9. Twenty of them are reproducible OOM errors from our empirical study; we reproduced these 20 because only they have detailed data characteristics, user code, and OOM stack traces. The other eight are new reproducible OOM errors collected from the Mahout/Hive/Pig JIRAs and open forums, which were not used in our empirical study. The results show that Mprof can precisely identify the root causes of 23 OOM errors, and partly identify the root causes of the other 5 (for these 5, the first OOM root cause is identified but the second is missed). Mprof is available on GitHub (Dia, 2017).

The main contributions of this paper are as follows:

  • An empirical study on 56 real-world OOM errors narrowed the root causes of OOM errors in MapReduce applications down to 3 kinds of common causes and 7 cause patterns.

  • A memory profiling tool is designed to profile and quantify the correlation between a MapReduce application’s runtime memory usage and its static information.

  • Two types of quantitative rules (11 rules in total) are designed to diagnose OOM errors in MapReduce applications.

  • An evaluation on 28 real-world OOM errors shows that Mprof can accurately identify the OOM root causes.

An earlier version of this work appeared at ISSRE 2015 (Xu et al., 2015). In this paper, we significantly extend the earlier version in three aspects. (1) We designed and implemented a memory profiler that can automatically profile the memory usage of MapReduce applications. (2) We designed two types of quantitative rules in the profiler to identify the root causes of OOM errors. (3) We evaluated our profiler on 28 real-world OOM errors in diverse MapReduce applications.

The rest of the paper is organized as follows. Section 2 introduces the background of MapReduce applications. Section 3 presents our empirical study results on the common root causes of OOM errors. Section 4 describes the design and implementation of Mprof. Section 5 describes the diagnosing procedure and diagnostic rules in Mprof. Section 6 presents the evaluation results. Section 7 discusses the limitations and generality of Mprof. Section 8 lists the related work, and Section 9 concludes this paper.

Section snippets

Background

A MapReduce application can be generally represented as ⟨input dataset, configurations, user code⟩. The input dataset is split into data blocks (e.g., 3 input blocks in Fig. 1) and stored on the distributed file system (e.g., HDFS (Had)). Before submitting an application to the MapReduce framework, users need to write user code (e.g., map()) according to the programming model and specify the application's configurations. While running, a MapReduce application (job) is split into multiple map/reduce tasks,

Empirical study on OOM errors

To understand and summarize the common causes of OOM errors in MapReduce applications, we perform an empirical study on 56 real-world OOM errors.1

We took real-world MapReduce applications that run atop Apache Hadoop as our study subjects. Since there are not any special bug repositories for OOM errors (JIRA mainly covers Hadoop framework bugs), users usually

Memory profiler design and implementation

Our empirical study has narrowed the root causes of OOM errors down to 7 cause patterns. However, it is still hard to diagnose the root causes of OOM errors in a running MapReduce application. To diagnose the OOM errors, we design a memory profiling tool named Mprof as shown in Fig. 3. Mprof can automatically profile and quantify the correlation between a MapReduce application’s runtime memory usage and its static information (input data, configurations, and user code). Mprof achieves this

OOM cause identification

After figuring out the correlation among the application’s configurations, dataflow, and memory usage, Mprof can further identify the root causes of OOM errors. Mprof first extracts large objects from heap dumps, and then uses quantitative rules to trace the large objects back to the problematic user code, data, and configurations. Fig. 5 illustrates the procedure of OOM cause identification when an OOM error occurs in a reducer. The concrete steps are as follows.

Evaluation

Our evaluation answers the following three questions:

  • RQ1: Can Mprof effectively diagnose the real-world OOM errors in MapReduce applications? We reproduced 28 real-world OOM errors, and then performed Mprof on them to see whether the root causes can be correctly identified.

  • RQ2: How much overhead does Mprof add to the running jobs? We rerun each job with and without Mprof, and regard their time difference as the overhead.

  • RQ3: How does Mprof trace OOM errors back to the problematic user code,

Limitation and discussion

This section discusses the limitations of our empirical study, Mprof's limitations and generality, how to select the number of heap dumps, and potential ways to improve Mprof's performance.

Related work

Failure study on big data applications Many researchers have studied the failures in big data applications/systems. Li et al. (2013) studied 250 failures in SCOPE jobs and found the root causes are undefined columns, wrong schemas, incorrect row format, etc. They also found 3 OOM errors that are caused by accumulating large data (e.g., all input rows) in memory. These 3 errors can be classified into the large accumulated results pattern in our study. Kavulya et al. (2010) analyzed 4100 failed Hadoop

Conclusion and future work

MapReduce applications frequently suffer from out of memory errors. In this paper, we performed a comprehensive study on 56 real-world OOM errors in MapReduce applications, and found the OOM root causes are improper job configurations, data skew, and memory-consuming user code. To diagnose OOM errors, we propose a memory profiler that can automatically figure out the correlation between a MapReduce application’s runtime memory usage and its static information. Based on the correlation, our

Acknowledgements

This work was supported by the National Key Research and Development Plan (2016YFB1000803), National Natural Science Foundation of China (61672506), and Beijing Natural Science Foundation (4164104).

Lijie Xu is an assistant research professor in the Institute of Software, Chinese Academy of Sciences. His research interests focus on distributed systems and software engineering, including distributed data-parallel frameworks, memory management techniques, and system reliability.

References (37)

  • Apache HBase, [Online]. Available:...
  • Apache Hive, [Online]. Available:...
  • Apache Mahout, [Online]. Available:...
  • Apache Pig, [Online]. Available:...
  • S. Cherem et al.

    Practical memory leak detection using guarded value-flow analysis

    Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI)

    (2007)
  • Cloud9: A Hadoop toolkit for working with big data [Online]. Available:...
  • J. Dean et al.

    MapReduce: Simplified data processing on large clusters

    6th Symposium on Operating System Design and Implementation (OSDI)

    (2004)
  • DISTINCT operator in Pig Latin [Online]. Available:...
  • Dominator tree [Online]. Available:...
  • Eclipse Memory Analyzer,...
  • Enhanced Eclipse MAT, [Online]. Available:...
  • Enhanced Hadoop-1.2, [Online]. Available:...
  • L. Fang et al.

    Interruptible tasks: Treating memory pressure as interrupts for highly scalable data-parallel programs

    ACM SIGOPS 25th Symposium on Operating Systems Principles (SOSP)

    (2015)
  • H.S. Gunawi et al.

    What bugs live in the cloud?: A study of 3000+ issues in cloud systems

    Proceedings of the ACM Symposium on Cloud Computing, Seattle (SoCC)

    (2014)
  • Hadoop Distributed File System, [Online]. Available:...
  • Hadoop mailing list, [Online]. Available:...
  • M. Isard et al.

    Dryad: distributed data-parallel programs from sequential building blocks

    Proceedings of the 2007 EuroSys Conference (EuroSys)

    (2007)
  • java.lang.OutOfMemoryError on running Hadoop job, [Online]. Available:...

    Wensheng Dou is an associate research professor in the Institute of Software, Chinese Academy of Sciences. His research interests focus on program analysis, including spreadsheet analysis, program comprehension and testing. A particular research interest lies in spreadsheet analysis.

    Feng Zhu got his PhD degree from Institute of software, Chinese Academy of Sciences, and is now working on large-scale data processing at Tencent Inc. His research interests focus on distributed computing, database and software engineering.

    Chushu Gao is an assistant research professor in the Institute of Software, Chinese Academy of Sciences. His research area is software engineering with a special focus on web-based and service-based applications.

    Jie Liu is an associate research professor in the Institute of Software, Chinese Academy of Sciences. His research area is software engineering and distributed computing, with a special focus on distributed system performance optimization and development environment for big data application. He has published over 20 papers over the last 5 years.

    Jun Wei is a research professor in the Institute of Software, Chinese Academy of Sciences. His area of research is software engineering and distributed computing, with emphasis on middleware and distributed software engineering. His interests include software analysis and verification techniques and tools for improving software reliability. He has published over 50 papers on international journals and conferences over the last 5 years. He is serving on the editorial board of Journal of Software and Journal of Frontiers of Computer Science and Technology. He is a senior member of CCF.
