ABSTRACT
Apache Spark is a high-speed in-memory computing engine that runs on the JVM. As data volumes grow, performance optimization requires careful management of the JVM heap space, and in particular of garbage collector (GC) pause times, which directly affect application performance. Spark exposes parameters that control JVM heap size and GC behavior; passing an appropriate heap size together with an appropriate GC algorithm is a performance-optimization technique known as Spark garbage collection tuning. To reduce GC overhead, an experiment was conducted that adjusted these parameters during data loading, DataFrame creation, and data retrieval. The results show a 3.23% improvement in latency and a 1.62% improvement in throughput compared with the default parameter configuration.
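The tuning described above amounts to passing heap-size and GC-type options to Spark at submit time. A minimal sketch of such an invocation is shown below; the memory sizes, the choice of G1GC, and the script name `my_job.py` are illustrative assumptions, not the exact settings used in the experiment.

```shell
# Illustrative only: heap sizes and the G1GC choice are example values,
# not the configuration reported in the paper's experiment.
spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.driver.memory=2g \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -verbose:gc" \
  --conf spark.memory.fraction=0.6 \
  my_job.py
```

Here `spark.executor.memory` sets the executor JVM heap, `spark.executor.extraJavaOptions` selects the collector and its pause-time goal, and `spark.memory.fraction` controls how much of the heap Spark reserves for execution and storage; these are the kinds of knobs varied in GC tuning.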