Abstract
Artificial intelligence applications that depend heavily on deep learning and computer vision processing are becoming popular. Their strong demand for low-latency or real-time services makes Spark, an in-memory big data computing framework, the best choice for replacing earlier disk-based big data computing. For an in-memory framework, a reasonable arrangement of data in storage is the key factor in performance. However, existing optimizations based on cache replacement strategies and storage selection mechanisms all rely on an imprecise model of available memory, which leads to poor decisions. To address this issue, we propose an available memory model that captures accurate information about the memory space that is about to be freed by sensing the dependencies between data. We also propose a maximum memory requirement model for execution prediction that excludes the redundancy from inactive blocks. With these two models, we build DASS, a dependency-aware storage selection mechanism for Spark that makes dynamic and fine-grained storage decisions. Our experiments show that, compared with previous methods, DASS effectively reduces the cost of garbage collection and of re-computing RDD blocks, improving computing performance by 77.4%.
Notes
An extended algorithm of CSAS.
Acknowledgements
Jie Tang is the corresponding author of this paper. This work is supported by South China University of Technology Start-up Grant No. D61600470, Guangzhou Technology Grant No. 201707010148, the Fundamental Research Funds for the Central Universities Grant No. 2017MS057, and National Science Foundation of China under Grant No. 61370062.
Cite this article
Wang, B., Tang, J., Zhang, R. et al. A Dependency-Aware Storage Schema Selection Mechanism for In-Memory Big Data Computing Frameworks. Int J Parallel Prog 47, 502–519 (2019). https://doi.org/10.1007/s10766-018-0612-8
DOI: https://doi.org/10.1007/s10766-018-0612-8