
A Dependency-Aware Storage Schema Selection Mechanism for In-Memory Big Data Computing Frameworks

Published in: International Journal of Parallel Programming

Abstract

Artificial intelligence applications that rely heavily on deep learning and computer vision have become popular. Their strong demand for low-latency or real-time service makes Spark, an in-memory big data computing framework, the preferred replacement for earlier disk-based big data computing. In an in-memory framework, sensible data placement in storage is the key factor for performance. However, existing optimizations based on cache replacement strategies and storage selection mechanisms all rely on an imprecise model of available memory and can lead to negative decisions. To address this issue, we propose an available memory model that captures accurate information about the memory space to be freed by sensing the dependencies between data. We also propose a maximum memory requirement model that predicts execution demand while excluding the redundancy of inactive blocks. With these two models we build DASS, a dependency-aware storage selection mechanism that lets Spark make dynamic, fine-grained storage decisions. Our experiments show that, compared with previous methods, DASS effectively reduces the cost of garbage collection and RDD block re-computation, improving computing performance by up to 77.4%.
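The core idea of the abstract can be illustrated with a small sketch. The paper's actual models and implementation are not reproduced here; the function and variable names below are hypothetical, and the policy is a minimal toy: a block whose downstream references have all been consumed is treated as freeable, so "available memory" includes it, and the storage decision for a new block falls back to spilling only when even the dependency-aware estimate cannot fit it.

```python
# Illustrative sketch only (NOT the paper's DASS implementation): a
# dependency-aware estimate of available memory, used to pick a storage
# level for a new cached block. All names here are hypothetical.

def available_memory(capacity, cached_blocks):
    """capacity: total memory budget.
    cached_blocks: list of (size, remaining_refs) pairs, where
    remaining_refs counts downstream tasks that still need the block.
    A block with zero remaining references can be freed, so its space
    counts as available even though it is currently occupied."""
    used = sum(size for size, _ in cached_blocks)
    freeable = sum(size for size, refs in cached_blocks if refs == 0)
    return capacity - used + freeable

def choose_storage(block_size, capacity, cached_blocks):
    """Keep the block purely in memory if the dependency-aware estimate
    says it fits; otherwise allow spilling to disk rather than forcing
    eviction and garbage-collection churn."""
    if block_size <= available_memory(capacity, cached_blocks):
        return "MEMORY_ONLY"
    return "MEMORY_AND_DISK"

cached = [(300, 2), (200, 0), (100, 1)]  # (size, remaining references)
# A naive model sees 1000 - 600 = 400 free; the dependency-aware model
# also counts the 200-unit block with no remaining references.
print(available_memory(1000, cached))       # 600
print(choose_storage(500, 1000, cached))    # MEMORY_ONLY
```

A dependency-oblivious policy would report only 400 units free here and push the 500-unit block toward disk unnecessarily, which is the kind of negative decision the abstract describes.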



Notes

  1. An extended algorithm of CSAS.


Acknowledgements

Jie Tang is the corresponding author of this paper. This work is supported by South China University of Technology Start-up Grant No. D61600470, Guangzhou Technology Grant No. 201707010148, the Fundamental Research Funds for the Central Universities Grant No. 2017MS057, and the National Natural Science Foundation of China under Grant No. 61370062.


Corresponding author

Correspondence to Jie Tang.



Cite this article

Wang, B., Tang, J., Zhang, R. et al. A Dependency-Aware Storage Schema Selection Mechanism for In-Memory Big Data Computing Frameworks. Int J Parallel Prog 47, 502–519 (2019). https://doi.org/10.1007/s10766-018-0612-8
