
Comprehensive techniques for multi-tenant deep learning framework on a Hadoop YARN cluster


Abstract

We have designed and implemented a new data processing framework called "MeLoN" (Multi-tenant dEep Learning framework On yarN), which aims to effectively support distributed deep learning applications, an emerging class of data-intensive workloads in the YARN-based Hadoop ecosystem. MeLoN is developed as a Hadoop YARN application, so it can transparently co-host existing deep learning applications alongside other data processing workflows. In this paper, we present comprehensive techniques for effectively supporting multiple deep learning applications in a Hadoop YARN cluster by leveraging a fine-grained GPU over-provisioning policy and a high-performance parallel file system for data staging, which together improve overall system throughput. Through extensive experiments with representative deep learning workloads, we demonstrate that MeLoN achieves an effective convergence of deep learning and the Hadoop big data platform by employing YARN-based resource allocation and execution mechanisms for distributed deep learning applications. We believe that MeLoN raises further interesting research issues, including profiling the expected GPU memory usage of deep learning applications and supporting more complex deep learning jobs through queuing systems, which can ultimately contribute to a new data processing framework in the YARN-based Hadoop ecosystem.
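As a rough illustration of the YARN-side mechanics the abstract refers to, the following minimal Java sketch shows how a YARN ApplicationMaster can request GPU-capable containers through Hadoop 3.x's "yarn.io/gpu" countable resource type. This is not MeLoN's actual code: the class name GpuContainerRequestSketch and all parameter values are assumptions for illustration, and MeLoN's fine-grained over-provisioning policy would layer additional logic on top of such requests.

    // Illustrative only: requesting GPU-capable containers from a YARN
    // ApplicationMaster in Hadoop 3.x, where GPUs are exposed as the
    // countable resource type "yarn.io/gpu". This is a sketch, not MeLoN's
    // actual request or over-provisioning logic.
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;

    public class GpuContainerRequestSketch {
      // Build a container request for one deep learning worker task.
      public static AMRMClient.ContainerRequest workerRequest(int memoryMb, int vcores, int gpus) {
        Resource capability = Resource.newInstance(memoryMb, vcores);
        // GPUs are requested through the pluggable resource type introduced
        // with Hadoop 3.1's GPU support on YARN.
        capability.setResourceValue("yarn.io/gpu", gpus);
        // No node/rack locality constraints; priority 0 is arbitrary here.
        return new AMRMClient.ContainerRequest(capability, null, null, Priority.newInstance(0));
      }
    }

An ApplicationMaster would typically submit such requests through AMRMClient#addContainerRequest after registering with the ResourceManager; the number of GPUs attached to each request is where a per-task over-provisioning policy like MeLoN's would plug in.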


Data availability

Not applicable.

Code availability

Not applicable.


Funding

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2019R1A2C1005360).

Author information


Contributions

Not applicable.

Corresponding author

Correspondence to Jik-Soo Kim.

Ethics declarations

Conflict of interest

Not applicable.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Heo, S., Kang, DC., Jang, H. et al. Comprehensive techniques for multi-tenant deep learning framework on a Hadoop YARN cluster. Cluster Comput 26, 2851–2864 (2023). https://doi.org/10.1007/s10586-022-03799-6

