
Comprehensive techniques for multi-tenant deep learning framework on a Hadoop YARN cluster


Abstract

We have designed and implemented a new data processing framework called "MeLoN" (Multi-tenant dEep Learning framework On yarN), which aims to effectively support distributed deep learning applications, an emerging class of data-intensive workloads in the YARN-based Hadoop ecosystem. MeLoN is developed as a Hadoop YARN application, so it can transparently co-host existing deep learning applications alongside other data processing workflows. In this paper, we present comprehensive techniques for effectively supporting multiple deep learning applications in a Hadoop YARN cluster by leveraging a fine-grained GPU over-provisioning policy and a high-performance parallel file system for data staging, which together improve overall system throughput. Through extensive experiments with representative deep learning workloads, we demonstrate that MeLoN achieves an effective convergence of deep learning and the Hadoop big data platform by employing YARN-based resource allocation and execution mechanisms for distributed deep learning applications. We believe that MeLoN raises further interesting research issues, including profiling the expected GPU memory usage of deep learning applications and supporting more complex deep learning jobs through queuing systems, which can ultimately contribute to a new data processing framework in the YARN-based Hadoop ecosystem.
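As a rough illustration of the YARN-side mechanics the abstract refers to, the following minimal Java sketch shows how a YARN ApplicationMaster can request GPU-capable containers through Hadoop 3.x's "yarn.io/gpu" countable resource type. This is not MeLoN's actual code: the class name GpuContainerRequestSketch and all parameter values are assumptions for illustration, and MeLoN's fine-grained over-provisioning policy would layer additional logic on top of such requests.

    // Illustrative only: requesting GPU-capable containers from a YARN
    // ApplicationMaster in Hadoop 3.x, where GPUs are exposed as the
    // countable resource type "yarn.io/gpu". This is a sketch, not MeLoN's
    // actual request or over-provisioning logic.
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;

    public class GpuContainerRequestSketch {
      // Build a container request for one deep learning worker task.
      public static AMRMClient.ContainerRequest workerRequest(int memoryMb, int vcores, int gpus) {
        Resource capability = Resource.newInstance(memoryMb, vcores);
        // GPUs are requested through the pluggable resource type introduced
        // with Hadoop 3.1's GPU support on YARN.
        capability.setResourceValue("yarn.io/gpu", gpus);
        // No node/rack locality constraints; priority 0 is arbitrary here.
        return new AMRMClient.ContainerRequest(capability, null, null, Priority.newInstance(0));
      }
    }

An ApplicationMaster would typically submit such requests through AMRMClient#addContainerRequest after registering with the ResourceManager; the number of GPUs attached to each request is where a per-task over-provisioning policy like MeLoN's would plug in.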


Data availability

Not applicable.

Code availability

Not applicable.


Funding

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2019R1A2C1005360).

Author information


Contributions

Not applicable.

Corresponding author

Correspondence to Jik-Soo Kim.

Ethics declarations

Conflict of interest

Not applicable.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Heo, S., Kang, DC., Jang, H. et al. Comprehensive techniques for multi-tenant deep learning framework on a Hadoop YARN cluster. Cluster Comput 26, 2851–2864 (2023). https://doi.org/10.1007/s10586-022-03799-6

