Abstract
Lightweight virtualization technologies are now widely deployed in data centers and HPC clusters to provide efficient and elastic resource provisioning. Virtualization has also been extended to the I/O stack of the operating system. For example, the virtual switch has become the primary provider of I/O services for data movement among lightweight virtual machines, such as containers managed by Docker and Kubernetes. However, I/O stack virtualization introduces performance degradation and scalability bottlenecks for the data movements of HPC computing frameworks, such as MPI-based collective data movements and bursty asynchronous data movements. To study these bottlenecks, we quantify and analyze the performance degradation of HPC data movements on virtual clusters. We then design a set of two-stage methods that proactively adapt the virtual network and the data movement procedures, which enhances the performance of HPC collective data movements by up to 3\(\times\). In addition, we design a cross-layer middleware to improve the performance and scalability of bursty asynchronous data movements. Our evaluation shows that it improves the performance of a real scientific application by 34.6%.
Acknowledgements
This work is supported by the National Key R&D Program of China under Grant No. 2018YFB0204303, NSFC U1811461, Guangdong Natural Science Foundation 2018B030312002, and the Major Program of Guangdong Basic and Applied Research 2019B030302002.
Cite this article
Huang, D., Lu, Y. Improving the efficiency of HPC data movement on container-based virtual cluster. CCF Trans. HPC 2, 67–80 (2020). https://doi.org/10.1007/s42514-020-00025-w
DOI: https://doi.org/10.1007/s42514-020-00025-w