Abstract
Data centers, clusters, and grids have historically supported High-Performance Computing (HPC) applications. Due to the high capital and operational expenditures associated with such infrastructures, recent years have seen consistent efforts to run HPC applications in the cloud. The potential advantages of this shift include higher scalability and lower costs. While application instantiation through customized Virtual Machines (VMs) is a well-solved issue, the network still represents a significant bottleneck. When HPC applications are moved to the cloud, we lose control over where VMs are placed and over the paths that processes traverse to communicate with one another. To bridge this gap, we present Janus, a framework for dynamic, just-in-time path provisioning in cloud infrastructures. By leveraging emerging software-defined networking principles, the framework allows an HPC application, once deployed, to have its interprocess communication paths configured upon usage based on the least-used network links (instead of resorting to shortest, pre-computed paths). Janus is fully configurable to cope with different operating parameters and communication strategies, providing a rich ecosystem for speeding up application execution. Through an extensive experimental evaluation, we provide evidence that the proposed framework can lead to significant runtime gains. Moreover, we show what one can expect in terms of system overheads, providing essential insights on how to better benefit from Janus.
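To make the contrast between least-used and shortest paths concrete, the following is a minimal sketch, not the Janus implementation. It assumes a hypothetical table `links` that maps each undirected link to its current load (for example, bytes observed in the last polling interval) and picks the path with the lowest total load using Dijkstra's algorithm. The function name `least_used_path`, the load metric, and the switch names are illustrative assumptions and do not come from the paper.

```python
import heapq


def least_used_path(links, src, dst):
    """Return the src-to-dst path with the lowest total link load.

    `links` maps an undirected link (u, v) to its current load.
    The load metric is a hypothetical stand-in for whatever usage
    statistic an SDN controller collects from the switches.
    """
    # Build an adjacency list from the undirected link table.
    adj = {}
    for (u, v), load in links.items():
        adj.setdefault(u, []).append((v, load))
        adj.setdefault(v, []).append((u, load))

    # Dijkstra's algorithm with link load as the edge weight.
    best = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, load in adj.get(node, []):
            new_cost = cost + load
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))

    if dst not in best:
        return None  # no path between src and dst

    # Walk the predecessor chain back from dst to src.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical topology: the two-hop route via s2 is busy,
# while the three-hop route via s3 and s5 is nearly idle.
links = {
    ("s1", "s2"): 90,
    ("s2", "s4"): 80,
    ("s1", "s3"): 10,
    ("s3", "s5"): 5,
    ("s5", "s4"): 10,
}
path = least_used_path(links, "s1", "s4")
```

In this example, a hop-count shortest path would go through `s2`, whereas the load-based choice goes through `s3` and `s5`. Summing link loads is only one possible reading of "least-used"; minimizing the most loaded link on the path would be an equally plausible variant.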
Data availability
The datasets generated during and/or analysed during the current study are available from the first author on reasonable request.
Acknowledgements
This work was carried out as part of the project CloudHPC—Harnessing Cloud Computing to Power Up HPC Applications, BRICS Pilot Call 2016. It was partially supported by the Brazilian National Council for Scientific and Technological Development (CNPq), Project Numbers 441892/2016-7, Call CNPq/MCTIC/BRICS-STI No 18/2016, the Coordination for the Improvement of Higher Education Personnel (CAPES), as well as the National Key Cooperation between the BRICS Program of China (No. 2017YE0100500) and the Beijing Natural Science Foundation of China (No. 4172033).
Cite this article
Pretto, G.R., Dalmazo, B.L., Marques, J.A. et al. Janus: a framework to boost HPC applications in the cloud based on SDN path provisioning. Cluster Comput 25, 947–964 (2022). https://doi.org/10.1007/s10586-021-03470-6
DOI: https://doi.org/10.1007/s10586-021-03470-6