ABSTRACT
Intra-host networks, which comprise heterogeneous devices and interconnect fabrics, have become increasingly complex and critical. However, today's intra-host networks do not provide sufficient manageability. This prevents data center operators from running a reliable and efficient end-to-end network, especially in multi-tenant clouds. In this paper, we analyze the main manageability deficiencies of intra-host networks and argue that a systematic solution is needed to bridge this functionality gap. We propose two key building blocks for a manageable intra-host network: a fine-grained monitoring system and a holistic resource manager. We also discuss the research questions that arise in realizing these two building blocks.
Index Terms
- Towards a Manageable Intra-Host Network