ABSTRACT
Sudden changes in application state and events in the network are often related. Many datacenter applications go through sudden state changes (such as a query-response in a MemCached application) that may trigger an event in the network (such as a spike in utilization or packet drops). Existing work does not fully leverage the relationship between application state changes and network events and, as a result, provides limited performance improvements. An ideal application and network management system should automatically identify the sources of sudden changes in the application, host, or network and relate these changes to network events to enable proactive network management. In this work, we propose SmartTags, a system that automatically learns which application state changes (Tags) are related to network events and uses this information for proactive network management. At a high level, SmartTags is orthogonal to current network-application integration (NAI) approaches. It provides a systematic way for application developers and network designers to automatically learn the relationship between application behavior and network events. Through simple, small-scale experiments on a real testbed, we demonstrate that SmartTags can improve the training time of a distributed machine learning application by 27% while reducing loss to zero. Similarly, it can improve the query completion time of MemCached by 32% while achieving near-zero loss. We anticipate larger gains in large-scale distributed systems.
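The abstract does not describe how SmartTags learns which Tags are related to network events. As a purely illustrative sketch, and not the authors' method, one simple way to relate the two is to score each tag by the fraction of its occurrences that are followed by a network event within a short time window. The function names, the `window` parameter, and the `threshold` value below are all hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def tag_event_scores(tags, events, window):
    """Score each tag by how often a network event follows it.

    tags:   list of (timestamp, tag_name) pairs (hypothetical input format)
    events: list of network-event timestamps
    window: maximum delay between a tag and an event for them to count as related
    Returns {tag_name: fraction of occurrences followed by an event within window}.
    """
    event_times = sorted(events)
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ts, name in tags:
        totals[name] += 1
        # Index of the first event at or after this tag occurrence.
        i = bisect_left(event_times, ts)
        if i < len(event_times) and event_times[i] - ts <= window:
            hits[name] += 1
    return {name: hits[name] / totals[name] for name in totals}


def related_tags(scores, threshold=0.8):
    """Return the tags whose score meets the (assumed) threshold, sorted by name."""
    return sorted(name for name, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    # Made-up trace for illustration only.
    tags = [(0.0, "query_sent"), (1.0, "query_sent"), (0.5, "log_flush")]
    events = [0.1, 1.05]
    print(related_tags(tag_event_scores(tags, events, window=0.3)))
```

A co-occurrence score like this only captures temporal correlation. A real system would also need to handle tags that fire constantly and events with multiple candidate causes.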
Index Terms
- SmartTags: bridging applications and network for proactive performance management