DOI: 10.1145/3538401.3546601
research-article

SmartTags: bridging applications and network for proactive performance management

Published: 25 August 2022

ABSTRACT

Sudden changes in applications and events in the network are often related. Many datacenter applications go through sudden state changes (such as a query-response in a Memcached application) that may result in an event in the network (such as a change in utilization or packet drops). Existing works do not fully leverage the relationship between application state changes and network events and, as a result, provide limited performance improvements. An ideal application and network management system should automatically identify the sources of sudden changes in the application, host, or network and relate these changes to network events to enable proactive network management. In this work, we propose SmartTags, a system that automatically learns which application state changes (Tags) are related to network events and uses this information for proactive network management. At a high level, SmartTags is orthogonal to current network-application integration (NAI) approaches. It provides a systematic way for application developers and network designers to automatically learn the relationship between application behavior and network events. Through simple, small-scale experiments on a real testbed, we demonstrate that SmartTags can improve the training time of a distributed machine learning application by 27% while reducing loss to zero. Similarly, it can improve the query completion time of Memcached by 32% while achieving near-zero loss. We envision even greater gains in large-scale distributed systems.


    • Published in

      NAI '22: Proceedings of the ACM SIGCOMM Workshop on Network-Application Integration
      August 2022
      70 pages
      ISBN: 9781450393959
      DOI: 10.1145/3538401

      Copyright © 2022 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall acceptance rate: 12 of 24 submissions, 50%
