DOI: 10.1145/3517745.3561447
research-article

Cross-layer diagnosis of optical backbone failures

Published: 25 October 2022

ABSTRACT

Optical backbone networks, the physical infrastructure interconnecting data centers, are the cornerstones of Wide-Area Network (WAN) connectivity and resilience. Yet, there is limited research on failure characteristics and diagnosis in large-scale operational optical networks. This paper fills the gap by presenting a comprehensive analysis and modeling of optical network failures from a production optical backbone consisting of hundreds of sites and thousands of optical devices. Subsequently, we present a diagnosis system for optical backbone failures, consisting of a multi-level dependency graph and a root-cause inference algorithm across the IP and optical layers. Further, we share our experiences of operating this system for six years and introduce three methods to make the outcome actionable in practice. With empirical evaluation, we demonstrate its high accuracy of 96% and a ticket reduction of 95% for our optical backbone.
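The abstract does not specify how the dependency graph or the inference algorithm is built, so the following is only a minimal sketch of the general idea of cross-layer root-cause inference, not the paper's method. It assumes a hypothetical two-level mapping from IP links to the optical components they ride on, and it swaps in a plain greedy set cover as the inference step. All names (`DEPENDS_ON`, `infer_root_causes`, the link and component identifiers) are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical dependency graph: IP link -> optical components it depends on.
# These identifiers are made up for illustration only.
DEPENDS_ON = {
    "ip-link-A": {"fiber-1", "tp-A"},
    "ip-link-B": {"fiber-1", "tp-B"},
    "ip-link-C": {"fiber-2", "tp-C"},
}


def infer_root_causes(failed_links, depends_on):
    """Greedy set cover over optical components (an assumed technique).

    Returns (causes, unexplained): the chosen components and any failed
    links that no candidate component can explain.
    """
    failed = set(failed_links)

    # Invert the graph: component -> IP links that ride on it.
    dependents = defaultdict(set)
    for link, components in depends_on.items():
        for comp in components:
            dependents[comp].add(link)

    # A component is a plausible culprit only if every link on it failed.
    candidates = {c: links for c, links in dependents.items() if links <= failed}

    unexplained = set(failed)
    causes = []
    while unexplained:
        # Pick the candidate explaining the most still-unexplained links;
        # sorting first makes ties resolve alphabetically.
        best = max(
            sorted(candidates),
            key=lambda c: len(candidates[c] & unexplained),
            default=None,
        )
        if best is None or not candidates[best] & unexplained:
            break
        causes.append(best)
        unexplained -= candidates[best]
    return causes, unexplained


if __name__ == "__main__":
    print(infer_root_causes({"ip-link-A", "ip-link-B"}, DEPENDS_ON))
    print(infer_root_causes({"ip-link-A"}, DEPENDS_ON))
```

In this toy setup, two IP links failing together point to their shared fiber, while a single failed link points to its own transponder because the shared fiber still carries a healthy link. A production system would need many more layers and signals than this sketch models.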

  • Published in

    IMC '22: Proceedings of the 22nd ACM Internet Measurement Conference
    October 2022
    796 pages
    ISBN: 9781450392594
    DOI: 10.1145/3517745

    Copyright © 2022 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article

    Acceptance Rates

    Overall Acceptance Rate: 277 of 1,083 submissions, 26%