ABSTRACT
Optical backbone networks, the physical infrastructure interconnecting data centers, are the cornerstone of Wide-Area Network (WAN) connectivity and resilience. Yet there is limited research on failure characteristics and failure diagnosis in large-scale operational optical networks. This paper fills the gap by presenting a comprehensive analysis and modeling of optical network failures in a production optical backbone consisting of hundreds of sites and thousands of optical devices. We then present a diagnosis system for optical backbone failures, consisting of a multi-level dependency graph and a root-cause inference algorithm that spans the IP and optical layers. We also share our experience of operating this system for six years and introduce three methods that make its output actionable in practice. Our empirical evaluation shows that the system achieves 96% accuracy and reduces tickets by 95% for our optical backbone.