Understanding the Limits of Passive Realtime Datacenter Fault Detection and Localization


Abstract:

Datacenters are characterized by large scale, stringent reliability requirements, and significant application diversity. However, the realities of employing hardware with non-zero failure rates mean that datacenters are subject to significant numbers of failures that can impact performance. Moreover, failures are not always obvious; network components can fail partially, dropping or delaying only subsets of packets. Thus, traditional fault detection techniques involving end-host or router-based statistics can fall short in their ability to identify these errors. We describe how to expedite the process of detecting and localizing partial datacenter faults using an end-host method generalizable to most datacenter applications. In particular, we correlate end-host transport-layer flow metrics with per-flow network paths and apply statistical analysis techniques to identify outliers and localize faulty links and/or switches. We evaluate our approach in a production Facebook front-end datacenter, focusing on its effectiveness across a range of traffic patterns.
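The correlation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each flow record carries its network path as a list of link identifiers along with retransmission and packet counts, and it uses a generic median/MAD outlier test in place of whatever statistical analysis the paper applies. The function name `localize_outlier_links`, the record layout, and the threshold of 3.5 are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def localize_outlier_links(flows, threshold=3.5):
    """Flag links whose aggregate retransmission rate is an outlier.

    flows: iterable of (path, retransmits, packets) tuples, where path is a
    sequence of link identifiers (hypothetical record layout, not the paper's).
    Returns a sorted list of suspect link identifiers.
    """
    retx = defaultdict(int)
    pkts = defaultdict(int)

    # Attribute each flow's transport-layer counters to every link on its path.
    for path, retransmits, packets in flows:
        for link in path:
            retx[link] += retransmits
            pkts[link] += packets

    rates = {link: retx[link] / pkts[link] for link in pkts if pkts[link] > 0}
    if not rates:
        return []

    # Modified z-score based on the median absolute deviation (MAD).
    med = median(rates.values())
    mad = median(abs(rate - med) for rate in rates.values())
    if mad == 0:
        # Most links share the same rate; flag any link strictly above it.
        return sorted(link for link, rate in rates.items() if rate > med)
    return sorted(
        link
        for link, rate in rates.items()
        if 0.6745 * (rate - med) / mad > threshold
    )


if __name__ == "__main__":
    # Synthetic records (made up for illustration): link "E" drops far more.
    sample = [
        (("A",), 1, 1000),
        (("B",), 2, 1000),
        (("C",), 3, 1000),
        (("D",), 2, 1000),
        (("E",), 50, 1000),
    ]
    print(localize_outlier_links(sample))
```

Comparing links against their peers, rather than against a fixed loss threshold, is what lets a sketch like this surface a partially failing link whose absolute drop rate is still small.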
Published in: IEEE/ACM Transactions on Networking (Volume: 27, Issue: 5, October 2019)
Page(s): 2001 - 2014
Date of Publication: 13 September 2019
