DOI: 10.1145/3341302.3342073
research-article

Zooming in on wide-area latencies to a global cloud provider

Published: 19 August 2019

Abstract

The network communications between the cloud and the client have become the weak link for global cloud services that aim to provide low latency services to their clients. In this paper, we first characterize WAN latency from the viewpoint of a large cloud provider, Azure, whose network edges serve hundreds of billions of TCP connections a day across hundreds of locations worldwide. In particular, we focus on instances of latency degradation and design a tool, BlameIt, that enables cloud operators to localize the cause (i.e., faulty AS) of such degradation. BlameIt first uses passive diagnosis, based on measurements of existing connections between clients and the cloud locations, to localize the cause to one of the cloud, middle, or client segments. It then invokes selective active probing (within a probing budget) to localize the cause more precisely. We validate BlameIt by comparing its automatic fault localization results with those arrived at manually by network engineers, and observe that BlameIt correctly localized the problem in all 88 incidents. Further, BlameIt issues 72× fewer active probes than a solution relying on active probing alone, and is deployed in production at Azure.
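The two-phase idea in the abstract (coarse passive localization, then budgeted active probing) can be illustrated with a minimal sketch. This is not BlameIt's actual algorithm: the `Sample` fields, the per-sample `expected_ms` baseline, the 0.8 threshold, and the cloud-then-middle-then-client elimination order are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One passive RTT measurement. All field names are hypothetical."""
    cloud_location: str
    middle_path: str    # assumed label for the path between cloud and client AS
    client_as: str
    rtt_ms: float
    expected_ms: float  # assumed baseline RTT for this connection


def is_bad(s):
    """A sample is 'bad' when its RTT exceeds the assumed baseline."""
    return s.rtt_ms > s.expected_ms


def bad_fraction(samples):
    """Fraction of bad samples in a group (0.0 for an empty group)."""
    if not samples:
        return 0.0
    return sum(is_bad(s) for s in samples) / len(samples)


def localize(samples, threshold=0.8):
    """Map each bad (cloud, middle, client) triple to a coarse segment.

    Assumed rule: blame the cloud if most samples at that cloud location
    are bad; otherwise blame the middle if most samples on that path are
    bad; otherwise blame the client.
    """
    by_cloud = defaultdict(list)
    by_middle = defaultdict(list)
    for s in samples:
        by_cloud[s.cloud_location].append(s)
        by_middle[s.middle_path].append(s)

    verdicts = {}
    for s in samples:
        if not is_bad(s):
            continue
        key = (s.cloud_location, s.middle_path, s.client_as)
        if bad_fraction(by_cloud[s.cloud_location]) >= threshold:
            verdicts[key] = "cloud"
        elif bad_fraction(by_middle[s.middle_path]) >= threshold:
            verdicts[key] = "middle"
        else:
            verdicts[key] = "client"
    return verdicts


def probe_targets(samples, verdicts, budget):
    """Pick at most `budget` middle paths to probe, most bad samples first."""
    impact = defaultdict(int)
    for s in samples:
        key = (s.cloud_location, s.middle_path, s.client_as)
        if is_bad(s) and verdicts.get(key) == "middle":
            impact[s.middle_path] += 1
    ranked = sorted(impact, key=lambda path: (-impact[path], path))
    return ranked[:budget]
```

Only triples blamed on the middle segment compete for the probing budget in this sketch, which is one simple way to keep active probes far fewer than probing every degraded path.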

Supplementary Material

MP4 File (p104-jin.mp4)

Reviews

Mariam Kiran

The authors measure wide area network (WAN) latency from the viewpoint of a large cloud provider, Azure, by tracking the round-trip time (RTT) of transmission control protocol (TCP) connections. Presenting their tool, BlameIt, they aim to detect faults and diagnose where in the WAN they occur. Tracking down where a problem is happening in a large WAN is a pressing challenge in networks today. As networks grow and become more complex, it is difficult to find where and why problems occur, such as data not reaching its destination or packets being lost along the way. This paper presents a passive measurement tool to help localize certain problems in a WAN.

The paper first presents a measurement analysis of various aspects of the Azure network. It describes the datasets collected and how the authors use them to deduce (1) the countries in which bad RTT is commonly recorded, (2) how long these bad connections last, and (3) how they affect clients. It then goes on to present BlameIt. The tool passively records RTT-relevant data to understand where problems are happening: the cloud side, the middle, or the client side. A number of issues are recognized; for example, middle-segment problems dominate in India, China, and Brazil. The authors also found that the US has more directly related high RTTs than the rest of the world. Where there is latency degradation between client and cloud locations, the tool draws on autonomous system (AS) and border gateway protocol (BGP) information and combines passive measurements (TCP handshake RTTs) with selective active measurements (traceroutes) to localize issues.

The paper is easy to read, and it's exciting to see how Azure measures and determines where bad performance is happening on its network. In other networks, tools such as perfSONAR and loss measurements are used, and it would be interesting to see how Google Cloud Platform (GCP) and Amazon Web Services (AWS) measure their network performance. This paper is a good read for those working to improve network performance using machine learning.
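As a rough illustration of how a traceroute could narrow a degradation down to a single AS, the sketch below compares a baseline traceroute with a current one and reports the AS whose latency contribution grew the most. The review does not describe this procedure; the comparison rule, the function names, and the `(asn, cumulative_rtt_ms)` hop format are assumptions.

```python
def per_as_contribution(hops):
    """Return the latency added inside each AS along one traceroute.

    `hops` is a list of (asn, cumulative_rtt_ms) pairs ordered from the
    cloud toward the client (a hypothetical, already AS-mapped format).
    Traceroute RTTs are not always monotonic, so negative steps are
    clamped to zero and the running maximum is used as the previous RTT.
    """
    contribution = {}
    previous = 0.0
    for asn, rtt in hops:
        contribution[asn] = contribution.get(asn, 0.0) + max(rtt - previous, 0.0)
        previous = max(previous, rtt)
    return contribution


def blame_as(baseline_hops, current_hops):
    """Return (asn, increase_ms) for the AS whose contribution grew most.

    An AS missing from the baseline is treated as contributing 0 ms
    before the degradation. `current_hops` must not be empty.
    """
    if not current_hops:
        raise ValueError("current_hops must contain at least one hop")
    baseline = per_as_contribution(baseline_hops)
    current = per_as_contribution(current_hops)
    increases = {asn: current[asn] - baseline.get(asn, 0.0) for asn in current}
    culprit = max(increases, key=increases.get)
    return culprit, increases[culprit]
```

With placeholder AS labels, `blame_as([("AS-cloud", 5.0), ("AS-transit", 20.0), ("AS-client", 30.0)], [("AS-cloud", 5.0), ("AS-transit", 60.0), ("AS-client", 71.0)])` singles out the transit AS, because its contribution rises from 15 ms to 55 ms while the other two barely change.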

Published In

SIGCOMM '19: Proceedings of the ACM Special Interest Group on Data Communication
August 2019
526 pages
ISBN:9781450359566
DOI:10.1145/3341302

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. active network probes
  2. internet latency measurement
  3. network diagnosis
  4. network fault localization
  5. tomography
  6. wide-area network

Qualifiers

  • Research-article

Conference

SIGCOMM '19: ACM SIGCOMM 2019 Conference
August 19 - 23, 2019
Beijing, China

Acceptance Rates

Overall Acceptance Rate 462 of 3,389 submissions, 14%

Article Metrics

  • Downloads (Last 12 months): 81
  • Downloads (Last 6 weeks): 11
Reflects downloads up to 15 Feb 2025

Cited By

  • (2024) Panorama. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, pages 935-949. DOI: 10.5555/3691992.3692049. Online publication date: 10-Jul-2024
  • (2024) L3: Latency-aware Load Balancing in Multi-Cluster Service Mesh. Proceedings of the 25th International Middleware Conference, pages 49-61. DOI: 10.1145/3652892.3654793. Online publication date: 2-Dec-2024
  • (2024) A Multifaceted Look at Starlink Performance. Proceedings of the ACM Web Conference 2024, pages 2723-2734. DOI: 10.1145/3589334.3645328. Online publication date: 13-May-2024
  • (2024) MLPing: Real-Time Proactive Fault Detection and Alarm for Large-Scale Distributed IDC Network. 2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS), pages 913-924. DOI: 10.1109/ICDCS60910.2024.00089. Online publication date: 23-Jul-2024
  • (2024) Clearing Clouds from the Horizon: Latency Characterization of Public Cloud Service Platforms. 2024 33rd International Conference on Computer Communications and Networks (ICCCN), pages 1-9. DOI: 10.1109/ICCCN61486.2024.10637605. Online publication date: 29-Jul-2024
  • (2024) Recent Advancements of Public Edge Platforms. 5G Edge Computing, pages 17-43. DOI: 10.1007/978-981-97-0213-8_2. Online publication date: 3-Jan-2024
  • (2023) Using Gaming Footage as a Source of Internet Latency Information. Proceedings of the 2023 ACM on Internet Measurement Conference, pages 606-626. DOI: 10.1145/3618257.3624816. Online publication date: 24-Oct-2023
  • (2023) Realizing Fine-Grained Inference of AS Path With a Generative Measurable Process. IEEE/ACM Transactions on Networking, 31:6, pages 3112-3127. DOI: 10.1109/TNET.2023.3270565. Online publication date: Dec-2023
  • (2023) FlowPinpoint: Localizing Anomalies in Cloud-client Services for Cloud Providers. IEEE Transactions on Cloud Computing, pages 1-15. DOI: 10.1109/TCC.2023.3257162. Online publication date: 2023
  • (2023) WAN-INT: Cost-Effective In-Band Network Telemetry in WAN With A Performance-aware Path Planner. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), pages 1861-1868. DOI: 10.1109/ICPADS60453.2023.00256. Online publication date: 17-Dec-2023
