
Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis

Authors:
Chuanxiong Guo (Microsoft, Redmond, WA, USA)
Lihua Yuan (Microsoft, Redmond, WA, USA)
Dong Xiang (Microsoft, Redmond, WA, USA)
Yingnong Dang (Microsoft, Redmond, WA, USA)
Ray Huang (Microsoft, Beijing, China)
Dave Maltz (Microsoft, Redmond, WA, USA)
Zhaoyi Liu (Microsoft, Beijing, China)
Vin Wang (Microsoft, Redmond, WA, USA)
Bin Pang (Microsoft, Redmond, WA, USA)
Hua Chen (Microsoft, Beijing, China)
Zhi-Wei Lin (Microsoft, Redmond, USA)
Varugis Kurien (Midfin Systems, Redmond, WA, USA)

ACM SIGCOMM Computer Communication Review, Volume 45, Issue 4, October 2015, pp 139–152. https://doi.org/10.1145/2829988.2787496

Published: 17 August 2015


Abstract

Can we get the network latency between any two servers at any time in large-scale data center networks? The collected latency data can then be used to address a series of challenges: telling whether an application-perceived latency issue is caused by the network, defining and tracking network service level agreements (SLAs), and automatic network troubleshooting. We have developed the Pingmesh system for large-scale data center network latency measurement and analysis to answer this question affirmatively. Pingmesh has been running in Microsoft data centers for more than four years, and it collects tens of terabytes of latency data per day. Pingmesh is widely used not only by network software developers and engineers, but also by application and service developers and operators.



Index Terms

1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing
2. Networks
  1. Network performance evaluation
    1. Network measurement
  2. Network services
    1. Cloud computing
    2. Network monitoring


Published in
ACM SIGCOMM Computer Communication Review Volume 45, Issue 4
SIGCOMM'15
October 2015
659 pages
ISSN:0146-4833
DOI:10.1145/2829988
Editors:
Konstantina Papagiannaki (Telefonica Research, Barcelona, Spain)
Katerina Argyraki (EPFL, Switzerland)
Hitesh Ballani (Microsoft Research Cambridge, UK)
Fabián Bustamante (Northwestern University, USA)
Joseph Camp (SMU, USA)
Augustin Chaintreau (Columbia University, USA)
Phillipa Gill (Stony Brook University, USA)
Marco Mellia (Politecnico di Torino, Italy)
Bhaskaran Raman (IIT Bombay, India)
Joel Sommers (Colgate University, USA)
Aline Carneiro Viana (INRIA, France)
SIGCOMM '15: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication
August 2015
684 pages
ISBN:9781450335423
DOI:10.1145/2785956
General Chairs:
Steve Uhlig (Queen Mary University of London, UK)
Olaf Maennel (Tallinn U. of Technology, Estonia)
Program Chairs:
Brad Karp (University College London, UK)
Jitendra Padhye (Microsoft, USA)
Copyright © 2015 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data center networking
network troubleshooting
silent packet drops
Qualifiers
- research-article