DOI: 10.1145/3132747.3132759
research-article

CrystalNet: Faithfully Emulating Large Production Networks

Published: 14 October 2017

Abstract

Network reliability is critical for large clouds and online service providers like Microsoft. Our network is large, heterogeneous, and complex, and it undergoes constant churn. In such an environment, even small issues triggered by device failures, buggy device software, configuration errors, unproven management tools, and unavoidable human errors can quickly cause large outages. A promising way to minimize such outages is to proactively validate all network operations in a high-fidelity network emulator before they are carried out in production. To this end, we present CrystalNet, a cloud-scale, high-fidelity network emulator. It runs real network device firmware in a network of containers and virtual machines, loaded with production configurations. Network engineers can use the same management tools and methods to interact with the emulated network as they do with the production network. CrystalNet can handle heterogeneous device firmware and can scale to emulate thousands of network devices in a matter of minutes. To reduce resource consumption, it carefully selects the boundary of the emulation while ensuring that network changes still propagate correctly. Microsoft's network engineers use CrystalNet on a daily basis to test planned network operations. Our experience shows that CrystalNet enables operators to detect many issues that could otherwise trigger significant outages.
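To make the notion of an emulation boundary concrete, here is a minimal sketch. It is not CrystalNet's algorithm. It assumes the topology is an undirected adjacency map and that the set of emulated devices has already been chosen, and it only computes the boundary, meaning the non-emulated devices that are directly linked to emulated ones. It does not check whether that boundary preserves correct propagation of network changes, which is the property CrystalNet is described as ensuring. The function name `emulation_boundary` and all device names are hypothetical.

```python
from typing import Dict, Set


def emulation_boundary(topology: Dict[str, Set[str]], emulated: Set[str]) -> Set[str]:
    """Return devices outside `emulated` that share a link with an emulated device.

    Hypothetical helper for illustration only: `topology` maps each device
    name to the set of its neighbors (links are assumed undirected).
    """
    boundary: Set[str] = set()
    for device in emulated:
        # Any neighbor that is not itself emulated sits on the boundary.
        for neighbor in topology.get(device, set()):
            if neighbor not in emulated:
                boundary.add(neighbor)
    return boundary


# Hypothetical two-tier example: two ToRs under two leaves, leaves under one spine.
topology = {
    "tor1": {"leaf1", "leaf2"},
    "tor2": {"leaf1", "leaf2"},
    "leaf1": {"tor1", "tor2", "spine1"},
    "leaf2": {"tor1", "tor2", "spine1"},
    "spine1": {"leaf1", "leaf2"},
}

# Emulate the ToRs and leaves; the spine stays outside the emulation.
print(sorted(emulation_boundary(topology, {"tor1", "tor2", "leaf1", "leaf2"})))  # prints ['spine1']
```

In this toy example, emulating the ToRs and leaves leaves only `spine1` on the boundary. A real emulator would also have to decide how boundary devices behave toward the emulated ones, which this sketch does not model.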

Supplementary Material

MP4 File (crystalnet.mp4)

Cited By

  • (2025) Verifying Network-level Properties for Large-scale Networks with Header Transformations in Realtime. Journal of Information Processing 33, pp. 41-54. DOI: 10.2197/ipsjjip.33.41. Online publication date: 2025.
  • (2025) Kollaps: Decentralized and Efficient Network Emulation for Large-Scale Systems. IEEE Transactions on Networking 33:1, pp. 35-50. DOI: 10.1109/TNET.2024.3478050. Online publication date: Feb-2025.
  • (2024) Kivi. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, pp. 509-527. DOI: 10.5555/3691992.3692024. Online publication date: 10-Jul-2024.

    Information

    Published In

    SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles
    October 2017
    677 pages
    ISBN:9781450350853
    DOI:10.1145/3132747
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Emulation
    2. Network
    3. Reliability
    4. Verification

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SOSP '17

    Acceptance Rates

    Overall Acceptance Rate 174 of 961 submissions, 18%

    Article Metrics

    • Downloads (last 12 months): 90
    • Downloads (last 6 weeks): 7
    Reflects downloads up to 07 Mar 2025.

    Citations

    • (2024) Klonet. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, pp. 2025-2045. DOI: 10.5555/3691825.3691936. Online publication date: 16-Apr-2024.
    • (2024) Reasoning about network traffic load property at production scale. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, pp. 1063-1081. DOI: 10.5555/3691825.3691884. Online publication date: 16-Apr-2024.
    • (2024) Crescent. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, pp. 1045-1062. DOI: 10.5555/3691825.3691883. Online publication date: 16-Apr-2024.
    • (2024) MESSI. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, pp. 1009-1023. DOI: 10.5555/3691825.3691881. Online publication date: 16-Apr-2024.
    • (2024) Netcastle. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, pp. 993-1008. DOI: 10.5555/3691825.3691880. Online publication date: 16-Apr-2024.
    • (2024) A Review of Intelligent Configuration and Its Security for Complex Networks. Chinese Journal of Electronics 33:4, pp. 920-947. DOI: 10.23919/cje.2023.00.001. Online publication date: Jul-2024.
    • (2024) The Case for Validating Inputs in Software-Defined WANs. Proceedings of the 23rd ACM Workshop on Hot Topics in Networks, pp. 246-254. DOI: 10.1145/3696348.3696874. Online publication date: 18-Nov-2024.
