Tail Latency in Datacenter Networks

  • Conference paper
Modelling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2020)

Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 12527)

Abstract

One of the major challenges in cloud data centers is satisfying service-level agreements without significant over-provisioning. Predictable performance is critical for many interactive applications. While much of the focus, particularly in theoretical models, has been on reducing average latency, the skewed tail of the latency distribution is much harder to reduce, even with over-provisioning. In this paper, we take two approaches to mitigating tail latency in data centers. The first bridges selected edge and aggregate switches to reduce east-west traffic latency. The second schedules dependent tasks according to their dependency graph, a directed acyclic graph (DAG). A queuing network model has been developed which can be used to reduce the average latency. Numerical and simulation results show that the techniques are effective in reducing the average tail latency within a data center network.
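
To make the gap between average and tail latency concrete, the following is a minimal sketch and not the paper's queuing network model. It assumes a single stable M/M/1 first-come-first-served queue, for which the time in system is exponentially distributed with rate (service rate minus arrival rate). The function names and the example rates are illustrative assumptions. The last helper uses the standard fan-out argument: if a request waits on n independent tasks, the chance that at least one of them lands beyond its own q-quantile is 1 - q^n.

```python
import math


def mm1_mean_sojourn(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for a stable M/M/1 FCFS queue (assumed model)."""
    if not 0 < arrival_rate < service_rate:
        raise ValueError("need 0 < arrival_rate < service_rate")
    return 1.0 / (service_rate - arrival_rate)


def mm1_sojourn_quantile(arrival_rate: float, service_rate: float, q: float) -> float:
    """q-quantile of time in system; the sojourn time is Exp(service - arrival)."""
    if not 0 < arrival_rate < service_rate:
        raise ValueError("need 0 < arrival_rate < service_rate")
    if not 0 < q < 1:
        raise ValueError("q must lie strictly between 0 and 1")
    return -math.log(1.0 - q) / (service_rate - arrival_rate)


def prob_any_task_slow(n_tasks: int, q: float) -> float:
    """Probability that at least one of n independent tasks exceeds its q-quantile."""
    return 1.0 - q ** n_tasks


if __name__ == "__main__":
    lam, mu = 8.0, 10.0  # illustrative rates (requests per unit time)
    mean = mm1_mean_sojourn(lam, mu)
    p99 = mm1_sojourn_quantile(lam, mu, 0.99)
    print(f"mean latency: {mean:.3f}")
    print(f"p99 latency:  {p99:.3f} ({p99 / mean:.2f}x the mean)")
    print(f"P(any of 100 tasks beyond p99): {prob_any_task_slow(100, 0.99):.3f}")
```

Under these assumptions the 99th percentile is ln(100), roughly 4.6, times the mean at any load. Adding capacity shrinks the mean and the tail by the same factor, so the ratio between them stays fixed, and a request that fans out to many tasks is slowed by the tail far more often than a single task is.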



Author information


Correspondence to Assad Althoubi.


Copyright information

© 2021 Springer Nature Switzerland AG


Cite this paper

Althoubi, A., Alshahrani, R., Peyravi, H. (2021). Tail Latency in Datacenter Networks. In: Calzarossa, M.C., Gelenbe, E., Grochla, K., Lent, R., Czachórski, T. (eds) Modelling, Analysis, and Simulation of Computer and Telecommunication Systems. MASCOTS 2020. Lecture Notes in Computer Science, vol 12527. Springer, Cham. https://doi.org/10.1007/978-3-030-68110-4_17


  • DOI: https://doi.org/10.1007/978-3-030-68110-4_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-68109-8

  • Online ISBN: 978-3-030-68110-4

  • eBook Packages: Computer Science, Computer Science (R0)
