Tail Latency in Datacenter Networks

Althoubi, Assad; Alshahrani, Reem; Peyravi, Hssan

doi:10.1007/978-3-030-68110-4_17

Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 12527)

Included in the following conference series:

Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems

Abstract

One of the major challenges in cloud service data centers is to satisfy service-level agreements without significant over-provisioning. Achieving predictable performance is critical for many interactive applications. While the focus, particularly in theoretical models, has been on reducing average latency, the skewed tail of the latency distribution is much harder to reduce, even with over-provisioning. In this paper, we take two approaches to mitigating tail latency in data centers. The first bridges selected edge and aggregate switches to reduce east-west traffic latency. The second schedules dependent tasks according to their dependency graph, a directed acyclic graph (DAG). We develop a queuing network model that can be used to reduce the average latency. Numerical and simulation results show that both techniques are effective in reducing the average tail latency within a data center network.
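To make the gap between average and tail latency concrete, the following is a minimal sketch that assumes a single M/M/1 first-come-first-served queue. This is a textbook simplification, not the queuing network model developed in the paper. In such a queue the time a request spends in the system is exponentially distributed with rate (service rate minus arrival rate), so both the mean and any quantile have closed forms. The function name `mm1_sojourn_stats` and the example rates are hypothetical.

```python
import math


def mm1_sojourn_stats(arrival_rate, service_rate, quantile=0.99):
    """Return (mean, quantile) of the time in system for a stable M/M/1 FCFS queue.

    Assumption: Poisson arrivals and exponential service times, so the
    time in system is exponential with rate (service_rate - arrival_rate).
    """
    if not 0 < arrival_rate < service_rate:
        raise ValueError("a stable queue needs 0 < arrival_rate < service_rate")
    if not 0 < quantile < 1:
        raise ValueError("quantile must lie strictly between 0 and 1")
    drain_rate = service_rate - arrival_rate
    mean_latency = 1.0 / drain_rate
    # Inverse CDF of the exponential distribution at the requested quantile.
    tail_latency = -math.log(1.0 - quantile) / drain_rate
    return mean_latency, tail_latency


if __name__ == "__main__":
    # Hypothetical rates in requests per second (80% utilisation).
    mean_latency, p99_latency = mm1_sojourn_stats(arrival_rate=8.0, service_rate=10.0)
    print(f"mean = {mean_latency:.3f} s, p99 = {p99_latency:.3f} s")
```

Under these assumptions the 99th percentile is always ln(100), roughly 4.6 times, the mean, whatever the load. Real data center latency distributions are typically more skewed than this idealised case.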

Author information

Authors and Affiliations

Department of Computer Science, Kent State University, Kent, OH, 44242, USA
Assad Althoubi & Hssan Peyravi
Department of Computer Science, Taif University, Taif, 26571, Kingdom of Saudi Arabia
Reem Alshahrani

Corresponding author

Correspondence to Assad Althoubi.

Editor information

Editors and Affiliations

Department of Industrial and Information Engineering, University of Pavia, Pavia, Italy
Maria Carla Calzarossa
Institute of Theoretical and Applied Informatics, Gliwice, Poland
Erol Gelenbe
Polish Academy of Sciences, Institute of Theoretical and Applied Informatics, Gliwice, Poland
Krzysztof Grochla
University of Houston, Houston, TX, USA
Ricardo Lent
Institute of Theoretical and Applied Informatics of the Polish Academy of Sciences, Gliwice, Poland
Tadeusz Czachórski

About this paper

Cite this paper

Althoubi, A., Alshahrani, R., Peyravi, H. (2021). Tail Latency in Datacenter Networks. In: Calzarossa, M.C., Gelenbe, E., Grochla, K., Lent, R., Czachórski, T. (eds) Modelling, Analysis, and Simulation of Computer and Telecommunication Systems. MASCOTS 2020. Lecture Notes in Computer Science, vol 12527. Springer, Cham. https://doi.org/10.1007/978-3-030-68110-4_17

DOI: https://doi.org/10.1007/978-3-030-68110-4_17
Published: 29 January 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-68109-8
Online ISBN: 978-3-030-68110-4
eBook Packages: Computer Science, Computer Science (R0)
