Abstract
A major challenge in cloud data centers is satisfying service-level agreements without significant over-provisioning. Predictable performance is critical for many interactive applications. Most work, particularly in theoretical models, has focused on reducing average latency, yet the skewed tail of the latency distribution is much harder to reduce, even with over-provisioning. In this paper, we take two approaches to mitigating tail latency in data centers. The first bridges selected edge and aggregation switches to reduce the latency of east-west traffic. The second schedules dependent tasks according to their dependency graph, a directed acyclic graph (DAG). We also develop a queuing network model that can be used to reduce the average latency. Numerical and simulation results show that the techniques are effective in reducing the average tail latency within a data center network.
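To make the distinction between average and tail latency concrete, the following minimal sketch compares the mean of a skewed latency sample with its 50th and 99th percentiles. The latency values are hypothetical and are not taken from the chapter, and the nearest-rank percentile definition used here is an assumption (one common convention among several).

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so that very small p still returns a sample.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds (illustrative only):
# 98 fast requests and 2 stragglers.
latencies_ms = [10.0] * 98 + [500.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms, p50={p50_ms:.1f} ms, p99={p99_ms:.1f} ms")
```

In this toy sample the mean stays close to the typical request, while the 99th percentile is dominated by the two stragglers, which is why a model tuned only for average latency can leave the tail largely untouched.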
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Althoubi, A., Alshahrani, R., Peyravi, H. (2021). Tail Latency in Datacenter Networks. In: Calzarossa, M.C., Gelenbe, E., Grochla, K., Lent, R., Czachórski, T. (eds) Modelling, Analysis, and Simulation of Computer and Telecommunication Systems. MASCOTS 2020. Lecture Notes in Computer Science, vol 12527. Springer, Cham. https://doi.org/10.1007/978-3-030-68110-4_17
DOI: https://doi.org/10.1007/978-3-030-68110-4_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-68109-8
Online ISBN: 978-3-030-68110-4
eBook Packages: Computer Science, Computer Science (R0)