Abstract
Datacenter applications (e.g., web search, recommendation systems, and social networking) are designed with high fanout to achieve scalable performance. A corollary of this design is frequent fabric congestion (e.g., due to incast or imperfect hashing), even when overall network utilization is low. Such congestion exhibits both temporal and spatial (intra-rack and inter-rack) variation. Two basic design paradigms address this problem, and current solutions lie somewhere between them. At one end are arbiter-based approaches, in which senders poll a centralized arbiter and collectively obey global scheduling decisions. At the other end are self-adjusting endpoint-based approaches, in which senders independently adjust their transmission rates based on observed congestion. The former incurs greater overhead; the latter trades optimality for simplicity. Our work seeks a middle ground: the optimality of arbiter-based approaches with the simplicity of self-adjusting endpoint-based approaches. Our key design principle is that, since the receiver has complete information about the flows destined for it, the receiver itself can orchestrate those flows, rather than a centralized arbiter scheduling them or senders making independent scheduling decisions. Because multiple receivers may share a bottleneck link, datapath visibility should be used to ensure fair sharing of the bottleneck capacity among receivers with minimal overhead. We propose RecFlow, a receiver-based proactive congestion control scheme. RecFlow uses OpenFlow-provided path visibility to track changing bottlenecks on the fly, and it spaces TCP acknowledgements to prevent traffic bursts and to ensure that no receiver exceeds its fair share of the bottleneck capacity. The goal is to reduce buffer overflows while maintaining fairness among flows and high link utilization.
Using extensive simulations and a real testbed evaluation, we show that, compared to the state of the art, RecFlow achieves up to a 6× improvement in the inter-rack scenario and 1.5× in the intra-rack scenario, while sharing link capacity fairly among all flows.
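To make the ACK-spacing idea in the abstract concrete, the sketch below computes the inter-ACK interval a receiver would use so that its senders collectively transmit at the receiver's fair share of a bottleneck link. This is a hypothetical illustration only: the function name, parameters, and the assumption that one ACK clocks out roughly one MSS-sized segment are ours, not taken from the paper; RecFlow's actual mechanism additionally relies on OpenFlow path visibility to discover the bottleneck and the number of receivers sharing it.

```python
def ack_interval(bottleneck_bps: float, n_receivers: int,
                 mss_bytes: int = 1460) -> float:
    """Seconds to wait between releasing ACKs so that the senders of one
    receiver collectively send at that receiver's fair share of the
    bottleneck link (capacity divided equally among receivers)."""
    fair_share_bps = bottleneck_bps / max(n_receivers, 1)
    # TCP is ACK-clocked: each ACK releases roughly one MSS of new data,
    # so spacing ACKs at (segment bits / fair-share rate) paces the senders.
    return (mss_bytes * 8) / fair_share_bps

# Example: a 10 Gbps bottleneck shared by 4 receivers gives each a
# 2.5 Gbps fair share, i.e. one ACK every ~4.7 microseconds.
interval = ack_interval(10e9, 4)
```

As congestion moves (e.g., from an intra-rack to an inter-rack bottleneck), the receiver would recompute this interval from the current bottleneck's capacity and sharer count.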
Cite this article
Khan, A.Z., Qazi, I.A. RecFlow: SDN-based receiver-driven flow scheduling in datacenters. Cluster Comput 23, 289–306 (2020). https://doi.org/10.1007/s10586-019-02922-4