Abstract
Explicit congestion notification (ECN) has been widely adopted by recent proposals for high-throughput, low-latency datacenter network transport. In these ECN-based proposals, when the queue length of a switch exceeds a predefined threshold, the switch marks all arriving packets with ECN to explicitly notify their senders to reduce their rates. Such a design enables the network to eliminate congestion quickly. However, it marks packets without considering flow state, so it may throttle flows unnecessarily, especially those that send only a few packets, resulting in significant throughput loss and long flow completion times. In this paper, we propose a novel flow-aware ECN marking approach (FECN), which improves throughput and flow completion time by taking flow state into consideration. By selectively marking packets according to their flow rates, FECN enables the network to slow down precisely the high-speed flows and avoid congestion without penalizing low-speed short flows. Moreover, FECN does not require switches to maintain per-flow state, which keeps its overhead low and makes it easy to implement and deploy in commodity switches. Simulations show that FECN can shorten the flow completion time by up to 44.7% and reduce the throughput loss by up to 40.3%, compared with prior flow-agnostic ECN marking approaches.
References
Sun, G., Zhu, G., Yu, H., et al.: Cost-efficient service function chain orchestration for low-latency applications in NFV networks. IEEE Syst. J. (2018)
Alizadeh, M., Yang, S., Sharif, M., Katti, S. et al.: pFabric: minimal near-optimal datacenter transport. In: Proc. SIGCOMM, pp. 435–446 (2013)
Hoff, T.: Latency is everywhere and it costs you sales - how to crush it. http://highscalability.com/blog/2009/7/25/latency-is-everywhere-and-it-costs-you-sales-how-to-crush-it.html (2009)
Sun, G., Li, Y., Vasilakos, A., Guizani, M.: Energy-efficient and traffic-aware service function chaining orchestration in multi-domain networks. Future Gener. Comput. Syst. 91, 347–360 (2019)
Sun, G., Yu, H.: A new technique for efficient live migration of multiple virtual machines. Future Gener. Comput. Syst. 55, 74–86 (2016)
Sun, G., Liao, D., Yu, H.: Live migration for multiple correlated virtual machines in cloud-based data centers. IEEE Trans. Serv. Comput. 11(2), 279–291 (2018)
Munir, A., Qazi, I.: Minimizing flow completion times in data centers. In: Proc. IEEE INFOCOM, pp. 2157–2165 (2013)
Alizadeh, M., Greenberg, A., Maltz, D. et al.: Data center TCP (DCTCP). In: Proc. SIGCOMM, pp. 63–74 (2010)
Luo, S., Hongfang, Y., Zhao, Y., Wang, S., Shui, Y., Li, L.: Towards practical and near-optimal coflow scheduling for data center networks. IEEE Trans. Parallel Distrib. Syst. 27(11), 3366–3380 (2016)
Zhu, Y., Eran, H., Firestone, D., Guo, C. et al.: Congestion Control for Large-Scale RDMA Deployments. In: Proc. SIGCOMM (2015)
Wu, H., Ju, J., Lu, G., Guo, C., Xiong, Y., Zhang, Y.: Tuning ECN for data center networks. In: CoNEXT (2012)
Bai, W., Chen, L., Chen, K., Wu, H.: Enabling ECN in multi-service multi-queue data centers. In: USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 537–549 (2016)
Floyd, S., Jacobson, V.: Random early detection gateways for congestion avoidance. IEEE/ACM Trans. Netw. 1(4), 397–413 (1993)
Shan, D., Ren, F.: Improving ECN marking scheme with micro-burst traffic in data center networks. In: INFOCOM (2017)
The Network Simulator NS-3. https://www.nsnam.org/
Lin, D., Morris, R.: Dynamics of random early detection. In: Proc. SIGCOMM, pp. 127–137 (1997)
Zhao, Z., Jiang, Z., Lu, C. et al.: Towards coordinated congestion control and load balancing in datacenter networks. In: Global Communications Conference (GLOBECOM), IEEE (2013)
Alizadeh, M., Kabbani, A., et al.: Less is more: trading a little bandwidth for ultra-low latency in the data center. In: Usenix Conference on Networked Systems Design and Implementation pp. 19–19 (2012)
Pan, R., Prabhakar, B., Psounis, K.: CHOKe: a stateless active queue management scheme for approximating fair bandwidth allocation. In: INFOCOM (2000)
Lakshman, T., Wong, L.: SRED: stabilized RED. In: Proceedings of INFOCOM pp. 1346–1355 (1999)
Mittal, R., Radhika, V., et al.: TIMELY: RTT-based congestion control for the datacenter. In: Proc. ACM SIGCOMM, pp. 537–550 (2015)
Lee, C., Park, C.: DX: latency-based congestion control for datacenters. IEEE/ACM Trans. Netw. 25(1), 335–348 (2017)
Zhao, Z., Li, Q., et al.: Reduce completion time and guarantee throughput by transport with slight congestion. In: IEEE International Conference on Communications (ICC), pp. 1–6 (2016)
Bai, W., Chen, K., et al.: Enabling ECN over generic packet scheduling. In: International Conference on Emerging Networking Experiments and Technologies (CoNEXT), pp. 191–204 (2016)
Wilson, C., Ballani, H.: Better never than late: meeting deadlines in datacenter networks. ACM SIGCOMM Comput. Commun. Rev. 41(4), 50–61 (2011)
Hong, C., Caesar, M., Godfrey, P.: Finishing flows quickly with preemptive scheduling. ACM SIGCOMM Comput. Commun. Rev. 42(4), 127–138 (2012)
RFC 791. https://tools.ietf.org/html/rfc791
Nichols, K., Jacobson, V.: Controlling queue delay. Commun. ACM 55, 1–7 (2012)
Lu, Y., et al.: Multi-path transport for RDMA in datacenters. In: 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2018)
Alizadeh, M., Yang, S., Sharif, M., Katti, S., McKeown, N., Prabhakar, B., Shenker, S.: pFabric: minimal near-optimal datacenter transport. In: ACM SIGCOMM (2013)
Perry, J., Ousterhout, A., Balakrishnan, H., Shah, D., Fugal, H.: Fastpass: A centralized zero-queue datacenter network. In: Proc. ACM SIGCOMM (2014)
Perry, J., Balakrishnan, H., Shah, D.: Flowtune: flowlet control for datacenter networks. In: NSDI (2017)
Vamanan, B., Hasan, J., Vijaykumar, T. N.: Deadline-aware datacenter TCP (D2TCP). In: Proc. ACM SIGCOMM (2012)
Gao, C., Lee, V.C.S., Li, K.: DemePro: DEcouple packet marking from enqueuing for multiple services with PROactive congestion control. IEEE Trans. Cloud Comput. 1, 1–1 (2017)
Zats, D., et al.: DeTail: reducing the flow completion time tail in datacenter networks. In: Proc. ACM SIGCOMM (2012)
Sun, G., Liao, D., Zhao, D., Sun, Z., Chang, V.: Towards provisioning hybrid virtual networks in federated cloud data centers. Future Gener. Comput. Syst. 87, 457–469 (2018)
Alizadeh, M., Kabbani, A., Atikoglu, B., Prabhakar, B.: Stability analysis of QCN: the averaging principle. In: SIGMETRICS (2011)
Alizadeh, M., Javanmard, A., Prabhakar, B.: Analysis of DCTCP: Stability, convergence and fairness. In: SIGMETRICS (2011)
Cisco White Paper: Intelligent Buffer Management on Cisco Nexus 9000 Series Switches. https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/white-paper-c11-738488.html
Lee, C., Nakagawa, Y., Hyoudou, K., Kobayashi, S., Shiraki, O., Shimizu, T.: Flow-aware congestion control to improve throughput under TCP incast in datacenter networks. In: IEEE 39th Annual Computer Software and Applications Conference, Taichung, pp. 155–162 (2015)
Sivaraman, A., et al.: Programmable packet scheduling at line rate. In: Proc. ACM SIGCOMM, pp. 44–57 (2016)
Sharma, N. et al.: Approximating fair queueing on reconfigurable switches. In: USENIX Symposium on Networked Systems Design and Implementation (2018)
Acknowledgements
This research was partially supported by the National Natural Science Foundation of China (61571098), Fundamental Research Funds for the Central Universities (ZYGX2016J217), the 111 Project (B14039), and Fundamental Research Funds for the Central Universities (2682019CX61).
Appendix
Before introducing our fluid model, we briefly review FECN at the switch side and the DCTCP [8] algorithm at the source side. Then, as in [37, 38], we model N greedy flows at a single bottleneck with capacity C. We assume that all flows have the same RTT.
Switch side. In FECN, the switch makes ECN marking decisions based on the flows' sending rates. The sending rate of each flow is calculated at the source side and carried in the headers of the packets it sends. The switch reads the sending rate carried in the packet headers and estimates the average sending rate of all flows traversing the same output port, denoted aveRate. For simplicity, we assume that aveRate accurately estimates the average sending rate, although this does not hold exactly in practice. This approximation does not change the fundamental behavior of marking packets from flows with relatively high sending rates, so the model still captures the FECN dynamics. Then aveRate can be modeled as follows:
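Read together with the definitions of N and \(r_i(t)\) given next, this average is the arithmetic mean of the per-flow rates. A reconstruction under that reading, without the original equation number, is:

```latex
\[
\mathrm{aveRate}(t) = \frac{1}{N}\sum_{i=1}^{N} r_i(t)
\]
```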
where N is the number of flows traversing the same output port at the switch and \(r_i(t)\) is the sending rate of the ith flow at time t. In our implementation, we use the congestion window w(t) as the value of the sending rate r(t) to reduce the overhead at the source side. The model of aveRate(t) can therefore be rewritten as follows:
The marking probability of packets from the ith flow can be modeled as follows:
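Substituting the congestion window \(w_i(t)\) for \(r_i(t)\), as the sentence above describes, gives the following form. This is a reconstruction from the stated substitution and assumes the same arithmetic mean:

```latex
\[
\mathrm{aveRate}(t) = \frac{1}{N}\sum_{i=1}^{N} w_i(t)
\]
```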
where q(t) is the instantaneous queue length at time t.
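To make the switch-side idea concrete, the following is a minimal sketch of a marking decision that uses the rate carried in each packet header and a single running average per port, with no per-flow table. It is not the marking-probability model of this appendix: the class name `FlowAwareMarker`, the EWMA estimator with weight `gain`, the threshold `threshold_pkts`, and the hard rule "mark only if the queue exceeds the threshold and the carried rate exceeds the average" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FlowAwareMarker:
    """Hypothetical per-port state: one running average, no per-flow table."""

    threshold_pkts: int        # assumed queue-length marking threshold (packets)
    gain: float = 0.125        # assumed EWMA weight for the rate estimate
    ave_rate: float = 0.0      # running estimate of the mean carried rate
    seen_packet: bool = False  # False until the first packet initializes ave_rate

    def on_packet(self, carried_rate: float, queue_len_pkts: int) -> bool:
        """Update the estimate from the header value; return True to set ECN."""
        if not self.seen_packet:
            # The first packet initializes the estimate directly.
            self.ave_rate = carried_rate
            self.seen_packet = True
        else:
            # Per-packet EWMA of the rates carried in headers (an assumption).
            self.ave_rate += self.gain * (carried_rate - self.ave_rate)
        congested = queue_len_pkts > self.threshold_pkts
        # Assumed rule: under congestion, mark only flows faster than average.
        return congested and carried_rate > self.ave_rate
```

Because the average is updated once per packet, faster flows contribute more samples than slower ones, so this estimator is biased upward relative to the true per-flow mean. That is one way in which an estimate such as aveRate can deviate from the exact average assumed in the model.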
Source side. The DCTCP source maintains an estimate of the fraction of its packets that are ECN marked. This estimate, \(\alpha\), is updated once per RTT (or per window of data) as follows:
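In the DCTCP algorithm [8], this update is an exponentially weighted moving average of the marked fraction. Written with the symbols F and g defined next, and without the original equation number:

```latex
\[
\alpha \leftarrow (1-g)\,\alpha + g\,F
\]
```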
where F is the fraction of packets that were marked in the most recent RTT, and g is a fixed parameter. DCTCP uses \(\alpha\) to reduce its window size in response to marked ACKs as follows:
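In the DCTCP algorithm [8], the window is cut in proportion to \(\alpha\) rather than halved. Using w for the window size, as in the rest of this appendix (the symbol choice here is an assumption about the original notation):

```latex
\[
w \leftarrow w\left(1 - \frac{\alpha}{2}\right)
\]
```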
Note that DCTCP reduces its window size at most once per RTT. For each ACK, DCTCP increases its window size like TCP as follows:
State equations. We choose w and \(\alpha\) at each source and q(t) at the switch as the state variables of the system. The evolution of the state variables describes the dynamics of the system. The evolution of q(t) can be modeled as follows:
where R(t) is the RTT at time t, \(w_i(t)\) is the window size of the ith flow, and \(\frac{\sum _{i=1}^N w_i(t)}{R(t)}\) is the average total input rate. R(t) can be modeled as follows:
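In TCP congestion avoidance, the per-ACK increase is \(w \leftarrow w + 1/w\), which adds about one packet per RTT. The three source-side rules can be sketched together as below. This is a simplified illustration, not the authors' implementation: the class name `DctcpSource`, the default `g = 1/16`, the initial window, the floor of one packet, and the choice to apply the cut at the end of the RTT (rather than on the first marked ACK) are assumptions.

```python
class DctcpSource:
    """Minimal DCTCP-style sender state; windows are measured in packets."""

    def __init__(self, g: float = 1.0 / 16, init_window: float = 10.0):
        self.g = g                 # EWMA weight for alpha (assumed default)
        self.alpha = 0.0           # estimated fraction of marked packets
        self.window = init_window  # congestion window w
        self.acked = 0             # ACKs seen in the current RTT
        self.marked = 0            # ECN-marked ACKs seen in the current RTT

    def on_ack(self, ecn_marked: bool) -> None:
        """Count the ACK and apply the per-ACK additive increase w += 1/w."""
        self.acked += 1
        if ecn_marked:
            self.marked += 1
        self.window += 1.0 / self.window

    def on_rtt_end(self) -> None:
        """Update alpha once per RTT and cut the window at most once."""
        if self.acked == 0:
            return
        frac = self.marked / self.acked  # F: marked fraction in this RTT
        self.alpha = (1.0 - self.g) * self.alpha + self.g * frac
        if self.marked > 0:
            # Multiplicative decrease scaled by alpha, floored at one packet.
            self.window = max(1.0, self.window * (1.0 - self.alpha / 2.0))
        self.acked = 0
        self.marked = 0
```

With a small \(\alpha\) the cut is gentle, and with \(\alpha = 1\) it reduces to TCP's halving, which is the behavior the multiplicative-decrease rule is meant to capture.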
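From the description that follows, the queue grows at the total input rate minus the bottleneck capacity C. A reconstruction consistent with those definitions, without the original equation number, is:

```latex
\[
\frac{dq(t)}{dt} = \frac{\sum_{i=1}^{N} w_i(t)}{R(t)} - C
\]
```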
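Given the definitions that follow (propagation delay d plus queueing delay q(t)/C), the RTT takes the form below. This is a reconstruction from those definitions:

```latex
\[
R(t) = d + \frac{q(t)}{C}
\]
```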
where d is the propagation delay (assumed to be equal for all flows) and q(t) / C is the queueing delay at the switch. The evolution of \(\alpha _i(t)\) at source i can be modeled as follows:
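The per-RTT update \(\alpha \leftarrow (1-g)\alpha + gF\) changes \(\alpha\) by \(g(F - \alpha)\) once every R(t). Its fluid approximation, written here as a reconstruction that follows the DCTCP analysis in [38], is:

```latex
\[
\frac{d\alpha_i(t)}{dt} = \frac{g}{R(t)}\bigl(F_i(t) - \alpha_i(t)\bigr)
\]
```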
The expected value of ECN marked packets is \(p_i(t-R(t))w_i(t-R(t))\) in the last RTT, and the fraction of marked packets can be estimated as \(F_i(t)=p_i(t-R(t))w_i(t-R(t))/w_i(t-R(t))=p_i(t-R(t)).\) Then we can obtain
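Substituting \(F_i(t) = p_i(t-R(t))\) into the fluid form of the \(\alpha\) update, \(d\alpha_i/dt = (g/R(t))(F_i(t) - \alpha_i(t))\), gives the following. This is a reconstruction from the substitution stated above, without the original equation number:

```latex
\[
\frac{d\alpha_i(t)}{dt} = \frac{g}{R(t)}\bigl(p_i(t-R(t)) - \alpha_i(t)\bigr)
\]
```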
The evolution of \(w_i(t)\) can be modeled as follows:
where \(p_{noECN}\) is the probability of no ECN marked packets in the last RTT and \(p_{hasECN}\) is the probability of receiving ECN marked packets in the last RTT. Combining (19), (20) and (21), we can obtain
This equation models the additive-increase and multiplicative-decrease behavior of DCTCP.
Cite this article
Zhou, P., Yu, H., Sun, G. et al. Flow-aware explicit congestion notification for datacenter networks. Cluster Comput 22, 1431–1446 (2019). https://doi.org/10.1007/s10586-019-02919-z
DOI: https://doi.org/10.1007/s10586-019-02919-z