
Flow-aware explicit congestion notification for datacenter networks

Published in Cluster Computing

Abstract

Explicit congestion notification (ECN) has been widely adopted by recent proposals for building high-throughput, low-latency datacenter network transport. In these ECN-based proposals, when the queue length of a switch exceeds a pre-defined threshold, the switch marks all arriving packets with ECN to explicitly notify their senders to slow down. Such a design enables the network to eliminate congestion quickly. However, it marks packets without considering flow state, which may over-throttle flows, especially those that send only a few packets, resulting in significant throughput loss and long flow completion times. In this paper, we propose a novel flow-aware ECN marking approach (FECN), which improves throughput and flow completion time by taking flow state into consideration. By selectively marking packets according to their flow rates, FECN enables the network to precisely slow down high-speed flows to avoid congestion without penalizing low-speed short flows. Moreover, FECN does not require switches to maintain per-flow state, which yields low overhead and makes FECN easy to implement and deploy on commodity switches. Simulations show that FECN can shorten flow completion time by up to 44.7% and reduce throughput loss by up to 40.3% compared with a prior flow-agnostic ECN marking approach.



References

  1. Sun, G., Zhu, G., Yu, H., et al.: Cost-efficient service function chain orchestration for low-latency applications in NFV networks. IEEE Syst. J. (2018)

  2. Alizadeh, M., Yang, S., Sharif, M., Katti, S., et al.: pFabric: minimal near-optimal datacenter transport. In: Proc. SIGCOMM, pp. 435–446 (2013)

  3. Hoff, T.: Latency is everywhere and it costs you sales: how to crush it. http://highscalability.com/blog/2009/7/25/latency-is-everywhere-and-it-costs-you-sales-how-to-crush-it.html (2009)

  4. Sun, G., Li, Y., Vasilakos, A., Guizani, M.: Energy-efficient and traffic-aware service function chaining orchestration in multi-domain networks. Future Gener. Comput. Syst. 91, 347–360 (2019)

  5. Sun, G., Yu, H.: A new technique for efficient live migration of multiple virtual machines. Future Gener. Comput. Syst. 55, 74–86 (2016)

  6. Sun, G., Liao, D., Yu, H.: Live migration for multiple correlated virtual machines in cloud-based data centers. IEEE Trans. Serv. Comput. 11(2), 279–291 (2018)

  7. Munir, A., Qazi, I.: Minimizing flow completion times in data centers. In: Proc. IEEE INFOCOM, pp. 2157–2165 (2013)

  8. Alizadeh, M., Greenberg, A., Maltz, D., et al.: Data center TCP (DCTCP). In: Proc. SIGCOMM, pp. 63–74 (2010)

  9. Luo, S., Yu, H., Zhao, Y., Wang, S., Yu, S., Li, L.: Towards practical and near-optimal coflow scheduling for data center networks. IEEE Trans. Parallel Distrib. Syst. 27(11), 3366–3380 (2016)

  10. Zhu, Y., Eran, H., Firestone, D., Guo, C., et al.: Congestion control for large-scale RDMA deployments. In: Proc. SIGCOMM (2015)

  11. Wu, H., Ju, J., Lu, G., Guo, C., Xiong, Y., Zhang, Y.: Tuning ECN for data center networks. In: Proc. CoNEXT (2012)

  12. Bai, W., Chen, L., Chen, K., Wu, H.: Enabling ECN in multi-service multi-queue data centers. In: Proc. USENIX NSDI, pp. 537–549 (2016)

  13. Floyd, S., Jacobson, V.: Random early detection gateways for congestion avoidance. IEEE/ACM Trans. Netw. 1(4), 397–413 (1993)

  14. Shan, D., Ren, F.: Improving ECN marking scheme with micro-burst traffic in data center networks. In: Proc. INFOCOM (2017)

  15. The Network Simulator NS-3. https://www.nsnam.org/

  16. Lin, D., Morris, R.: Dynamics of random early detection. In: Proc. SIGCOMM, pp. 127–137 (1997)

  17. Zhao, Z., Jiang, Z., Lu, C., et al.: Towards coordinated congestion control and load balancing in datacenter networks. In: Proc. IEEE GLOBECOM (2013)

  18. Alizadeh, M., Kabbani, A., et al.: Less is more: trading a little bandwidth for ultra-low latency in the data center. In: Proc. USENIX NSDI (2012)

  19. Pan, R., Prabhakar, B., Psounis, K.: CHOKe: a stateless active queue management scheme for approximating fair bandwidth allocation. In: Proc. INFOCOM (2000)

  20. Lakshman, T., Wong, L.: SRED: stabilized RED. In: Proc. INFOCOM, pp. 1346–1355 (1999)

  21. Mittal, R., et al.: TIMELY: RTT-based congestion control for the datacenter. In: Proc. ACM SIGCOMM, pp. 537–550 (2015)

  22. Lee, C., Park, C.: DX: latency-based congestion control for datacenters. IEEE/ACM Trans. Netw. 25(1), 335–348 (2017)

  23. Zhao, Z., Li, Q., et al.: Reduce completion time and guarantee throughput by transport with slight congestion. In: Proc. IEEE ICC, pp. 1–6 (2016)

  24. Bai, W., Chen, K., et al.: Enabling ECN over generic packet scheduling. In: Proc. ACM CoNEXT, pp. 191–204 (2016)

  25. Wilson, C., Ballani, H.: Better never than late: meeting deadlines in datacenter networks. ACM SIGCOMM Comput. Commun. Rev. 41(4), 50–61 (2011)

  26. Hong, C., Caesar, M., Godfrey, P.: Finishing flows quickly with preemptive scheduling. ACM SIGCOMM Comput. Commun. Rev. 42(4), 127–138 (2012)

  27. RFC 791: Internet Protocol. https://tools.ietf.org/html/rfc791

  28. Nichols, K., Jacobson, V.: Controlling queue delay. Commun. ACM 55, 1–7 (2012)

  29. Lu, Y., et al.: Multi-path transport for RDMA in datacenters. In: Proc. USENIX NSDI (2018)

  30. Alizadeh, M., Yang, S., Sharif, M., Katti, S., McKeown, N., Prabhakar, B., Shenker, S.: pFabric: minimal near-optimal datacenter transport. In: Proc. ACM SIGCOMM (2013)

  31. Perry, J., Ousterhout, A., Balakrishnan, H., Shah, D., Fugal, H.: Fastpass: a centralized "zero-queue" datacenter network. In: Proc. ACM SIGCOMM (2014)

  32. Perry, J., Balakrishnan, H., Shah, D.: Flowtune: flowlet control for datacenter networks. In: Proc. USENIX NSDI (2017)

  33. Vamanan, B., Hasan, J., Vijaykumar, T.N.: Deadline-aware datacenter TCP (D2TCP). In: Proc. ACM SIGCOMM (2012)

  34. Gao, C., Lee, V.C.S., Li, K.: DemePro: DEcouple packet marking from enqueuing for multiple services with PROactive congestion control. IEEE Trans. Cloud Comput. (2017)

  35. Zats, D., et al.: DeTail: reducing the flow completion time tail in datacenter networks. In: Proc. ACM SIGCOMM (2012)

  36. Sun, G., Liao, D., Zhao, D., Sun, Z., Chang, V.: Towards provisioning hybrid virtual networks in federated cloud data centers. Future Gener. Comput. Syst. 87, 457–469 (2018)

  37. Alizadeh, M., Kabbani, A., Atikoglu, B., Prabhakar, B.: Stability analysis of QCN: the averaging principle. In: Proc. SIGMETRICS (2011)

  38. Alizadeh, M., Javanmard, A., Prabhakar, B.: Analysis of DCTCP: stability, convergence, and fairness. In: Proc. SIGMETRICS (2011)

  39. Cisco White Paper: Intelligent Buffer Management on Cisco Nexus 9000 Series Switches. https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/white-paper-c11-738488.html

  40. Lee, C., Nakagawa, Y., Hyoudou, K., Kobayashi, S., Shiraki, O., Shimizu, T.: Flow-aware congestion control to improve throughput under TCP incast in datacenter networks. In: Proc. IEEE COMPSAC, pp. 155–162 (2015)

  41. Sivaraman, A., et al.: Programmable packet scheduling at line rate. In: Proc. ACM SIGCOMM, pp. 44–57 (2016)

  42. Sharma, N., et al.: Approximating fair queueing on reconfigurable switches. In: Proc. USENIX NSDI (2018)


Acknowledgements

This research was partially supported by the National Natural Science Foundation of China (61571098), Fundamental Research Funds for the Central Universities (ZYGX2016J217), the 111 Project (B14039), and Fundamental Research Funds for the Central Universities (2682019CX61).

Author information


Corresponding author

Correspondence to Hongfang Yu.

Appendix


Before introducing our fluid model, we first briefly review FECN on the switch side and the DCTCP [8] algorithm on the source side. Then, following [37, 38], we model N greedy flows sharing a single bottleneck link of capacity C. We assume that all flows have the same RTT.

Switch side In FECN, the switch makes ECN marking decisions based on flows’ sending rates. The sending rate of each flow is calculated at the source and carried in the headers of outgoing packets. The switch reads the sending rate from the packet headers and estimates the average sending rate of all flows traversing the same outport, denoted aveRate. For simplicity, we assume that aveRate estimates the average sending rate accurately, which does not hold exactly in practice. However, this approximation does not change the fundamental behavior of marking packets from flows with relatively high sending rates, and it allows us to capture the dynamics of FECN. Then aveRate can be modeled as follows:

$$\begin{aligned} aveRate(t) = \frac{\sum _{i=1}^N r_i(t)}{N} \end{aligned}$$
(8)

where N is the number of flows traversing the same outport at the switch and \(r_i(t)\) is the sending rate of the ith flow at time t. In our realization, we use the congestion window w(t) as a proxy for the sending rate r(t) to reduce the overhead at the source side. The model of aveRate(t) then becomes:

$$\begin{aligned} aveRate(t) = \frac{\sum _{i=1}^N w_i(t)}{N} \end{aligned}$$
(9)
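
As an illustration, the switch-side estimate can be kept without per-flow state by folding each header-carried rate into an exponentially weighted moving average. The following Python sketch is hypothetical: the OutPort abstraction and the gain EWMA_G are our assumptions, not names or values from the paper.

```python
# Hypothetical sketch of the switch-side aveRate estimate (Eq. 9).
# FECN keeps no per-flow state, so instead of summing over flows we
# fold each header-carried rate into an EWMA; OutPort and EWMA_G are
# illustrative assumptions, not names or values from the paper.
EWMA_G = 0.1

class OutPort:
    def __init__(self):
        self.ave_rate = 0.0  # running estimate of the mean flow rate

    def on_packet(self, header_rate):
        """header_rate: the sender's congestion window carried in the
        packet header, used as a proxy for its sending rate."""
        if self.ave_rate == 0.0:
            self.ave_rate = header_rate
        else:
            self.ave_rate = (1 - EWMA_G) * self.ave_rate + EWMA_G * header_rate
        return self.ave_rate
```

An EWMA keeps a single counter per outport, so the memory cost is constant regardless of how many flows traverse the port.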

The marking probability of packets from ith flow can be modeled as follows:

$$p_{i}(t) = \begin{cases} 0 & q(t) < Kmin \\ 0 & Kmin \le q(t) < Kmax,\; w_{i}(t) < aveRate(t) \\ f(q(t)) & Kmin \le q(t) < Kmax,\; w_{i}(t) \ge aveRate(t) \\ 1 & q(t) \ge Kmax \end{cases}$$
(10)
$$\begin{aligned} f(q(t)) = Pmin+\frac{(q(t)-Kmin)(Pmax-Pmin)}{Kmax-Kmin} \end{aligned}$$
(11)

where q(t) is the instantaneous queue length at time t.
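
For concreteness, the marking rule in Eqs. (10) and (11) can be sketched directly; the threshold and probability values below are illustrative placeholders, not values from the paper.

```python
import random

# Direct sketch of the FECN marking rule (Eqs. 10-11). Threshold and
# probability values are illustrative placeholders, not from the paper.
KMIN, KMAX = 20, 80     # queue-length thresholds (packets)
PMIN, PMAX = 0.0, 1.0   # marking-probability range

def mark_probability(queue_len, flow_rate, ave_rate):
    """ECN marking probability for a packet of a flow with rate
    flow_rate, given the instantaneous queue length queue_len."""
    if queue_len < KMIN:
        return 0.0
    if queue_len >= KMAX:
        return 1.0
    if flow_rate < ave_rate:   # slower-than-average flows are spared
        return 0.0
    # linear RED-style interpolation between PMIN and PMAX (Eq. 11)
    return PMIN + (queue_len - KMIN) * (PMAX - PMIN) / (KMAX - KMIN)

def should_mark(queue_len, flow_rate, ave_rate):
    return random.random() < mark_probability(queue_len, flow_rate, ave_rate)
```

For example, with these placeholders a queue of 50 packets gives a flow at or above aveRate a marking probability of 0.5, while a slower flow is never marked in the same state.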

Source side The DCTCP source maintains an estimate of the fraction of its packets that are ECN-marked. This estimate, \(\alpha\), is updated once per RTT (or per window of data) as follows:

$$\begin{aligned} \alpha = (1-g)\alpha +gF \end{aligned}$$
(12)

where F is the fraction of packets that were marked in the most recent RTT, and g is a fixed parameter. DCTCP uses \(\alpha\) to reduce its window size in response to marked ACKs as follows:

$$\begin{aligned} w = (1-\frac{\alpha }{2})w \end{aligned}$$
(13)

Note that DCTCP reduces its window size at most once per RTT. For each ACK, DCTCP increases its window size like TCP, as follows:

$$\begin{aligned} w = w + \frac{1}{w} \end{aligned}$$
(14)
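
The source-side behavior in Eqs. (12)–(14) can be sketched as follows, assuming windows measured in packets and the commonly used DCTCP gain g = 1/16 (an assumption; the paper's simulations may use a different value).

```python
# Minimal sketch of the DCTCP source behavior in Eqs. (12)-(14).
# Windows are in packets; g = 1/16 is the commonly used DCTCP gain,
# assumed here rather than taken from the paper.
G = 1.0 / 16

class DctcpSource:
    def __init__(self, cwnd=10.0):
        self.cwnd = cwnd    # congestion window (packets)
        self.alpha = 0.0    # running estimate of the marked fraction

    def on_rtt_end(self, marked, total):
        """Update once per RTT (per window of data) with the number
        of ECN-marked ACKs seen in that window."""
        frac = marked / total if total else 0.0
        self.alpha = (1 - G) * self.alpha + G * frac       # Eq. (12)
        if marked:  # window is cut at most once per RTT
            self.cwnd = (1 - self.alpha / 2) * self.cwnd   # Eq. (13)

    def on_ack(self):
        # standard TCP-style additive increase, Eq. (14)
        self.cwnd += 1.0 / self.cwnd
```

Because the cut in Eq. (13) is scaled by \(\alpha\), light marking produces a gentle reduction while persistent marking approaches TCP's halving.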

State equations We choose w and \(\alpha\) at each source and q(t) at the switch as the state variables of the system. The evolution of these state variables describes the dynamics of the system. The evolution of q(t) can be modeled as follows:

$$\begin{aligned} \frac{dq(t)}{dt} = \frac{\sum _{i=1}^N w_i(t)}{R(t)} - C \end{aligned}$$
(15)

where R(t) is the RTT at time t, \(w_i(t)\) is the window size of the ith flow, and \(\frac{\sum _{i=1}^N w_i(t)}{R(t)}\) is the total input rate. R(t) can be modeled as follows:

$$\begin{aligned} R(t) = d+\frac{q(t)}{C} \end{aligned}$$
(16)

where d is the propagation delay (assumed to be equal for all flows), and q(t) / C is the queueing delay at the switch. The evolution of \(\alpha _i(t)\) at source i can be modeled as follows:

$$\begin{aligned} \begin{aligned} \frac{d\alpha _i(t)}{dt}&= \frac{\alpha _i(t)-\alpha _i(t-R(t))}{R(t)}\\&= \frac{(1-g)\alpha _i(t-R(t))+gF_i(t)-\alpha _i(t-R(t))}{R(t)}\\&= \frac{g}{R(t)}\bigg (F_i(t)-\alpha _i(t-R(t))\bigg ) \end{aligned} \end{aligned}.$$
(17)

The expected value of ECN marked packets is \(p_i(t-R(t))w_i(t-R(t))\) in the last RTT, and the fraction of marked packets can be estimated as \(F_i(t)=p_i(t-R(t))w_i(t-R(t))/w_i(t-R(t))=p_i(t-R(t)).\) Then we can obtain

$$\begin{aligned} \frac{d\alpha _i(t)}{dt} = \frac{g}{R(t)}\bigg (p_i(t-R(t))-\alpha _i(t-R(t))\bigg ) \end{aligned}$$
(18)

The evolution of \(w_i(t)\) can be modeled as follows:

$$\begin{aligned} \frac{dw_i(t)}{dt} = \frac{1}{R(t)}p_{noECN}+\bigg (\frac{1}{R(t)}-\frac{w_i(t) \alpha _i(t)}{2R(t)}\bigg )p_{hasECN} \end{aligned}$$
(19)
$$\begin{aligned} p_{noECN} = (1-p_i(t-R(t)))^{w_i(t-R(t))} \end{aligned}$$
(20)
$$\begin{aligned} p_{hasECN} = 1-(1-p_i(t-R(t)))^{w_i(t-R(t))} \end{aligned}$$
(21)

where \(p_{noECN}\) is the probability that none of the flow's packets were ECN-marked in the last RTT and \(p_{hasECN}\) is the probability that at least one packet was ECN-marked in the last RTT. Combining (19), (20), and (21), we can obtain

$$\begin{aligned} \frac{dw_i(t)}{dt} = \frac{1}{R(t)} - \frac{\alpha _i(t)w_i(t)}{2R(t)}\bigg (1-(1-p_i(t-R(t)))^{w_i(t-R(t))}\bigg ) \end{aligned}$$
(22)

This equation models the additive-increase and multiplicative-decrease behavior of DCTCP.
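
As a sanity check, the state equations can be integrated numerically. The sketch below uses forward Euler and, for simplicity, replaces the delayed terms x(t - R(t)) by their current values; every parameter (N, C, d, the thresholds, g, the step size) is an illustrative assumption, not a value from the paper.

```python
# Forward-Euler integration of the fluid model (Eqs. 15, 16, 18, 22).
# Delayed terms x(t - R(t)) are approximated by their current values,
# and all parameters below are illustrative assumptions.
N = 10              # number of flows
C = 100.0           # bottleneck capacity (packets per time unit)
D_PROP = 0.1        # propagation delay d
G = 1.0 / 16        # DCTCP gain g
KMIN, KMAX = 20.0, 80.0
DT = 0.001          # Euler step

def p_mark(q, w, ave):
    """FECN marking probability, Eq. (10), with Pmin = 0, Pmax = 1."""
    if q < KMIN:
        return 0.0
    if q >= KMAX:
        return 1.0
    if w < ave:
        return 0.0
    return (q - KMIN) / (KMAX - KMIN)

def simulate(steps=50000):
    w = [1.0] * N       # windows
    alpha = [0.0] * N   # marked-fraction estimates
    q = 0.0             # queue length
    for _ in range(steps):
        R = D_PROP + q / C                  # Eq. (16)
        ave = sum(w) / N                    # Eq. (9)
        dq = sum(w) / R - C                 # Eq. (15)
        for i in range(N):
            p = p_mark(q, w[i], ave)
            alpha[i] += DT * (G / R) * (p - alpha[i])           # Eq. (18)
            w[i] += DT * (1.0 / R
                          - alpha[i] * w[i] / (2 * R)
                          * (1 - (1 - p) ** w[i]))              # Eq. (22)
        q = max(0.0, q + DT * dq)
    return q, w
```

With symmetric flows the windows stay equal by construction, and one would expect the queue to settle between Kmin and Kmax while the aggregate rate hovers around C, consistent with the additive-increase, multiplicative-decrease behavior described above.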


About this article


Cite this article

Zhou, P., Yu, H., Sun, G. et al. Flow-aware explicit congestion notification for datacenter networks. Cluster Comput 22, 1431–1446 (2019). https://doi.org/10.1007/s10586-019-02919-z
