
Flow-aware explicit congestion notification for datacenter networks

Published in Cluster Computing

Abstract

Explicit congestion notification (ECN) has been widely adopted by recent proposals for building high-throughput, low-latency datacenter network transport. In these ECN-based proposals, when the queue length of a switch exceeds a pre-defined threshold, the switch marks all arriving packets with ECN to explicitly notify their senders to slow down. Such a design enables the network to eliminate congestion quickly. However, it marks packets without considering flow state, which may over-throttle flows, especially those that send only a few packets, resulting in significant throughput loss and long flow completion times. In this paper, we propose a novel flow-aware ECN marking approach (FECN), which improves throughput and flow completion time by taking flow state into consideration. By selectively marking packets according to their flow rates, FECN enables the network to precisely slow down high-speed flows to avoid congestion without penalizing low-speed short flows. Moreover, FECN does not require switches to maintain per-flow state, which yields low overhead and makes FECN easy to implement and deploy on commodity switches. Simulations show that FECN can shorten flow completion time by up to 44.7% and reduce throughput loss by up to 40.3% compared with a prior flow-agnostic ECN marking approach.



References

  1. Sun, G., Zhu, G., Yu, H., et al.: Cost-efficient service function chain orchestration for low-latency applications in NFV networks. IEEE Syst. J. (2018)

  2. Alizadeh, M., Yang, S., Sharif, M., Katti, S., et al.: pFabric: minimal near-optimal datacenter transport. In: Proc. SIGCOMM, pp. 435–446 (2013)

  3. Hoff, T.: Latency is everywhere and it costs you sales: how to crush it. http://highscalability.com/blog/2009/7/25/latency-is-everywhere-and-it-costs-you-sales-how-to-crush-it.html (2009)

  4. Sun, G., Li, Y., Vasilakos, A., Guizani, M.: Energy-efficient and traffic-aware service function chaining orchestration in multi-domain networks. Future Gener. Comput. Syst. 91, 347–360 (2019)

  5. Sun, G., Yu, H.: A new technique for efficient live migration of multiple virtual machines. Future Gener. Comput. Syst. 55, 74–86 (2016)

  6. Sun, G., Liao, D., Yu, H.: Live migration for multiple correlated virtual machines in cloud-based data centers. IEEE Trans. Serv. Comput. 11(2), 279–291 (2018)

  7. Munir, A., Qazi, I.: Minimizing flow completion times in data centers. In: Proc. IEEE INFOCOM, pp. 2157–2165 (2013)

  8. Alizadeh, M., Greenberg, A., Maltz, D., et al.: Data center TCP (DCTCP). In: Proc. SIGCOMM, pp. 63–74 (2010)

  9. Luo, S., Yu, H., Zhao, Y., Wang, S., Yu, S., Li, L.: Towards practical and near-optimal coflow scheduling for data center networks. IEEE Trans. Parallel Distrib. Syst. 27(11), 3366–3380 (2016)

  10. Zhu, Y., Eran, H., Firestone, D., Guo, C., et al.: Congestion control for large-scale RDMA deployments. In: Proc. SIGCOMM (2015)

  11. Wu, H., Ju, J., Lu, G., Guo, C., Xiong, Y., Zhang, Y.: Tuning ECN for data center networks. In: Proc. CoNEXT (2012)

  12. Bai, W., Chen, L., Chen, K., Wu, H.: Enabling ECN in multi-service multi-queue data centers. In: Proc. USENIX NSDI, pp. 537–549 (2016)

  13. Floyd, S., Jacobson, V.: Random early detection gateways for congestion avoidance. IEEE/ACM Trans. Netw. 1(4), 397–413 (1993)

  14. Shan, D., Ren, F.: Improving ECN marking scheme with micro-burst traffic in data center networks. In: Proc. INFOCOM (2017)

  15. The Network Simulator NS-3. https://www.nsnam.org/

  16. Lin, D., Morris, R.: Dynamics of random early detection. In: Proc. SIGCOMM, pp. 127–137 (1997)

  17. Zhao, Z., Jiang, Z., Lu, C., et al.: Towards coordinated congestion control and load balancing in datacenter networks. In: Proc. IEEE GLOBECOM (2013)

  18. Alizadeh, M., Kabbani, A., et al.: Less is more: trading a little bandwidth for ultra-low latency in the data center. In: Proc. USENIX NSDI (2012)

  19. Pan, R., Prabhakar, B., Psounis, K.: CHOKe: a stateless active queue management scheme for approximating fair bandwidth allocation. In: Proc. INFOCOM (2000)

  20. Lakshman, T., Wong, L.: SRED: stabilized RED. In: Proc. INFOCOM, pp. 1346–1355 (1999)

  21. Mittal, R., et al.: TIMELY: RTT-based congestion control for the datacenter. In: Proc. ACM SIGCOMM, pp. 537–550 (2015)

  22. Lee, C., Park, C.: DX: latency-based congestion control for datacenters. IEEE/ACM Trans. Netw. 25(1), 335–348 (2017)

  23. Zhao, Z., Li, Q., et al.: Reduce completion time and guarantee throughput by transport with slight congestion. In: Proc. IEEE ICC, pp. 1–6 (2016)

  24. Bai, W., Chen, K., et al.: Enabling ECN over generic packet scheduling. In: Proc. ACM CoNEXT, pp. 191–204 (2016)

  25. Wilson, C., Ballani, H.: Better never than late: meeting deadlines in datacenter networks. ACM SIGCOMM Comput. Commun. Rev. 41(4), 50–61 (2011)

  26. Hong, C., Caesar, M., Godfrey, P.: Finishing flows quickly with preemptive scheduling. ACM SIGCOMM Comput. Commun. Rev. 42(4), 127–138 (2012)

  27. RFC 791: Internet Protocol. https://tools.ietf.org/html/rfc791

  28. Nichols, K., Jacobson, V.: Controlling queue delay. Commun. ACM 55, 1–7 (2012)

  29. Lu, Y., et al.: Multi-path transport for RDMA in datacenters. In: Proc. USENIX NSDI (2018)

  30. Alizadeh, M., Yang, S., Sharif, M., Katti, S., McKeown, N., Prabhakar, B., Shenker, S.: pFabric: minimal near-optimal datacenter transport. In: Proc. ACM SIGCOMM (2013)

  31. Perry, J., Ousterhout, A., Balakrishnan, H., Shah, D., Fugal, H.: Fastpass: a centralized "zero-queue" datacenter network. In: Proc. ACM SIGCOMM (2014)

  32. Perry, J., Balakrishnan, H., Shah, D.: Flowtune: flowlet control for datacenter networks. In: Proc. USENIX NSDI (2017)

  33. Vamanan, B., Hasan, J., Vijaykumar, T.N.: Deadline-aware datacenter TCP (D2TCP). In: Proc. ACM SIGCOMM (2012)

  34. Gao, C., Lee, V.C.S., Li, K.: DemePro: DEcouple packet marking from enqueuing for multiple services with PROactive congestion control. IEEE Trans. Cloud Comput. (2017)

  35. Zats, D., et al.: DeTail: reducing the flow completion time tail in datacenter networks. In: Proc. ACM SIGCOMM (2012)

  36. Sun, G., Liao, D., Zhao, D., Sun, Z., Chang, V.: Towards provisioning hybrid virtual networks in federated cloud data centers. Future Gener. Comput. Syst. 87, 457–469 (2018)

  37. Alizadeh, M., Kabbani, A., Atikoglu, B., Prabhakar, B.: Stability analysis of QCN: the averaging principle. In: Proc. SIGMETRICS (2011)

  38. Alizadeh, M., Javanmard, A., Prabhakar, B.: Analysis of DCTCP: stability, convergence, and fairness. In: Proc. SIGMETRICS (2011)

  39. Cisco White Paper: Intelligent Buffer Management on Cisco Nexus 9000 Series Switches. https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/white-paper-c11-738488.html

  40. Lee, C., Nakagawa, Y., Hyoudou, K., Kobayashi, S., Shiraki, O., Shimizu, T.: Flow-aware congestion control to improve throughput under TCP incast in datacenter networks. In: Proc. IEEE COMPSAC, pp. 155–162 (2015)

  41. Sivaraman, A., et al.: Programmable packet scheduling at line rate. In: Proc. ACM SIGCOMM, pp. 44–57 (2016)

  42. Sharma, N., et al.: Approximating fair queueing on reconfigurable switches. In: Proc. USENIX NSDI (2018)


Acknowledgements

This research was partially supported by the National Natural Science Foundation of China (61571098), Fundamental Research Funds for the Central Universities (ZYGX2016J217), the 111 Project (B14039), and Fundamental Research Funds for the Central Universities (2682019CX61).

Author information


Corresponding author

Correspondence to Hongfang Yu.

Appendix


Before introducing our fluid model, we first briefly review FECN on the switch side and the DCTCP [8] algorithm on the source side. Then, following [37, 38], we model N greedy flows sharing a single bottleneck link of capacity C. We assume that all flows have the same RTT.

Switch side In FECN, the switch makes ECN marking decisions based on flows’ sending rates. The sending rate of each flow is calculated at the source and carried in the headers of outgoing packets. The switch reads the sending rate from the packet headers and estimates the average sending rate of all flows traversing the same outport, denoted aveRate. For simplicity, we assume that aveRate estimates the average sending rate accurately, which does not hold exactly in practice. However, this approximation does not change the fundamental behavior of marking packets from flows with relatively high sending rates, and it allows us to capture the dynamics of FECN. Then aveRate can be modeled as follows:

$$\begin{aligned} aveRate(t) = \frac{\sum _{i=1}^N r_i(t)}{N} \end{aligned}$$
(8)

where N is the number of flows traversing the same outport at the switch and \(r_i(t)\) is the sending rate of the ith flow at time t. In our realization, we use the congestion window w(t) as a proxy for the sending rate r(t) to reduce the overhead at the source side. The model of aveRate(t) then becomes:

$$\begin{aligned} aveRate(t) = \frac{\sum _{i=1}^N w_i(t)}{N} \end{aligned}$$
(9)
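
As an illustration, the switch-side estimate can be kept without per-flow state by folding each header-carried rate into an exponentially weighted moving average. The following Python sketch is hypothetical: the OutPort abstraction and the gain EWMA_G are our assumptions, not names or values from the paper.

```python
# Hypothetical sketch of the switch-side aveRate estimate (Eq. 9).
# FECN keeps no per-flow state, so instead of summing over flows we
# fold each header-carried rate into an EWMA; OutPort and EWMA_G are
# illustrative assumptions, not names or values from the paper.
EWMA_G = 0.1

class OutPort:
    def __init__(self):
        self.ave_rate = 0.0  # running estimate of the mean flow rate

    def on_packet(self, header_rate):
        """header_rate: the sender's congestion window carried in the
        packet header, used as a proxy for its sending rate."""
        if self.ave_rate == 0.0:
            self.ave_rate = header_rate
        else:
            self.ave_rate = (1 - EWMA_G) * self.ave_rate + EWMA_G * header_rate
        return self.ave_rate
```

An EWMA keeps a single counter per outport, so the memory cost is constant regardless of how many flows traverse the port.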

The marking probability of packets from ith flow can be modeled as follows:

$$p_{i}(t) = \begin{cases} 0 & q(t) < Kmin \\ 0 & Kmin \le q(t) < Kmax,\; w_{i}(t) < aveRate(t) \\ f(q(t)) & Kmin \le q(t) < Kmax,\; w_{i}(t) \ge aveRate(t) \\ 1 & q(t) \ge Kmax \end{cases}$$
(10)
$$\begin{aligned} f(q(t)) = Pmin+\frac{(q(t)-Kmin)(Pmax-Pmin)}{Kmax-Kmin} \end{aligned}$$
(11)

where q(t) is the instantaneous queue length at time t.
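
For concreteness, the marking rule in Eqs. (10) and (11) can be sketched directly; the threshold and probability values below are illustrative placeholders, not values from the paper.

```python
import random

# Direct sketch of the FECN marking rule (Eqs. 10-11). Threshold and
# probability values are illustrative placeholders, not from the paper.
KMIN, KMAX = 20, 80     # queue-length thresholds (packets)
PMIN, PMAX = 0.0, 1.0   # marking-probability range

def mark_probability(queue_len, flow_rate, ave_rate):
    """ECN marking probability for a packet of a flow with rate
    flow_rate, given the instantaneous queue length queue_len."""
    if queue_len < KMIN:
        return 0.0
    if queue_len >= KMAX:
        return 1.0
    if flow_rate < ave_rate:   # slower-than-average flows are spared
        return 0.0
    # linear RED-style interpolation between PMIN and PMAX (Eq. 11)
    return PMIN + (queue_len - KMIN) * (PMAX - PMIN) / (KMAX - KMIN)

def should_mark(queue_len, flow_rate, ave_rate):
    return random.random() < mark_probability(queue_len, flow_rate, ave_rate)
```

For example, with these placeholders a queue of 50 packets gives a flow at or above aveRate a marking probability of 0.5, while a slower flow is never marked in the same state.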

Source side The DCTCP source maintains an estimate of the fraction of its packets that are ECN-marked. This estimate, \(\alpha\), is updated once per RTT (or per window of data) as follows:

$$\begin{aligned} \alpha = (1-g)\alpha +gF \end{aligned}$$
(12)

where F is the fraction of packets that were marked in the most recent RTT, and g is a fixed parameter. DCTCP uses \(\alpha\) to reduce its window size in response to marked ACKs as follows:

$$\begin{aligned} w = (1-\frac{\alpha }{2})w \end{aligned}$$
(13)

Note that DCTCP reduces its window size at most once per RTT. For each ACK, DCTCP increases its window size like TCP, as follows:

$$\begin{aligned} w = w + \frac{1}{w} \end{aligned}$$
(14)
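
The source-side behavior in Eqs. (12)–(14) can be sketched as follows, assuming windows measured in packets and the commonly used DCTCP gain g = 1/16 (an assumption; the paper's simulations may use a different value).

```python
# Minimal sketch of the DCTCP source behavior in Eqs. (12)-(14).
# Windows are in packets; g = 1/16 is the commonly used DCTCP gain,
# assumed here rather than taken from the paper.
G = 1.0 / 16

class DctcpSource:
    def __init__(self, cwnd=10.0):
        self.cwnd = cwnd    # congestion window (packets)
        self.alpha = 0.0    # running estimate of the marked fraction

    def on_rtt_end(self, marked, total):
        """Update once per RTT (per window of data) with the number
        of ECN-marked ACKs seen in that window."""
        frac = marked / total if total else 0.0
        self.alpha = (1 - G) * self.alpha + G * frac       # Eq. (12)
        if marked:  # window is cut at most once per RTT
            self.cwnd = (1 - self.alpha / 2) * self.cwnd   # Eq. (13)

    def on_ack(self):
        # standard TCP-style additive increase, Eq. (14)
        self.cwnd += 1.0 / self.cwnd
```

Because the cut in Eq. (13) is scaled by \(\alpha\), light marking produces a gentle reduction while persistent marking approaches TCP's halving.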

State equations We choose w and \(\alpha\) at each source and q(t) at the switch as the state variables of the system. The evolution of these state variables describes the dynamics of the system. The evolution of q(t) can be modeled as follows:

$$\begin{aligned} \frac{dq(t)}{dt} = \frac{\sum _{i=1}^N w_i(t)}{R(t)} - C \end{aligned}$$
(15)

where R(t) is the RTT at time t, \(w_i(t)\) is the window size of the ith flow, and \(\frac{\sum _{i=1}^N w_i(t)}{R(t)}\) is the total input rate. R(t) can be modeled as follows:

$$\begin{aligned} R(t) = d+\frac{q(t)}{C} \end{aligned}$$
(16)

where d is the propagation delay (assumed to be equal for all flows), and q(t) / C is the queueing delay at the switch. The evolution of \(\alpha _i(t)\) at source i can be modeled as follows:

$$\begin{aligned} \begin{aligned} \frac{d\alpha _i(t)}{dt}&= \frac{\alpha _i(t)-\alpha _i(t-R(t))}{R(t)}\\&= \frac{(1-g)\alpha _i(t-R(t))+gF_i(t)-\alpha _i(t-R(t))}{R(t)}\\&= \frac{g}{R(t)}\bigg (F_i(t)-\alpha _i(t-R(t))\bigg ) \end{aligned} \end{aligned}.$$
(17)

The expected value of ECN marked packets is \(p_i(t-R(t))w_i(t-R(t))\) in the last RTT, and the fraction of marked packets can be estimated as \(F_i(t)=p_i(t-R(t))w_i(t-R(t))/w_i(t-R(t))=p_i(t-R(t)).\) Then we can obtain

$$\begin{aligned} \frac{d\alpha _i(t)}{dt} = \frac{g}{R(t)}\bigg (p_i(t-R(t))-\alpha _i(t-R(t))\bigg ) \end{aligned}$$
(18)

The evolution of \(w_i(t)\) can be modeled as follows:

$$\begin{aligned} \frac{dw_i(t)}{dt} = \frac{1}{R(t)}p_{noECN}+\bigg (\frac{1}{R(t)}-\frac{w_i(t) \alpha _i(t)}{2R(t)}\bigg )p_{hasECN} \end{aligned}$$
(19)
$$\begin{aligned} p_{noECN} = (1-p_i(t-R(t)))^{w_i(t-R(t))} \end{aligned}$$
(20)
$$\begin{aligned} p_{hasECN} = 1-(1-p_i(t-R(t)))^{w_i(t-R(t))} \end{aligned}$$
(21)

where \(p_{noECN}\) is the probability that none of the flow's packets were ECN-marked in the last RTT and \(p_{hasECN}\) is the probability that at least one packet was ECN-marked in the last RTT. Combining (19), (20), and (21), we can obtain

$$\begin{aligned} \frac{dw_i(t)}{dt} = \frac{1}{R(t)} - \frac{\alpha _i(t)w_i(t)}{2R(t)}\bigg (1-(1-p_i(t-R(t)))^{w_i(t-R(t))}\bigg ) \end{aligned}$$
(22)

This equation models the additive-increase and multiplicative-decrease behavior of DCTCP.
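
As a sanity check, the state equations can be integrated numerically. The sketch below uses forward Euler and, for simplicity, replaces the delayed terms x(t - R(t)) by their current values; every parameter (N, C, d, the thresholds, g, the step size) is an illustrative assumption, not a value from the paper.

```python
# Forward-Euler integration of the fluid model (Eqs. 15, 16, 18, 22).
# Delayed terms x(t - R(t)) are approximated by their current values,
# and all parameters below are illustrative assumptions.
N = 10              # number of flows
C = 100.0           # bottleneck capacity (packets per time unit)
D_PROP = 0.1        # propagation delay d
G = 1.0 / 16        # DCTCP gain g
KMIN, KMAX = 20.0, 80.0
DT = 0.001          # Euler step

def p_mark(q, w, ave):
    """FECN marking probability, Eq. (10), with Pmin = 0, Pmax = 1."""
    if q < KMIN:
        return 0.0
    if q >= KMAX:
        return 1.0
    if w < ave:
        return 0.0
    return (q - KMIN) / (KMAX - KMIN)

def simulate(steps=50000):
    w = [1.0] * N       # windows
    alpha = [0.0] * N   # marked-fraction estimates
    q = 0.0             # queue length
    for _ in range(steps):
        R = D_PROP + q / C                  # Eq. (16)
        ave = sum(w) / N                    # Eq. (9)
        dq = sum(w) / R - C                 # Eq. (15)
        for i in range(N):
            p = p_mark(q, w[i], ave)
            alpha[i] += DT * (G / R) * (p - alpha[i])           # Eq. (18)
            w[i] += DT * (1.0 / R
                          - alpha[i] * w[i] / (2 * R)
                          * (1 - (1 - p) ** w[i]))              # Eq. (22)
        q = max(0.0, q + DT * dq)
    return q, w
```

With symmetric flows the windows stay equal by construction, and one would expect the queue to settle between Kmin and Kmax while the aggregate rate hovers around C, consistent with the additive-increase, multiplicative-decrease behavior described above.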


About this article


Cite this article

Zhou, P., Yu, H., Sun, G. et al. Flow-aware explicit congestion notification for datacenter networks. Cluster Comput 22, 1431–1446 (2019). https://doi.org/10.1007/s10586-019-02919-z
