Abstract
Remote direct memory access (RDMA) has become one of the state-of-the-art high-performance network technologies in datacenters. The reliable transport of RDMA is designed for a lossless underlying network and cannot tolerate a high packet loss rate. However, besides switch buffer overflow, there is another kind of packet loss in RDMA networks, i.e., packet corruption, which has not been discussed in depth. Packet corruption incurs long application tail latency by causing timeout retransmissions. The challenges in handling packet corruption in RDMA networks are twofold: 1) packet corruption is inevitable even with remedial mechanisms, and 2) RDMA hardware is not programmable. This paper proposes designs that guarantee the expected tail latency of applications in the presence of packet corruption. The key idea is to control the occurrence probability of timeout events caused by packet corruption by transforming timeout retransmissions into out-of-order retransmissions. We build a probabilistic model to estimate the occurrence probabilities and real effects of the corruption patterns. We implement the two mechanisms with the help of programmable switches and the zero-byte message RDMA feature, build an ns-3 simulation, and deploy the optimization mechanisms on our testbed. The simulation and testbed experiments show that the optimizations can decrease the flow completion time by several orders of magnitude with less than 3% bandwidth cost at different packet corruption rates.
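To make the key idea concrete, below is a minimal sketch, not the paper's actual probabilistic model: it assumes a go-back-N RDMA transport and independent (Bernoulli) per-packet corruption. Under these assumptions, a corrupted packet that is followed by later packets of the same message can be repaired by a fast out-of-order (NACK-triggered) retransmission, whereas a corrupted final packet leaves the receiver silent and falls back to a timeout. All function names and parameter values are hypothetical and for illustration only.

```python
# Illustrative sketch only: i.i.d. Bernoulli corruption, go-back-N recovery.
# This is NOT the model or implementation from the paper.

def any_corruption_probability(num_packets: int, corruption_rate: float) -> float:
    """Probability that at least one packet of the message is corrupted."""
    return 1.0 - (1.0 - corruption_rate) ** num_packets


def timeout_probability(corruption_rate: float) -> float:
    """Under this sketch, only a corrupted *last* packet forces a timeout,
    because no later packet arrives to reveal the gap; earlier corruptions
    are recovered by out-of-order retransmission."""
    return corruption_rate


def expected_completion_time(num_packets: int, corruption_rate: float,
                             per_packet_us: float, rto_us: float) -> float:
    """Rough expected flow completion time (microseconds): base transmission
    time plus the timeout penalty weighted by its probability.
    Retransmission bandwidth overhead is ignored."""
    base = num_packets * per_packet_us
    return base + timeout_probability(corruption_rate) * rto_us


if __name__ == "__main__":
    # Example: a 1000-packet message, 1e-5 corruption rate, 1 us per packet,
    # and a 1 ms retransmission timeout (all values are illustrative).
    print(any_corruption_probability(1000, 1e-5))   # corruption is not rare
    print(expected_completion_time(1000, 1e-5, 1.0, 1000.0))
```

The sketch illustrates why converting timeout retransmissions into out-of-order retransmissions matters: the probability that *some* packet is corrupted grows with message size, but the tail latency penalty is dominated by the (much rarer, timeout-bound) corruption patterns, which the proposed mechanisms aim to suppress.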
Cite this article
Gao, YX., Tian, C., Chen, W. et al. Analyzing and Optimizing Packet Corruption in RDMA Network. J. Comput. Sci. Technol. 37, 743–762 (2022). https://doi.org/10.1007/s11390-022-2123-8