ABSTRACT
Modern cloud computing infrastructures are steadily pushing the performance of their network stacks. At the hardware level, some cloud providers have already upgraded parts of their networks to 10GbE. At the same time, there is a continuous effort within the cloud community to improve network performance inside the virtualization layers. The low-latency/high-throughput properties of these network interfaces not only open the cloud to HPC applications; they will also benefit traditional large-scale web applications and data processing frameworks. However, as commodity networks get faster, the burden on the end hosts increases. Inefficient memory copying in socket-based networking accounts for a significant fraction of the end-to-end latency and also creates serious CPU load on the host machine. Years ago, the supercomputing community developed RDMA network stacks such as InfiniBand that offer both low end-to-end latency and a low CPU footprint. While adopting RDMA in the commodity cloud environment is difficult (it is costly and requires special hardware), we argue in this paper that most of the benefits of RDMA can in fact be provided in software. To demonstrate our findings, we have implemented and evaluated a prototype of a software-based RDMA stack. Compared to a socket/TCP approach (with TCP receive copy offload), our results show a significant reduction in end-to-end latency for messages larger than a modest 64 kB, and a reduction in CPU load (without TCP receive copy offload) while saturating the 10 Gbit/s link.
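To make the contrast in the abstract concrete, the sketch below shows the data path that avoids the receive-side copy. Software RDMA stacks can expose the standard verbs API (libibverbs); we assume such an interface here. The socket path copies the payload from the user buffer into kernel socket buffers on send() and out of them again on recv(); the RDMA path registers the buffer once and posts a one-sided write that places data directly into the remote application's memory. Connection setup and the out-of-band exchange of the peer's address and rkey are assumed to have happened already; the function name and parameters are ours for illustration, not taken from the paper's prototype.

```c
/* Sketch: a zero-copy, one-sided RDMA write via libibverbs.
 * Assumes the queue pair `qp` is connected, `buf` was registered with
 * ibv_reg_mr() yielding `mr`, and the peer's buffer address `remote_addr`
 * and registration key `rkey` were exchanged out of band. */
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

static int rdma_write_zero_copy(struct ibv_qp *qp, struct ibv_mr *mr,
                                void *buf, size_t len,
                                uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,     /* local registered buffer */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,           /* local key from ibv_reg_mr() */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,  /* one-sided: no recv() copy on the peer */
        .send_flags = IBV_SEND_SIGNALED,  /* request a completion entry */
        .wr.rdma    = {
            .remote_addr = remote_addr,   /* target address in peer memory */
            .rkey        = rkey,          /* peer's registration key */
        },
    };
    struct ibv_send_wr *bad_wr = NULL;

    /* The stack (an RNIC, or a software implementation over TCP) moves
     * the bytes; neither end copies them through socket buffers. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

A software implementation can provide these one-sided semantics over a commodity TCP transport: the receive side places incoming payloads directly into the pre-registered application buffer instead of copying them out of socket buffers, which is where the latency and CPU-load savings reported above come from.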