On limitations of network acceleration

ABSTRACT
The performance of large-scale data-intensive applications running on thousands of machines depends considerably on the performance of the network. To deliver better application performance on rapidly evolving high-bandwidth, low-latency interconnects, researchers have proposed the use of network accelerator devices. However, despite the initial enthusiasm, translating network accelerators' capabilities into high application performance remains challenging.
In this paper, we describe our experience and discuss the issues we uncovered with network acceleration using Remote Direct Memory Access (RDMA) capable network controllers (RNICs). RNICs offload packet processing entirely into the network controller and provide direct userspace access to the networking hardware. Our analysis shows that multiple (un)related factors significantly influence the performance gains seen by the end application. We identify factors that span the whole stack, ranging from low-level architectural issues (cache and DMA interaction, hardware prefetching) to high-level application parameters (buffer size, access pattern). We discuss the implications of our findings for application performance and for the future integration of network acceleration technology into systems.