On limitations of network acceleration

ABSTRACT
The performance of large-scale data-intensive applications running on thousands of machines depends considerably on the performance of the network. To deliver better application performance on rapidly evolving high-bandwidth, low-latency interconnects, researchers have proposed the use of network accelerator devices. However, despite the initial enthusiasm, translating network accelerators' capabilities into high application performance remains challenging.
In this paper, we describe our experience and discuss the issues we uncovered with network acceleration using Remote Direct Memory Access (RDMA) capable network controllers (RNICs). RNICs offload packet processing entirely into the network controller and provide direct userspace access to the networking hardware. Our analysis shows that multiple (un)related factors significantly influence the performance gains seen by the end application. We identify factors that span the whole stack, ranging from low-level architectural issues (cache and DMA interaction, hardware prefetching) to high-level application parameters (buffer size, access pattern). We discuss the implications of our findings for application performance and for the future integration of network acceleration technology into systems.