research-article

Leveraging SIMD parallelism for accelerating network applications

Authors:

Dongsu Han

APNet '20: Proceedings of the 4th Asia-Pacific Workshop on Networking

Pages 23 - 29

https://doi.org/10.1145/3411029.3411033

Published: 11 August 2020

Abstract

Software packet processing frameworks are critical components of modern network architectures, as their performance has a vital impact on the quality of network services. Motivated by the increasing number and capability of advanced vector instructions in recent mainstream CPUs, this paper explores a new parallel processing design and implementation of data structures and algorithms that are frequently used for building network applications. In particular, we propose effective SIMD optimization techniques for the Bloom filter and the Open vSwitch megaflow cache. Our approach reduces memory access latency via careful prefetching and a new design that meets the needs of fast data-consuming instructions. Our evaluation shows performance improvements of up to 162% for the Bloom filter and 48% for Open vSwitch compared to their scalar versions.
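To make the prefetching idea in the abstract concrete, the following is a minimal sketch of a batched Bloom filter lookup in C. It is not the paper's design: the filter size, the number of hash functions, the batch size, the `mix32` hash, and every function name are assumptions chosen for illustration, and the loops are scalar. The sketch only shows the general two-phase pattern: compute the bit positions for a whole batch of keys and prefetch their cache lines, then probe the bits once the loads are in flight. A SIMD variant would replace the per-key loops with vector instructions. `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   (1u << 20)  /* filter size in bits, power of two (assumed) */
#define BLOOM_HASHES 4           /* number of hash functions (assumed) */
#define BLOOM_BATCH  8           /* keys per batch (assumed) */

typedef struct {
    uint64_t words[BLOOM_BITS / 64];
} bloom_t;

/* Hypothetical 32-bit mixer; any well-distributed hash would do. */
static uint32_t mix32(uint32_t x, uint32_t seed)
{
    x ^= seed;
    x ^= x >> 16; x *= 0x85ebca6bu;
    x ^= x >> 13; x *= 0xc2b2ae35u;
    x ^= x >> 16;
    return x;
}

/* i-th bit position of a key, using double hashing: (h1 + i*h2) mod m. */
static uint32_t bloom_pos(uint32_t key, unsigned i)
{
    uint32_t h1 = mix32(key, 0x9e3779b9u);
    uint32_t h2 = mix32(key, 0x7f4a7c15u) | 1u;  /* force an odd stride */
    return (h1 + i * h2) & (BLOOM_BITS - 1);
}

static void bloom_init(bloom_t *bf)
{
    memset(bf, 0, sizeof *bf);
}

static void bloom_add(bloom_t *bf, uint32_t key)
{
    for (unsigned i = 0; i < BLOOM_HASHES; i++) {
        uint32_t p = bloom_pos(key, i);
        bf->words[p >> 6] |= (uint64_t)1 << (p & 63);
    }
}

/* Look up n <= BLOOM_BATCH keys; out[j] is 1 if keys[j] may be present. */
static void bloom_query_batch(const bloom_t *bf, const uint32_t *keys,
                              size_t n, uint8_t *out)
{
    uint32_t pos[BLOOM_BATCH][BLOOM_HASHES];

    assert(n <= BLOOM_BATCH);

    /* Phase 1: compute every bit position and prefetch its cache line,
     * so the memory accesses of the whole batch overlap. */
    for (size_t j = 0; j < n; j++) {
        for (unsigned i = 0; i < BLOOM_HASHES; i++) {
            pos[j][i] = bloom_pos(keys[j], i);
            __builtin_prefetch(&bf->words[pos[j][i] >> 6], 0, 0);
        }
    }

    /* Phase 2: test the bits; a key is reported only if all its bits are set. */
    for (size_t j = 0; j < n; j++) {
        uint8_t hit = 1;
        for (unsigned i = 0; i < BLOOM_HASHES; i++) {
            uint32_t p = pos[j][i];
            hit &= (uint8_t)((bf->words[p >> 6] >> (p & 63)) & 1u);
        }
        out[j] = hit;
    }
}
```

Separating hashing from probing is what gives the prefetches time to complete before the bits are read. Whether this helps in practice depends on the filter size relative to the cache and on the batch size, so the constants above should be treated as placeholders to tune.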




Published In


APNet '20: Proceedings of the 4th Asia-Pacific Workshop on Networking

August 2020

57 pages

ISBN:9781450388764

DOI:10.1145/3411029

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article
Research
Refereed limited

Conference

APNet '20

APNet '20: 4th Asia-Pacific Workshop on Networking

August 3 - 4, 2020

Seoul, Republic of Korea

Acceptance Rates

Overall Acceptance Rate 50 of 118 submissions, 42%

