FastUDP: a highly scalable user-level UDP framework in multi-core systems for fast packet I/O

Zhang, Hongjun; Zhang, Heng; Zhang, Libo; Wu, Yanjun

doi:10.1007/s11227-020-03486-6


Published: 03 November 2020

Volume 77, pages 5148–5175, (2021)

The Journal of Supercomputing

Hongjun Zhang¹,² (ORCID: 0000-0002-5990-9230),
Heng Zhang¹,
Libo Zhang¹ &
Yanjun Wu¹


Abstract

Many applications, such as network routers, distributed data processing engines, and firewalls, need to transfer packets at line rate. As data volumes grow, clusters in data centers suffer increasingly severe congestion from massive numbers of message packets. Building a high-performance streaming method for massive numbers of small message packets is fundamentally challenging. Although many works have been proposed to address these shortcomings, sending massive numbers of small packets over UDP in the traditional Linux kernel implementation remains inefficient: socket operations incur high overhead, scalability on multi-core systems is suboptimal, and multiple network interface card (NIC) ports are not supported. In this paper, we present FastUDP, a highly efficient and scalable user-level UDP-based network stack optimization for multi-core systems. FastUDP addresses these inefficiencies with three novel designs: (1) an exclusive thread model that improves scalability; (2) poll mode and batched operations that increase computing resource utilization; and (3) a shared hugepage memory pool that eliminates context switch overhead. Moreover, to support high throughput, FastUDP proposes a novel work-queue-based approach that allows packets to be transferred concurrently over multiple NIC ports. On a 40-core machine, our evaluation shows that FastUDP improves packet transfer throughput by up to 13× and reduces packet transfer latency by up to 4.14× compared to the latest Linux (4.4.0) UDP stack. It also improves the performance of a realistic application (memcached) by 36 to 67% compared to the Linux stack.
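To make the first two design points more concrete, the following is a minimal sketch of how an exclusive worker might busy-poll its own queue and process packets in batches. It is an illustration only, not FastUDP's implementation: the names `poll_batch`, `run_worker`, and `BATCH_SIZE`, the burst size of 32, and the use of in-memory deques in place of NIC receive queues are all assumptions made for this example.

```python
from collections import deque

BATCH_SIZE = 32  # hypothetical burst size, not taken from the paper


def poll_batch(rx_queue, batch_size=BATCH_SIZE):
    """Non-blocking poll: take up to batch_size packets, or [] if none are ready."""
    batch = []
    while rx_queue and len(batch) < batch_size:
        batch.append(rx_queue.popleft())
    return batch


def run_worker(rx_queue, handle, max_polls):
    """Busy-poll a queue owned exclusively by this worker.

    Returns (packets processed, number of non-empty polls).
    """
    processed = 0
    bursts = 0
    for _ in range(max_polls):
        batch = poll_batch(rx_queue)
        if not batch:
            continue  # poll mode: keep spinning instead of sleeping on an interrupt
        bursts += 1
        for pkt in batch:
            handle(pkt)
            processed += 1
    return processed, bursts


# One private queue per worker, so workers share no state and need no locks.
# Here deques stand in for per-core NIC queues (an assumption of this sketch).
queues = [deque(f"pkt-{w}-{i}".encode() for i in range(100)) for w in range(2)]
seen = []
results = [run_worker(q, seen.append, max_polls=10) for q in queues]
```

In this sketch each worker drains its 100 packets in four bursts rather than 100 separate receive operations, which is the kind of amortization that batching is meant to provide. A real user-level stack would pin each worker to a core and poll hardware queues; that part is omitted here.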


Notes

A small packet here refers to a transferred data packet smaller than 1 KB. Such packets are individually small but typically arrive in massive numbers.


Acknowledgements

This work was partially supported by Grant ZDBS-LY-JSC038 from the Key Research Program of Frontier Sciences, Chinese Academy of Sciences, and by Grants No. 62002350 and No. 61807033 from the National Natural Science Foundation of China. The authors thank Dr. Lijie Xu for suggestions on the paper.

Author information

Authors and Affiliations

Institute of Software, Chinese Academy of Sciences, Beijing, China
Hongjun Zhang, Heng Zhang, Libo Zhang & Yanjun Wu
University of Chinese Academy of Sciences, Beijing, China
Hongjun Zhang


Corresponding author

Correspondence to Hongjun Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Zhang, H., Zhang, H., Zhang, L. et al. FastUDP: a highly scalable user-level UDP framework in multi-core systems for fast packet I/O. J Supercomput 77, 5148–5175 (2021). https://doi.org/10.1007/s11227-020-03486-6


Accepted: 22 October 2020
Published: 03 November 2020
Issue Date: May 2021
DOI: https://doi.org/10.1007/s11227-020-03486-6
