Abstract
Many applications, such as network routers, distributed data processing engines, and firewalls, need to transfer packets at line rate. As data volumes grow, clusters in data centers suffer increasingly severe congestion from massive numbers of message packets. Building a high-performance streaming method for massive small message packets is fundamentally challenging. Although many works have been proposed to address these shortcomings, sending massive small packets via UDP in the traditional Linux kernel implementation remains inefficient: socket operations incur high overhead, scalability on multi-core systems is suboptimal, and multiple network interface card (NIC) ports are not supported. In this paper, we present FastUDP, a highly efficient and scalable user-level UDP-based network stack optimization for multi-core systems. FastUDP addresses these inefficiencies with three novel designs: (1) an exclusive thread model to improve scalability; (2) poll mode and batched operations to increase computing resource utilization; (3) a shared hugepage memory pool to eliminate context switch overhead. Moreover, to support high throughput, FastUDP proposes a novel work-queue-based approach that allows packets to be transferred concurrently over multiple NIC ports. On a 40-core machine, our evaluation shows that FastUDP improves packet transfer throughput by up to 13× and reduces packet transfer latency by up to 4.14× compared to the latest Linux (4.4.0) UDP stack. It also improves the performance of a realistic application (memcached) by 36 to 67% compared to the Linux stack.
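The combination of per-port work queues, polling, and batching named above can be illustrated with a small sketch. This is not FastUDP's implementation: the class `PortQueue`, the functions `distribute` and `run_until_idle`, and the batch size of 32 are all hypothetical names and values chosen for illustration, and a single loop stands in for the dedicated per-queue threads that an exclusive thread model would use.

```python
from collections import deque

BATCH_SIZE = 32  # assumed batch size, not a value taken from the paper


class PortQueue:
    """Hypothetical work queue bound to one NIC port."""

    def __init__(self, port_id):
        self.port_id = port_id
        self.pending = deque()   # packets waiting to be sent
        self.sent_batches = []   # stands in for the NIC transmit ring

    def enqueue(self, packet):
        self.pending.append(packet)

    def poll_once(self):
        """Drain up to BATCH_SIZE packets in one pass; return how many."""
        batch = []
        while self.pending and len(batch) < BATCH_SIZE:
            batch.append(self.pending.popleft())
        if batch:
            self.sent_batches.append(batch)  # one "transmit" per batch
        return len(batch)


def distribute(packets, queues):
    """Spread packets over the port queues round-robin."""
    for i, packet in enumerate(packets):
        queues[i % len(queues)].enqueue(packet)


def run_until_idle(queues):
    """Poll every queue repeatedly until all are empty; return packets sent."""
    total = 0
    while True:
        sent = sum(q.poll_once() for q in queues)
        if sent == 0:
            return total
        total += sent
```

For example, calling `distribute` with 100 small packets and two `PortQueue` objects, then `run_until_idle`, sends each queue's 50 packets in two batches instead of 50 individual operations. Amortizing per-packet cost in this way is the general idea behind batched operation.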
Notes
By small message packets we mean transferred data packets smaller than 1 KB. Each packet is small, but such packets typically arrive in massive numbers.
Acknowledgements
This work was partially supported by Grant ZDBS-LY-JSC038 from the Key Research Program of Frontier Sciences, Chinese Academy of Sciences, and by Grants No. 62002350 and No. 61807033 from the National Natural Science Foundation of China. The authors thank Dr. Lijie Xu for his suggestions on the paper.
Cite this article
Zhang, H., Zhang, H., Zhang, L. et al. FastUDP: a highly scalable user-level UDP framework in multi-core systems for fast packet I/O. J Supercomput 77, 5148–5175 (2021). https://doi.org/10.1007/s11227-020-03486-6
DOI: https://doi.org/10.1007/s11227-020-03486-6