FastUDP: a highly scalable user-level UDP framework in multi-core systems for fast packet I/O

The Journal of Supercomputing

Abstract

Many applications, such as network routers, distributed data processing engines, and firewalls, need to transfer packets at line rate. As data volumes grow, clusters in data centers suffer increasingly severe congestion caused by massive numbers of message packets. Building a high-performance streaming method for massive numbers of small message packets is fundamentally challenging. Although many approaches have been proposed, sending massive numbers of small packets over UDP in the traditional Linux kernel implementation remains inefficient: socket operations incur high overhead, scalability on multi-core systems is suboptimal, and multiple network interface card (NIC) ports are not supported. In this paper, we present FastUDP, a highly efficient and scalable user-level UDP network stack for multi-core systems. FastUDP addresses these inefficiencies with three novel designs: (1) an exclusive thread model to improve scalability; (2) poll mode and batched operations to increase computing resource utilization; and (3) a shared hugepage memory pool to eliminate context switch overhead. Moreover, to support high throughput, FastUDP proposes a novel work-queue-based approach that allows packets to be transferred concurrently over multiple NIC ports. On a 40-core machine, the evaluation shows that FastUDP improves packet transfer throughput by up to 13× and reduces packet transfer latency by up to 4.14× compared with the latest Linux (4.4.0) UDP stack. It also improves the performance of a realistic application (memcached) by 36 to 67% compared with the Linux stack.
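The abstract names polling, batching, and per-port work queues without giving implementation details. As a rough illustration of those general ideas only, the following single-threaded Python sketch spreads packets over per-port queues and drains each queue in batches by polling. It is not FastUDP's implementation: `PortQueue`, `distribute`, `run_until_idle`, and `BATCH_SIZE` are hypothetical names, the round-robin distribution is an assumption, and "sending" is simulated by recording batches in a list.

```python
from collections import deque

BATCH_SIZE = 4  # hypothetical batch size, chosen only for illustration


class PortQueue:
    """Work queue meant to be owned by exactly one worker (stand-in for one NIC port)."""

    def __init__(self, port_id):
        self.port_id = port_id
        self.pending = deque()   # packets waiting to be sent
        self.sent_batches = []   # record of simulated batched sends

    def poll_and_send(self, batch_size=BATCH_SIZE):
        """Poll once: take up to batch_size packets and 'send' them as one batch."""
        batch = []
        while self.pending and len(batch) < batch_size:
            batch.append(self.pending.popleft())
        if batch:
            self.sent_batches.append(batch)
        return len(batch)


def distribute(packets, queues):
    """Assumption: spread packets round-robin across the per-port queues."""
    for i, pkt in enumerate(packets):
        queues[i % len(queues)].pending.append(pkt)


def run_until_idle(queues):
    """Poll every queue in turn until all are empty; count polls that sent a batch."""
    busy_polls = 0
    while any(q.pending for q in queues):
        for q in queues:
            if q.poll_and_send() > 0:
                busy_polls += 1
    return busy_polls


if __name__ == "__main__":
    packets = [f"p{i}".encode() for i in range(10)]
    queues = [PortQueue(0), PortQueue(1)]
    distribute(packets, queues)
    print(run_until_idle(queues), [len(q.sent_batches) for q in queues])
```

In a real user-level stack each queue would typically be served by its own pinned thread, so no locking is needed on the queue, and a batch would be handed to the NIC in one operation rather than appended to a list.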

Notes

  1. Small message packets are transferred data packets smaller than 1 KB. Each packet is small, but such packets typically arrive in massive numbers.

Acknowledgements

This work was partially supported by Grant ZDBS-LY-JSC038 from the Key Research Program of Frontier Sciences, Chinese Academy of Sciences, and by Grants No. 62002350 and No. 61807033 from the National Natural Science Foundation of China. The authors thank Dr. Lijie Xu for suggestions on the paper.

Author information

Corresponding author

Correspondence to Hongjun Zhang.

About this article

Cite this article

Zhang, H., Zhang, H., Zhang, L. et al. FastUDP: a highly scalable user-level UDP framework in multi-core systems for fast packet I/O. J Supercomput 77, 5148–5175 (2021). https://doi.org/10.1007/s11227-020-03486-6

  • DOI: https://doi.org/10.1007/s11227-020-03486-6
