LTNoT: Realizing the Trade-Offs Between Latency and Throughput in NVMe over TCP

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13777))

Abstract

NVMe over Fabrics (NVMeoF) is an emerging storage disaggregation protocol designed for datacenters with high-performance NVMe SSDs and interconnection networks. However, existing NVMeoF implementations cannot meet the differentiated I/O demands of the diverse applications running in datacenters. These applications usually show significantly different I/O characteristics and requirements: some applications (L-apps) are sensitive to latency, while others (T-apps) demand high throughput from storage systems. When L-apps and T-apps access remote NVMe SSDs via the same NVMeoF storage network, state-of-the-art NVMeoF implementations treat the I/O requests issued by these applications equally and handle them along the same I/O path, which ultimately incurs severe I/O interference between L-apps and T-apps.

In this paper, we propose LTNoT, an end-to-end packet processing scheme with dedicated I/O pipelines for L-apps and T-apps in the NVMe over TCP (NoT) implementation. Specifically, LTNoT separates T-app and L-app resources in each NVMeoF queue pair to achieve inter-queue I/O isolation, transfers capsules and data in batches along the T-app pipeline to achieve interrupt coalescing, and introduces immediate-delivery and workqueue-priority mechanisms to optimize L-app request processing. We implemented LTNoT in the Linux kernel and evaluated it using real-world benchmarks and applications. Our experimental results demonstrate that LTNoT achieves 48.13% and 53.38% lower L-app latency than i10 and NoT, respectively, and increases bandwidth over NoT by up to 33.31% on average. LTNoT thus effectively alleviates the I/O interference issue in NVMe over TCP without negatively impacting either L-apps or T-apps.
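The core idea described above — per-queue-pair pipeline separation, with batched transfers for T-apps and immediate delivery for L-apps — can be illustrated with a toy model. This is a minimal sketch, not the paper's actual kernel implementation; the class name, `batch_size` threshold, and method names are all hypothetical:

```python
from collections import deque

class LTNoTQueuePair:
    """Toy model of LTNoT's split I/O pipelines (hypothetical API):
    L-app requests bypass batching; T-app requests are coalesced."""

    def __init__(self, batch_size=4):
        self.batch_size = batch_size  # T-app coalescing threshold (assumed value)
        self.t_batch = deque()        # pending T-app capsules
        self.sent = []                # capsule groups handed to the TCP socket

    def submit(self, req_id, latency_sensitive):
        if latency_sensitive:
            # immediate-delivery: put the L-app capsule on the wire right away
            self.sent.append([req_id])
        else:
            # interrupt-coalescing analogue: accumulate T-app capsules,
            # then send them as one batch once the threshold is reached
            self.t_batch.append(req_id)
            if len(self.t_batch) >= self.batch_size:
                self.flush()

    def flush(self):
        if self.t_batch:
            self.sent.append(list(self.t_batch))
            self.t_batch.clear()

qp = LTNoTQueuePair(batch_size=3)
qp.submit(1, latency_sensitive=False)
qp.submit(2, latency_sensitive=True)   # L-app request jumps ahead of the batch
qp.submit(3, latency_sensitive=False)
qp.submit(4, latency_sensitive=False)  # third T-app request triggers a flush
print(qp.sent)  # [[2], [1, 3, 4]]
```

The sketch shows why the separation matters: the latency-sensitive request is dispatched immediately rather than waiting behind queued T-app capsules, while T-app capsules still amortize per-send overhead by traveling as a batch.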



Acknowledgements

We would like to thank the ICA3PP reviewers for their insightful feedback. This work was supported in part by the Excellent Youth Foundation of Hunan Province under Grant No. 2021JJ10050 and the Science Foundation of NUDT under Grant ZK21-03.

Author information

Correspondence to Xuchao Xie or Dezun Dong.


Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Gu, W., Xie, X., Dong, D. (2023). LTNoT: Realizing the Trade-Offs Between Latency and Throughput in NVMe over TCP. In: Meng, W., Lu, R., Min, G., Vaidya, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2022. Lecture Notes in Computer Science, vol 13777. Springer, Cham. https://doi.org/10.1007/978-3-031-22677-9_22

  • DOI: https://doi.org/10.1007/978-3-031-22677-9_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-22676-2

  • Online ISBN: 978-3-031-22677-9

  • eBook Packages: Computer Science, Computer Science (R0)
