LTNoT: Realizing the Trade-Offs Between Latency and Throughput in NVMe over TCP

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13777))

Abstract

NVMe over Fabrics (NVMeoF) is an emerging storage disaggregation protocol designed for datacenters with high-performance NVMe SSDs and interconnection networks. However, existing NVMeoF implementations cannot meet the differentiated I/O demands of the diverse applications running in datacenters. These applications usually show significantly different I/O characteristics and requirements: some applications (L-apps) are sensitive to latency, while others (T-apps) demand high throughput from storage systems. When L-apps and T-apps access remote NVMe SSDs via the same NVMeoF storage network, state-of-the-art NVMeoF implementations treat the I/O requests issued by these applications equally and handle them along the same I/O path, which ultimately incurs severe I/O interference between L-apps and T-apps.

In this paper, we propose LTNoT, an end-to-end packet processing scheme with dedicated I/O pipelines for L-apps and T-apps in the NVMe over TCP (NoT) implementation. Specifically, LTNoT separates T-app and L-app resources in each NVMeoF queue pair to achieve inter-queue I/O isolation, transfers capsules and data in batches along the T-app pipeline to achieve interrupt coalescing, and introduces immediate-delivery and workqueue-priority mechanisms to optimize L-app request processing. We implemented LTNoT in the Linux kernel and evaluated it using real-world benchmarks and applications. Our experimental results demonstrate that LTNoT achieves 48.13% and 53.38% lower L-app latency than i10 and NoT, respectively, and increases bandwidth over NoT by up to 33.31% on average. LTNoT thus effectively alleviates the I/O interference issue in NVMe over TCP without negatively impacting either L-apps or T-apps.
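The core idea described above — per-queue-pair pipeline separation, with batched transfers for T-apps and immediate delivery for L-apps — can be illustrated with a toy model. This is a minimal sketch, not the paper's actual kernel implementation; the class name, `batch_size` threshold, and method names are all hypothetical:

```python
from collections import deque

class LTNoTQueuePair:
    """Toy model of LTNoT's split I/O pipelines (hypothetical API):
    L-app requests bypass batching; T-app requests are coalesced."""

    def __init__(self, batch_size=4):
        self.batch_size = batch_size  # T-app coalescing threshold (assumed value)
        self.t_batch = deque()        # pending T-app capsules
        self.sent = []                # capsule groups handed to the TCP socket

    def submit(self, req_id, latency_sensitive):
        if latency_sensitive:
            # immediate-delivery: put the L-app capsule on the wire right away
            self.sent.append([req_id])
        else:
            # interrupt-coalescing analogue: accumulate T-app capsules,
            # then send them as one batch once the threshold is reached
            self.t_batch.append(req_id)
            if len(self.t_batch) >= self.batch_size:
                self.flush()

    def flush(self):
        if self.t_batch:
            self.sent.append(list(self.t_batch))
            self.t_batch.clear()

qp = LTNoTQueuePair(batch_size=3)
qp.submit(1, latency_sensitive=False)
qp.submit(2, latency_sensitive=True)   # L-app request jumps ahead of the batch
qp.submit(3, latency_sensitive=False)
qp.submit(4, latency_sensitive=False)  # third T-app request triggers a flush
print(qp.sent)  # [[2], [1, 3, 4]]
```

The sketch shows why the separation matters: the latency-sensitive request is dispatched immediately rather than waiting behind queued T-app capsules, while T-app capsules still amortize per-send overhead by traveling as a batch.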



Acknowledgements

We would like to thank the ICA3PP reviewers for their insightful feedback. This work was supported in part by the Excellent Youth Foundation of Hunan Province under Grant No. 2021JJ10050 and the Science Foundation of NUDT under Grant ZK21-03.

Author information

Correspondence to Xuchao Xie or Dezun Dong.


Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Gu, W., Xie, X., Dong, D. (2023). LTNoT: Realizing the Trade-Offs Between Latency and Throughput in NVMe over TCP. In: Meng, W., Lu, R., Min, G., Vaidya, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2022. Lecture Notes in Computer Science, vol 13777. Springer, Cham. https://doi.org/10.1007/978-3-031-22677-9_22

  • DOI: https://doi.org/10.1007/978-3-031-22677-9_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-22676-2

  • Online ISBN: 978-3-031-22677-9

  • eBook Packages: Computer Science, Computer Science (R0)
