RAMCI: a novel asynchronous memory copying mechanism based on I/OAT

  • Regular Paper
CCF Transactions on High Performance Computing

Abstract

Memory copying is one of the most common operations in modern software. It is usually performed synchronously (sync) by the CPU, which incurs overheads such as cache pollution and CPU stalls, especially when large amounts of data are copied in bulk. To mitigate this issue, several works based on I/OAT, a dedicated and widely used hardware copy engine on Intel platforms, have been proposed, but they still have several problems: (1) no atomic allocation/revocation at the granularity of an I/OAT channel; (2) lack of interrupt support; and (3) complicated programming interfaces. We propose RAMCI, an asynchronous (async) memory copying mechanism based on the Intel I/OAT engine. It not only reduces the sync overheads but also overcomes the above three issues through (1) a lock mechanism built on the low-level CAS instruction; (2) a lightweight interrupt mechanism that signals the completion of memory copying, replacing the polling pattern, which consumes substantial CPU resources; and (3) a group of well-defined, abstract interfaces that allow programmers to use the underlying free I/OAT channels transparently. To support these interfaces, a novel scheduler for the I/OAT channels is introduced. It splits the source data into several pieces, each of which can be intelligently assigned a dedicated I/OAT channel so that the data is transferred in parallel. We evaluate RAMCI and compare it with other memory copying mechanisms in four NUMA scenarios. The experimental results show that RAMCI improves memory copying performance by up to 4.68\(\times\) while retaining almost the full capacity for parallel computing.
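The two ideas named in the abstract, CAS-based channel allocation and splitting one copy across several channels, can be illustrated with a minimal C11 sketch. It is not RAMCI's implementation and does not touch real I/OAT hardware. The names `channel_busy`, `channel_try_acquire`, `channel_release`, `copy_piece`, `plan_copy`, and the constant `NUM_CHANNELS` are all hypothetical, and the even split stands in for whatever policy the paper's scheduler actually applies.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_CHANNELS 4 /* hypothetical channel count, chosen for illustration */

/* One ownership flag per channel: 0 = free, 1 = claimed. */
atomic_int channel_busy[NUM_CHANNELS];

/* Atomically claim a channel with compare-and-swap.
 * Returns true only for the single caller that flips 0 -> 1. */
bool channel_try_acquire(int ch) {
    int expected = 0;
    return atomic_compare_exchange_strong(&channel_busy[ch], &expected, 1);
}

/* Give the channel back so another caller can claim it. */
void channel_release(int ch) {
    atomic_store(&channel_busy[ch], 0);
}

/* One piece of a larger copy, bound to the channel that will move it. */
struct copy_piece {
    int channel;
    size_t offset; /* byte offset into the source buffer */
    size_t length; /* number of bytes in this piece */
};

/* Claim up to max_pieces free channels and divide [0, total) evenly
 * among them. Returns the number of pieces written to out; 0 means no
 * channel was free. An even split is an assumption of this sketch. */
size_t plan_copy(size_t total, struct copy_piece *out, size_t max_pieces) {
    size_t n = 0;
    for (int ch = 0; ch < NUM_CHANNELS && n < max_pieces; ch++) {
        if (channel_try_acquire(ch)) {
            out[n++].channel = ch;
        }
    }
    if (n == 0) {
        return 0;
    }
    size_t base = total / n;
    size_t rem = total % n;
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        out[i].offset = off;
        out[i].length = base + (i < rem ? 1 : 0); /* spread the remainder */
        off += out[i].length;
    }
    assert(off == total); /* the pieces must cover the whole range */
    return n;
}
```

A caller would hand each returned piece to its channel, wait for completion (by interrupt in RAMCI's design, according to the abstract), and then call `channel_release` for every channel it claimed. Because the claim is a single compare-and-swap, two threads racing for the same channel cannot both succeed.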

Notes

  1. https://github.com/torvalds/linux/tree/master/drivers/dma/ioat.

  2. https://github.com/spdk/spdk/tree/master/lib/ioat.

  3. https://github.com/DPDK/dpdk.

Acknowledgements

This work was funded by the National Natural Science Foundation of China under grant numbers 61972164, 61772211 and U1811263, by the Guangdong Basic and Applied Basic Research Foundation under grant number 2019A1515011160, and by the Guangzhou Key Laboratory of Big Data and Intelligent Education under grant number 201905010009.

Author information

Corresponding author

Correspondence to Dingding Li.

About this article

Cite this article

Chen, Z., Li, D., Wang, Z. et al. RAMCI: a novel asynchronous memory copying mechanism based on I/OAT. CCF Trans. HPC 3, 129–143 (2021). https://doi.org/10.1007/s42514-021-00063-y

  • DOI: https://doi.org/10.1007/s42514-021-00063-y
