RAMCI: a novel asynchronous memory copying mechanism based on I/OAT

Chen, Zhenke; Li, Dingding; Wang, Zhiwen; Liu, Hai; Tang, Yong

doi:10.1007/s42514-021-00063-y

Regular Paper
Published: 04 March 2021

Volume 3, pages 129–143, (2021)

CCF Transactions on High Performance Computing

Zhenke Chen¹,
Dingding Li ORCID: orcid.org/0000-0001-9092-9814¹,
Zhiwen Wang¹,
Hai Liu¹ &
…
Yong Tang¹

Abstract

Memory copying is one of the most common operations in modern software. The operation is usually performed synchronously (sync) by the CPU, which incurs overheads such as cache pollution and CPU stalls, especially when large amounts of data are copied in bulk. To mitigate this problem, several works have built on I/OAT, a dedicated and popular hardware copy engine on Intel platforms, but they still suffer from several problems: (1) no atomic allocation/revocation at the granularity of an I/OAT channel; (2) a lack of interrupt support; and (3) complicated programming interfaces. We propose RAMCI, an asynchronous (async) memory copying mechanism based on the Intel I/OAT engine. RAMCI not only reduces the overheads of sync copying but also overcomes the above three issues through (1) a lock mechanism built on the low-level CAS instruction; (2) a lightweight interrupt mechanism that signals the completion of a memory copy, replacing the polling pattern, which consumes substantial CPU resources; and (3) a group of well-defined, abstract interfaces that allow programmers to use the underlying free I/OAT channels transparently. To support the interfaces, a novel scheduler for the I/OAT channels is introduced. It splits the source data into several pieces and intelligently allocates a dedicated I/OAT channel to each piece, so that the pieces are transferred in parallel. We evaluate RAMCI and compare it with other memory copying mechanisms in four NUMA scenarios. The experimental results show that RAMCI improves memory copying performance by up to 4.68\(\times\) while retaining almost the full capability for parallel computing.
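The two ideas named in the abstract, claiming a channel atomically with CAS and splitting one copy into per-channel pieces, can be illustrated with a minimal user-space sketch. It is not RAMCI's interface or implementation: the names `channel_try_acquire`, `channel_release`, `channels_acquire`, `split_copy` and the constant `N_CHANNELS` are hypothetical, `memcpy` stands in for submitting a descriptor to a hardware channel, and the pieces are copied one after another rather than in parallel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define N_CHANNELS 4  /* hypothetical channel count, chosen for illustration */

/* One ownership flag per simulated channel: 0 = free, 1 = owned. */
static atomic_int channel_busy[N_CHANNELS];

/* Try to claim one channel with a single compare-and-swap; never blocks. */
bool channel_try_acquire(int ch) {
    int expected = 0;
    return atomic_compare_exchange_strong(&channel_busy[ch], &expected, 1);
}

/* Give a channel back so that other callers can claim it. */
void channel_release(int ch) {
    atomic_store(&channel_busy[ch], 0);
}

/* Claim up to `want` free channels; returns how many were claimed. */
int channels_acquire(int want, int out[]) {
    int got = 0;
    for (int ch = 0; ch < N_CHANNELS && got < want; ch++)
        if (channel_try_acquire(ch))
            out[got++] = ch;
    return got;
}

/* Split a copy of `len` bytes into one contiguous piece per claimed channel.
 * ASSUMPTION: memcpy simulates the per-channel transfer; a real engine would
 * run the pieces concurrently and report completion asynchronously.
 * Returns the number of channels used, or 0 if no channel was free. */
int split_copy(void *dst, const void *src, size_t len) {
    int chans[N_CHANNELS];
    int n = channels_acquire(N_CHANNELS, chans);
    if (n == 0)
        return 0;  /* the caller could fall back to an ordinary memcpy */

    size_t base = len / (size_t)n, rem = len % (size_t)n, off = 0;
    for (int i = 0; i < n; i++) {
        /* The first `rem` pieces carry one extra byte each. */
        size_t piece = base + ((size_t)i < rem ? 1 : 0);
        memcpy((char *)dst + off, (const char *)src + off, piece);
        off += piece;
    }
    assert(off == len);  /* the pieces cover the buffer exactly once */

    for (int i = 0; i < n; i++)
        channel_release(chans[i]);
    return n;
}
```

Because each flag changes from free to owned in a single atomic step, two callers can never both believe they own the same channel, which is the property the abstract's first point asks for. The split gives every claimed channel a contiguous, non-overlapping range, so the pieces could be transferred independently.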

Acknowledgements

This work was funded by the National Natural Science Foundation of China under grant numbers 61972164, 61772211 and U1811263, by the Guangdong Basic and Applied Basic Research Foundation under grant number 2019A1515011160, and by the Guangzhou Key Laboratory of Big Data and Intelligent Education under grant number 201905010009.

Author information

Authors and Affiliations

School of Computer Science, South China Normal University, Guangzhou, 510613, China
Zhenke Chen, Dingding Li, Zhiwen Wang, Hai Liu & Yong Tang

Corresponding author

Correspondence to Dingding Li.

About this article

Cite this article

Chen, Z., Li, D., Wang, Z. et al. RAMCI: a novel asynchronous memory copying mechanism based on I/OAT. CCF Trans. HPC 3, 129–143 (2021). https://doi.org/10.1007/s42514-021-00063-y

Received: 29 August 2020
Accepted: 29 January 2021
Published: 04 March 2021
Issue Date: June 2021
DOI: https://doi.org/10.1007/s42514-021-00063-y
