Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems

Fu, Weiwei; Chen, Tianzhou; Wang, Chao; Liu, Li

doi:10.1007/s11227-014-1240-8

Published: 24 June 2014

Volume 69, pages 1491–1516 (2014)

The Journal of Supercomputing

Abstract

On-chip distributed memory systems have become an attractive solution for the massively parallel memory accesses expected in future many-core processors. However, the increasing number of on-chip cores and memory controllers inevitably introduces many remote memory accesses, which generate a large amount of on-chip traffic and put great pressure on the interconnect. This paper optimizes on-chip memory access traffic via runtime thread migration. We first analyze memory access behavior in multi-threaded applications and find that memory access targets and volumes remain similar over short periods, which makes runtime prediction feasible. Over long periods, however, the access targets shift considerably, which motivates moving threads toward their data dynamically. Based on these observations, we propose a novel low-cost, distributed thread migration algorithm that adjusts thread placement in chains based on benefit estimation. We present the details of the workflow, including how migration requests are triggered and arbitrated and how migration chains are determined. Simulation results show that our algorithm achieves a system performance speedup of 11.5% and reduces average memory access latency by 11.0%. It finds a small number of effective thread migrations that optimize on-chip memory access traffic with acceptable hardware and runtime overheads.
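
To make the idea of benefit estimation concrete, the following is a minimal illustrative sketch, not the authors' algorithm. The abstract does not give the benefit function, so everything here is an assumption: tiles sit on a 2D mesh, distance is the Manhattan hop count, the per-period access counts to each memory controller are taken as the prediction for the next period, and `migration_cost` is a hypothetical fixed penalty in the same hop-weighted units. The chained migration, trigger, and arbitration steps are not modeled.

```python
from typing import Dict, Iterable, Optional, Tuple

Coord = Tuple[int, int]  # (x, y) position of a tile on the mesh (assumed layout)


def hops(a: Coord, b: Coord) -> int:
    """Manhattan hop count between two tiles (assumes dimension-ordered routing)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def traffic(core: Coord, accesses: Dict[Coord, int]) -> int:
    """Hop-weighted memory traffic if the thread runs on `core`.

    `accesses` maps each memory controller position to the number of
    accesses the thread issued to it in the last sampling period.
    """
    return sum(count * hops(core, mc) for mc, count in accesses.items())


def migration_benefit(src: Coord, dst: Coord,
                      accesses: Dict[Coord, int],
                      migration_cost: int) -> int:
    """Estimated traffic saved by moving from `src` to `dst`, minus the move cost."""
    return traffic(src, accesses) - traffic(dst, accesses) - migration_cost


def best_target(src: Coord, candidates: Iterable[Coord],
                accesses: Dict[Coord, int],
                migration_cost: int) -> Tuple[Optional[Coord], int]:
    """Return the candidate with the largest positive benefit, or (None, 0)."""
    best, best_gain = None, 0
    for dst in candidates:
        gain = migration_benefit(src, dst, accesses, migration_cost)
        if gain > best_gain:
            best, best_gain = dst, gain
    return best, best_gain
```

In this sketch a thread on tile (0, 0) that sends most of its accesses to a controller at (3, 3) would see a positive benefit for moving toward (3, 3), while a large enough `migration_cost` suppresses the move entirely. A real implementation would also have to handle the case where the target tile is already occupied, which is presumably what the migration chains in the abstract address.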

Acknowledgments

This paper is supported by the National Natural Science Foundation of China under Grant No. 61379035, the National Natural Science Foundation of Zhejiang Province under Grant No. LY14F020005, the Open Fund of the Mobile Network Application Technology Key Laboratory of Zhejiang Province, and the Innovation Group of New Generation of Mobile Internet Software and Services of Zhejiang Province.

Author information

Authors and Affiliations

College of Computer Science, Zhejiang University, Hangzhou, 310027, Zhejiang, People’s Republic of China
Weiwei Fu, Tianzhou Chen & Chao Wang
School of Information, Zhejiang Sci-Tech University, Hangzhou, 310059, Zhejiang, People’s Republic of China
Li Liu

Corresponding author

Correspondence to Weiwei Fu.

About this article

Cite this article

Fu, W., Chen, T., Wang, C. et al. Optimizing memory access traffic via runtime thread migration for on-chip distributed memory systems. J Supercomput 69, 1491–1516 (2014). https://doi.org/10.1007/s11227-014-1240-8

Issue Date: September 2014
DOI: https://doi.org/10.1007/s11227-014-1240-8
