Abstract
Affinity-aware thread mapping is a method to effectively exploit cache resources in multicore processors. We propose an affinity- and architecture-aware thread mapping technique which maximizes data reuse and minimizes remote communications and cache coherency costs of multi-threaded applications. It consists of three main components: Data Sharing Estimator, Affine Mapping Finder and Maximum Speedup Predictor. Data Sharing Estimator creates application-specific data dependency signatures used by Affine Mapping Finder to determine the appropriate thread mapping of application for a given architecture. To prevent excessive thread migration, Maximum Speedup Predictor estimates the speedup of the obtained mapping and ignores it if it causes no significant performance improvement. The proposed framework is evaluated using Phoenix benchmark suite on two different multicore architectures. The proposed thread mapping approach gives 25% improvement in performance compared to default Linux scheduler. We also elucidate that affinity-based thread mapping approaches, which only consider the number of shared blocks, are not appropriate enough to accurately estimate data dependency between threads and determine the proper thread mapping.











Similar content being viewed by others
References
Gepner P, Kowalik MF (2006) Multi-core processors: new way to achieve high system performance. In: International Symposium on Parallel Computing in Electrical Engineering, 2006. PAR ELEC 2006. IEEE, pp 9–13
Shukla SK, Murthy C, Chande P (2015) A survey of approaches used in parallel architectures and multi-core processors, for performance improvement. In: Progress in Systems Engineering. Springer, pp 537–545
Khammassi N, Le Lann J-C (2014) Design and implementation of a cache hierarchy-aware task scheduling for parallel loops on multicore architectures. PDCTA, Sydney, Australia
Zhang L, Liu Y, Wang R, Qian D (2014) Lightweight dynamic partitioning for last-level cache of multicore processor on real system. J Supercomput 69(2):547–560
Sun Z, Wang R, Zhang L, Li Q, Chen L, Wu J, Liu Y (2012) Cache-aware scheduling for energy efficiency on multi-processors. In: 2012 International Conference on Computer Distributed Control and Intelligent Environmental Monitoring (CDCIEM). IEEE, pp 182–186
Ding W, Zhang Y, Kandemir M, Srinivas J, Yedlapalli P (2013) Locality-aware mapping and scheduling for multicores. In: 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, pp 1–12
Zhang EZ, Jiang Y, Shen X (2010) Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? In: ACM Sigplan Notices. ACM, vol 45, no 5, pp 203–212
Kazempour V, Fedorova A, Alagheband P (2008) Performance implications of cache affinity on multicore processors. Euro-Par 2008–Parallel Processing, pp 151–161
Valiant LG (2011) A bridging model for multi-core computing. J Comput Syst Sci 77(1):154–166
Girão G, de Oliveira BC, Soares R, Silva IS (2007) Cache coherency communication cost in a NoC-based MPSoC platform. In: Proceedings of the 20th Annual Conference on Integrated Circuits and Systems Design. ACM, pp 288–293
Ramos S, Hoefler T (2013) Modeling communication in cache-coherent SMP systems: a case-study with Xeon Phi. In: Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing. ACM, pp 97–108
Song F, Moore S, Dongarra J (2009) Analytical modeling for affinity-based thread scheduling on multicore plataforms. In: Symposium on Principles and Practice of Parallel Programming
Terboven C, Schmidl D, Jin H, Reichstein T et al (2008) Data and thread affinity in OpenMP programs. In: Proceedings of the 2008 Workshop on Memory Access on Future Processors: A Solved Problem? ACM, pp 377–384
Anbar A, Serres O, Kayraklioglu E, Badawy A-HA, El-Ghazawi T (2016) Exploiting hierarchical locality in deep parallel architectures. ACM Trans Archit Code Optim TACO 13(2):16
Yang T-F, Lin C-H, Yang C-L (2010) Cache-aware task scheduling on multi-core architecture. In: 2010 International Symposium on VLSI Design Automation and Test (VLSI-DAT). IEEE, pp 139–142
Wang E, Ni F, Chen J, Wang H, Li Y (2016) Cache-aware cooperative task mapping in multi-core real-time systems. Int J Inf Electron Eng 6(2):72
Ghosh M, Nathuji R, Lee M, Schwan K, Lee H-HS (2011) Symbiotic scheduling for shared caches in multi-core systems using memory footprint signature. In: 2011 International Conference on Parallel Processing (ICPP). IEEE, pp 11–20
Saez JC, Shelepov D, Fedorova A, Prieto M (2011) Leveraging workload diversity through os scheduling to maximize performance on single-isa heterogeneous multicore systems. J Parallel Distrib Comput 71(1):114–131
Shelepov D, Saez Alcaide JC, Jeffery S, Fedorova A, Perez N, Huang ZF, Blagodurov S, Kumar V (2009) Hass: a scheduler for heterogeneous multicore systems. ACM SIGOPS Oper Syst Rev 43(2):66–75
Luo H, Li P, Ding C (2017) Thread data sharing in cache: theory and measurement. In: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, pp 103–115
Luk C-K, Cohn R, Muth R, Patil H, Klauser A, Lowney G, Wallace S, Reddi VJ, Hazelwood K (2005) Pin: building customized program analysis tools with dynamic instrumentation. In: ACM SIGPLAN Notices. ACM, vol 40, no 6, pp 190–200
Mattson RL, Gecsei J, Slutz DR, Traiger IL (1970) Evaluation techniques for storage hierarchies. IBM Syst J 9(2):78–117
Shelepov D, Fedorova A (2008) Scheduling on heterogeneous multicore processors using architectural signatures
Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359–392
Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C (2007) Evaluating mapreduce for multi-core and multiprocessor systems. In: IEEE 13th International Symposium on High Performance Computer Architecture, 2007. HPCA 2007. IEEE, pp 13–24
Acknowledgements
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number 14/IA/2474.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Khaleghzadeh, H., Deldari, H., Reddy, R. et al. Hierarchical multicore thread mapping via estimation of remote communication. J Supercomput 74, 1321–1340 (2018). https://doi.org/10.1007/s11227-017-2176-6
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-017-2176-6