
Affinity-Based Thread and Data Mapping in Shared Memory Systems


Abstract

Shared memory architectures have recently experienced a large increase in thread-level parallelism, leading to complex memory hierarchies with multiple cache levels and memory controllers. These designs introduce Non-Uniform Memory Access (NUMA) behavior, in which the performance and energy consumption of a memory access depend on where the accessed data is located in the memory hierarchy: accesses to local caches or memory controllers are generally more efficient than accesses to remote ones. A common way to improve the locality and balance of memory accesses is to map threads to cores and data to memory controllers according to the affinity between threads and data. Such mapping techniques can operate at different hardware and software levels, which affects their complexity, applicability, and the performance and energy consumption gains they achieve. In this article, we introduce a taxonomy to classify different mapping mechanisms and provide a comprehensive overview of existing solutions.
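
To make the idea of affinity-based mapping concrete, the sketch below pins each thread to a core and allocates its data on the NUMA node local to that core, using the Linux pthread affinity interface and the libnuma allocation API (link with -lnuma). It is only a minimal illustration of the general technique, not the mechanism of any particular work surveyed here; the two-thread setup, core IDs, and array size are placeholder assumptions.

    /* Minimal sketch: explicit thread-to-core and data-to-node mapping on Linux.
     * Assumes libnuma is installed and that cores 0 and 1 exist (placeholders). */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define N (1 << 20)

    /* Pin the calling thread to one core, then allocate and touch data on the
     * NUMA node local to that core, so that its memory accesses stay local. */
    static void *worker(void *arg) {
        int core = *(int *)arg;

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        int node = numa_node_of_cpu(core);              /* node local to this core */
        double *data = numa_alloc_onnode(N * sizeof(double), node);

        double sum = 0.0;
        for (int i = 0; i < N; i++) {                   /* local accesses only */
            data[i] = i;
            sum += data[i];
        }
        printf("core %d, node %d, sum %.0f\n", core, node, sum);

        numa_free(data, N * sizeof(double));
        return NULL;
    }

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }
        int cores[2] = {0, 1};                          /* placeholder core IDs */
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &cores[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

A similar effect can often be obtained without modifying the application, for example by pinning threads with tools such as taskset or numactl and relying on the operating system's first-touch page placement, which corresponds to performing the mapping at the operating system level rather than inside the application.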




    • Published in

      ACM Computing Surveys, Volume 49, Issue 4 (December 2017), 666 pages
      ISSN: 0360-0300
      EISSN: 1557-7341
      DOI: 10.1145/3022634
      Editor: Sartaj Sahni

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 5 December 2016
      • Accepted: 1 October 2016
      • Revised: 1 September 2016
      • Received: 1 June 2016
      Published in csur Volume 49, Issue 4


      Qualifiers

      • survey
      • Research
      • Refereed
