survey

Affinity-Based Thread and Data Mapping in Shared Memory Systems

Published: 05 December 2016

Abstract

Shared memory architectures have recently experienced a large increase in thread-level parallelism, leading to complex memory hierarchies with multiple cache levels and memory controllers. These designs exhibit Non-Uniform Memory Access (NUMA) behavior, in which the performance and energy consumption of a memory access depend on where the data is located in the memory hierarchy. Accesses to local caches or memory controllers are generally more efficient than accesses to remote ones. A common way to improve the locality and balance of memory accesses is to map threads to cores and data to memory controllers based on the affinity between threads and data. Such mapping techniques can operate at different hardware and software levels, which affects their complexity, their applicability, and the resulting gains in performance and energy consumption. In this article, we introduce a taxonomy to classify different mapping mechanisms and provide a comprehensive overview of existing solutions.
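To make the idea of affinity-based thread mapping concrete, the following is a minimal sketch and not a mechanism taken from the surveyed work. It assumes a hypothetical communication matrix `comm`, where `comm[i][j]` counts accesses to data shared by threads `i` and `j`, and a hypothetical machine in which cores `2k` and `2k+1` share a cache. The function name `pair_threads_by_affinity` and the greedy pairing strategy are illustrative choices only.

```python
from itertools import combinations


def pair_threads_by_affinity(comm):
    """Greedy matching: repeatedly pair the two unplaced threads that
    communicate the most and place each pair on two adjacent cores.

    Assumption (hypothetical topology): cores 2k and 2k+1 share a cache.
    Returns a dict mapping thread id -> core id.
    """
    n = len(comm)
    # All thread pairs, heaviest communication first (ties keep index order).
    pairs = sorted(combinations(range(n), 2),
                   key=lambda p: -comm[p[0]][p[1]])
    placed = set()
    mapping = {}
    next_core = 0
    for a, b in pairs:
        if a in placed or b in placed:
            continue
        mapping[a], mapping[b] = next_core, next_core + 1
        placed.update((a, b))
        next_core += 2
    # With an odd thread count, one thread is left over; give it the next core.
    for t in range(n):
        if t not in placed:
            mapping[t] = next_core
            next_core += 1
    return mapping


# Hypothetical symmetric communication matrix for four threads:
# threads 0 and 2 share the most data, followed by threads 1 and 3.
comm = [
    [0, 1, 9, 0],
    [1, 0, 0, 8],
    [9, 0, 0, 2],
    [0, 8, 2, 0],
]
mapping = pair_threads_by_affinity(comm)
print(mapping)
```

In this example, threads 0 and 2 end up on cores 0 and 1, and threads 1 and 3 on cores 2 and 3, so each heavily communicating pair shares a cache. A greedy matching like this is simple but not guaranteed to be optimal, and it considers only locality, not the balance of memory accesses that the abstract also mentions.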

References

[1]
L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. 2010. HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience 22, 6 (2010), 685--701.
[2]
Ahmad Anbar, Olivier Serres, Engin Kayraklioglu, Abdel-Hameed A. Badawy, and Tarek El-Ghazawi. 2016. Exploiting hierarchical locality in deep parallel architectures. ACM Transactions on Architecture and Code Optimization (TACO) 13, 2 (2016), 1--25.
[3]
Hideo Aochi, Thomas Ulrich, Ariane Ducellier, Fabrice Dupros, and David Michea. 2013. Finite difference simulations of seismic wave propagation for understanding earthquake physics and predicting ground motions: Advances and challenges. Journal of Physics: Conference Series 454, 1 (Aug 2013), 012010.
[4]
Argonne National Laboratory. 2014. Using the Hydra Process Manager. Retrieved 2015-06-08 from https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager.
[5]
ARM Limited. 2013. big.LITTLE Technology: The Future of Mobile. Technical Report. Retrieved 2016-09-01 from https://www.arm.com/files/pdf/big_LITTLE_Technology_the_Futue_of_Mobile.pdf.
[6]
Manu Awasthi, David W. Nellans, Kshitij Sudan, Rajeev Balasubramonian, and Al Davis. 2010. Handling the problems and opportunities posed by multiple on-chip memory controllers. In International Conference on Parallel Architectures and Compilation Techniques (PACT). 319--330.
[7]
Reza Azimi, David K. Tam, Livio Soares, and Michael Stumm. 2009. Enhancing operating system support for multicore processors by using hardware performance monitoring. ACM SIGOPS Operating Systems Review 43, 2 (Apr 2009), 56--65.
[8]
David H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS parallel benchmarks. International Journal of Supercomputer Applications 5, 3 (1991), 66--73.
[9]
G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz. 2014. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica 23, May (2014), 1--155.
[10]
Nick Barrow-Williams, Christian Fensch, and Simon Moore. 2009. A communication characterisation of splash-2 and parsec. In IEEE International Symposium on Workload Characterization (IISWC’09). 86--97.
[11]
Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient virtual memory for big memory servers. In International Symposium on Computer Architecture (ISCA’13). 237--248.
[12]
David Beniamine, Matthias Diener, Guillaume Huard, and Philippe O. A. Navaux. 2015. TABARNAC: Visualizing and resolving memory access issues on NUMA architectures. In Workshop on Visual Performance Analysis (VPA’15).
[13]
Abhinav Bhatele. 2010. Automating Topology Aware Mapping for Supercomputers. Ph.D. Dissertation. University of Illinois at Urbana-Champaign.
[14]
Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In International Conference on Parallel Architectures and Compilation Techniques (PACT’08). 72--81.
[15]
John Bircsak, Peter Craig, RaeLyn Crowell, Zarka Cvetanovic, Jonathan Harris, C. Alexander Nelson, and Carl D. Offner. 2000. Extending OpenMP for NUMA machines. In ACM/IEEE Conference on Supercomputing (SC’00). 163--181.
[16]
Sergey Blagodurov, Sergey Zhuravlev, Mohammad Dashti, and Alexandra Fedorova. 2011. A case for NUMA-aware contention management on multicore systems. In USENIX Annual Technical Conference (ATC’11). 557--571.
[17]
Robert D. Blumofe and Charles E. Leiserson. 1994. Scheduling multithreaded computations by work stealing. In Symposium on Foundations of Computer Science (FOCS’94). 1--29.
[18]
Jacques E. Boillat and Peter G. Kropf. 1990. A fast distributed mapping algorithm. In Joint International Conference on Vector and Parallel Processing (CONPAR 90 -- VAPP IV). 405--416.
[19]
William J. Bolosky and Michael L. Scott. 1992. Evaluation of multiprocessor memory systems using off-line optimal behavior. Journal of Parallel and Distributed Computing (JPDC) 15, 4 (1992), 382--398.
[20]
William J. Bolosky, Michael L. Scott, Robert P. Fitzgerald, Robert J. Fowler, and Alan L. Cox. 1991. NUMA policies and their relation to memory architecture. ACM SIGARCH Computer Architecture News 19, 2 (1991), 212--221.
[21]
Barbara Brandfass, Thomas Alrutz, and Thomas Gerhold. 2013. Rank reordering for MPI communication optimization. Computers 8 Fluids 80, July (2013), 372--380.
[22]
Timothy Brecht. 1993. On the importance of parallel application placement in NUMA multiprocessors. In Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS). 1--18.
[23]
David Brooks, Margaret Martonosi, John-David Wellman, and Pradip Bose. 2000. Power-performance modeling and tradeoff analysis for a high end microprocessor. In International Workshop on Power-Aware Computer Systems (PACS’00). 126--136.
[24]
François Broquedis, Olivier Aumage, Brice Goglin, Samuel Thibault, Pierre-André Wacrenier, and Raymond Namyst. 2010a. Structuring the execution of OpenMP applications for multicore architectures. In IEEE International Parallel 8 Distributed Processing Symposium (IPDPS’10). 1--10.
[25]
Franois Broquedis, Jerome Clet-Ortega, Stephanie Moreaud, Nathalie Furmento, Brice Goglin, Guillaume Mercier, Samuel Thibault, and Raymond Namyst. 2010b. hwloc: A generic framework for managing hardware affinities in HPC applications. In Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP’10). 180--186.
[26]
François Broquedis, Nathalie Furmento, Brice Goglin, Pierre-André Wacrenier, and Raymond Namyst. 2010c. ForestGOMP: An efficient OpenMP environment for NUMA architectures. International Journal of Parallel Programming 38, 5--6 (2010), 418--439.
[27]
Mats Brorsson. 1989. Performance Impact of Code and Data Placement on the IBM RP3. Technical Report.
[28]
Bryan Buck and Jeffrey K. Hollingsworth. 2000. An API for runtime code patching. International Journal of High Performance Computing Applications (IJHPCA) 14, 4 (2000), 317--329.
[29]
J. Mark Bull and Chris Johnson. 2002. Data distribution, migration and replication on a cc-NUMA architecture. In European Workshop on OpenMP (EWOMP’02). 1--5.
[30]
Anthony Chan, William Gropp, and Ewing Lusk. 1998. User’s Guide for MPE Extensions for MPI Programs. Technical Report.
[31]
Rohit Chandra, Scott Devine, Ben Verghese, Anoop Gupta, and Mendel Rosenblum. 1994. Scheduling and page migration for multiprocessor compute servers. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’94). 12--24.
[32]
Hu Chen, Wenguang Chen, Jian Huang, Bob Robert, and H. Kuhn. 2006. MPIPP: An automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’06). 353--360.
[33]
Zeshan Chishti, Michael D. Powell, and T. N. Vijaykumar. 2005. Optimizing replication, communication, and capacity allocation in CMPs. ACM SIGARCH Computer Architecture News 33, 2 (May 2005), 357--368.
[34]
Theofanis Constantinou, Yiannakis Sazeides, Pierre Michaud, Damien Fetis, and Andre Seznec. 2005. Performance implications of single thread migration on a chip multi-core. ACM SIGARCH Computer Architecture News 33, 4 (2005), 80--91.
[35]
Pat Conway. 2007. The AMD opteron northbridge architecture. IEEE Micro 27, 2 (2007), 10--21.
[36]
Julita Corbalan, Xavier Martorell, and Jesus Labarta. 2003. Evaluation of the memory page migration influence in the system performance: The case of the SGI O2000. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC’03). 121--129.
[37]
Jonathan Corbet. 2012a. AutoNUMA: The Other Approach to NUMA Scheduling. Retrieved 2015-06-08 from http://lwn.net/Articles/488709/
[38]
Jonathan Corbet. 2012b. Toward Better NUMA Scheduling. Retrieved 2015-06-08 from http://lwn.net/Articles/486858/.
[39]
Eduardo H. M. Cruz. 2012. Dynamic Detection of the Communication Pattern in Shared Memory Environments for Thread Mapping. Master’s thesis.
[40]
Eduardo H. M. Cruz, Marco A. Z. Alves, Alexandre Carissimi, Philippe O. A. Navaux, Christiane Pousa Ribeiro, and Jean-François Méhaut. 2011. Using memory access traces to map threads and data on hierarchical multi-core platforms. In IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum. 551--558.
[41]
Eduardo H. M. Cruz, Matthias Diener, Marco A. Z. Alves, and Philippe O. A. Navaux. 2014a. Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols. Journal of Parallel and Distributed Computing (JPDC) 74, 3 (Mar 2014), 2215--2228.
[42]
Eduardo H. M. Cruz, Matthias Diener, Marco A. Z. Alves, Laércio L. Pilla, and Philippe O. A. Navaux. 2014b. Optimizing memory locality using a locality-aware page table. In International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’14). 198--205.
[43]
Eduardo H. M. Cruz, Matthias Diener, and Philippe O. A. Navaux. 2012. Using the translation lookaside buffer to map threads in parallel applications based on shared memory. In IEEE International Parallel 8 Distributed Processing Symposium (IPDPS’12). 532--543.
[44]
Eduardo H. M. Cruz, Matthias Diener, and Philippe O. A. Navaux. 2015a. Communication-aware thread mapping using the translation lookaside buffer. Concurrency Computation: Practice and Experience 22, 6 (2015), 685--701.
[45]
Eduardo H. M. Cruz, Matthias Diener, Laércio L. Pilla, and Philippe O. A. Navaux. 2015b. An efficient algorithm for communication-based task mapping. In International Conference on Parallel, Distributed, and Network-Based Processing (PDP’15). 207--214.
[46]
William J. Dally. 2010. GPU Computing to Exascale and Beyond. Technical Report.
[47]
Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi. 2013. Application-to-core mapping policies to reduce memory system interference in multi-core systems. In International Symposium on High Performance Computer Architecture (HPCA’13). 107--118.
[48]
Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quéma, and Mark Roth. 2013. Traffic management: A holistic approach to memory placement on NUMA systems. In Architectural Support for Programming Languages and Operating Systems (ASPLOS’13). 381--393.
[49]
Karen D. Devine, Erik G. Boman, Robert T. Heaphy, Rob H. Bisseling, and Umit V. Catalyurek. 2006. Parallel hypergraph partitioning for scientific computing. In IEEE International Parallel 8 Distributed Processing Symposium (IPDPS’06). 124--133.
[50]
Matthias Diener. 2010. Evaluating Thread Placement Improvements in Multi-core Architectures. Master’s thesis. Berlin Institute of Technology.
[51]
Matthias Diener. 2015. Automatic Task and Data Mapping in Shared Memory Architectures. Ph.D. Dissertation. Federal University of Rio Grande do Sul.
[52]
Matthias Diener, Eduardo H. M. Cruz, Marco A. Z. Alves, Mohammad S. Alhakeem, and Philippe O. A. Navaux. 2015. Locality and balance for communication-aware thread mapping in multicore systems. In Euro-Par. 196--208.
[53]
Matthias Diener, Eduardo H. M. Cruz, Marco A. Z. Alves, and Philippe O. A. Navaux. 2016. Communication in shared memory: Concepts, definitions, and efficient detection. In Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP’16). 151--158.
[54]
Matthias Diener, Eduardo H. M. Cruz, Marco A. Z. Alves, Philippe O. A. Navaux, Anselm Busse, and Hans-Ulrich Heiss. 2015. Kernel-based thread and data mapping for improved memory affinity. IEEE Transactions on Parallel and Distributed Systems (TPDS) 26, X (2015), 1--14.
[55]
Matthias Diener, Eduardo H. M. Cruz, and Philippe O. A. Navaux. 2013. Communication-based mapping using shared pages. In IEEE International Parallel 8 Distributed Processing Symposium (IPDPS’13). 700--711.
[56]
Matthias Diener, Eduardo H. M. Cruz, and Philippe O. A. Navaux. 2015. Locality vs. balance: Exploring data mapping policies on NUMA systems. In International Conference on Parallel, Distributed, and Network-Based Processing (PDP’15). 9--16.
[57]
Matthias Diener, Eduardo H. M. Cruz, Philippe O. A. Navaux, Anselm Busse, and Hans-Ulrich Heiß. 2014. kMAF: Automatic kernel-level management of thread and data affinity. In International Conference on Parallel Architectures and Compilation Techniques (PACT’14). 277--288.
[58]
Matthias Diener, Eduardo H. M. Cruz, Philippe O. A. Navaux, Anselm Busse, and Hans-Ulrich Heiß. 2015a. Communication-aware process and thread mapping using online communication detection. Parallel Computing 43, March (2015), 43--63.
[59]
Matthias Diener, Eduardo H. M. Cruz, Laércio L. Pilla, Fabrice Dupros, and Philippe O. A. Navaux. 2015b. Characterizing communication and page usage of parallel applications for thread and data mapping. Performance Evaluation 88-89, June (2015), 18--36.
[60]
Matthias Diener, Felipe L. Madruga, Eduardo R. Rodrigues, Marco A. Z. Alves, and Philippe O. A. Navaux. 2010. Evaluating thread placement based on memory access patterns for multi-core processors. In IEEE International Conference on High Performance Computing and Communications (HPCC). 491--496.
[61]
Wei Ding, Yuanrui Zhang, Mahmut Kandemir, Jithendra Srinivas, and Praveen Yedlapalli. 2013. Locality-aware mapping and scheduling for multicores. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO’13). 1--12.
[62]
Ulrich Drepper. 2007. What Every Programmer Should Know About Memory. Technical Report 4. Red Hat, Inc. 114 pages.
[63]
Paul J. Drongowski. 2007. Instruction-Based Sampling: A New Performance Analysis Technique for AMD Family 10h Processors. Technical Report.
[64]
Fabrice Dupros, Hideo Aochi, Ariane Ducellier, Dimitri Komatitsch, and Jean Roman. 2008. Exploiting intensive multithreading for the efficient simulation of 3D seismic wave propagation. In IEEE International Conference on Computational Science and Engineering (CSE’08). 253--260.
[65]
Fabrice Dupros, Christiane Pousa, Alexandre Carissimi, and Jean-François Méhaut. 2010. Parallel simulations of seismic wave propagation on NUMA architectures. In Parallel Computing: From Multicores and GPU’s to Petascale. 67--74.
[66]
Alexandre E. Eichenberger, Christian Terboven, Michael Wong, and Dieter An Mey. 2012. The design of OpenMP thread affinity. Lecture Notes in Computer Science 7312 LNCS (2012), 15--28.
[67]
Steven Frank, Henry Burkhardt III, and James Rothnie. 1993. The KSR1: Bridging the gap between shared memory and MPPs. In IEEE Compcon. 285--294.
[68]
S. R. Freitas, K. M. Longo, M. A. F. Silva Dias, R. Chatfield, P. Silva Dias, P. Artaxo, M. O. Andreae, G. Grell, L. F. Rodrigues, A. Fazenda, and J. Panetta. 2009. The coupled aerosol and tracer transport model to the brazilian developments on the regional atmospheric modeling system (CATT-BRAMS) part 1: Model description and evaluation. Atmospheric Chemistry and Physics 9, 8 (2009), 2843--2861.
[69]
Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. 2004. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Recent Advances in Parallel Virtual Machine and Message Passing Interface (PVMMPI’04). 97--104.
[70]
Fabien Gaud, Baptiste Lepers, Justin Funston, Mohammad Dashti, Alexandra Fedorova, Vivien Quéma, Renaud Lachaize, and Mark Roth. 2015. Challenges of memory management on modern NUMA systems. Commununications of the ACM 58, 12 (2015), 59--66.
[71]
Ilaria Di Gennaro, Alessandro Pellegrini, and Francesco Quaglia. 2016. OS-based NUMA optimization: Tackling the case of truly multi-thread applications with non-partitioned virtual page accesses. In IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGRID’16). 291--300.
[72]
Alfredo Giménez, Todd Gamblin, Barry Rountree, Abhinav Bhatele, Ilir Jusufi, Peer-Timo Bremer, and Bernd Hamann. 2014. Dissecting on-node memory access performance: A semantic approach. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14). 166--176.
[73]
Alfredo Giménez, Ilir Jusufi, Abhinav Bhatele, Todd Gamblin, Martin Schulz, Peer-Timo Bremer, and Bernd Hamann. 2015. MemAxes: Interactive Visual Analysis of Memory Access Data. Technical Report.
[74]
Roland Glantz, Henning Meyerhenke, and Alexander Noe. 2015. Algorithms for mapping parallel processes onto grid and torus architectures. In International Conference on Parallel, Distributed, and Network-Based Processing (PDP’15). 236--243.
[75]
Brice Goglin and Nathalie Furmento. 2009. Enabling high-performance memory migration for multithreaded applications on linux. In International Symposium on Parallel 8 Distributed Processing (IPDPS’09).
[76]
William Gropp. 2002. MPICH2: A new start for MPI implementations. In Recent Advances in Parallel Virtual Machine and Message Passing Interface.
[77]
Erik Hagersten and Michael Koster. 1999. WildFire: A scalable path for SMPs. In International Symposium on High Performance Computer Architecture (HPCA’99). 172.
[78]
Joshua Hursey, Jeffrey M. Squyres, and Terry Dontje. 2011. Locality-aware parallel process mapping for multi-core HPC systems. In IEEE International Conference on Cluster Computing (CLUSTER’11). 527--531.
[79]
Intel. 2012. Using KMP_AFFINITY to create OpenMP thread mapping to OS proc IDs. Retrieved 2015-06-08 from https://software.intel.com/en-us/articles/using-kmp-affinity-to-create-openmp-thread-mapping-to-os-proc-ids.
[80]
Intel. 2013. Intel Trace Analyzer and Collector. Retrieved from http://software.intel.com/en-us/intel-trace-analyzer.
[81]
Satoshi Ito, Kazuya Goto, and Kenji Ono. 2013. Automatically optimized core mapping to subdomains of domain decomposition method on multicore parallel environments. Computers 8 Fluids 80 (jul 2013), 88--93.
[82]
Emmanuel Jeannot and Guillaume Mercier. 2010. Near-optimal placement of MPI processes on hierarchical NUMA architectures. In Euro-Par Parallel Processing. 199--210.
[83]
Emmanuel Jeannot, Guillaume Mercier, and François Tessier. 2014. Process placement in multicore clusters: Algorithmic issues and practical techniques. IEEE Transactions on Parallel and Distributed Systems 25, 4 (Apr 2014), 993--1002.
[84]
H. Jin, M. Frumkin, and J. Yan. 1999. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical Report October. NASA.
[85]
C Karlsson, T Davies, and Zizhong Chen. 2012. Optimizing process-to-core mappings for application level multi-dimensional MPI communications. In IEEE International Conference on Cluster Computing (CLUSTER’12). 486--494.
[86]
George Karypis and Vipin Kumar. 1996. Parallel multilevel graph partitioning. In International Parallel Processing Symposium (IPPS’96). 314--319.
[87]
George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20, 1 (Jan 1998), 359--392.
[88]
Andi Kleen. 2004. An NUMA API for Linux. Technical Report. Retrieved rom http://andikleen.de/numaapi3.pdf.
[89]
Tobias Klug, Michael Ott, Josef Weidendorfer, and Carsten Trinitis. 2008. autopin -- automated optimization of thread-to-core pinning on multicore systems. In Transactions on High-Performance Embedded Architectures and Compilers (HiPEAC). 219--235.
[90]
Renaud Lachaize, Baptiste Lepers, and Vivien Quéma. 2012. MemProf: A memory profiler for NUMA multicore systems. In USENIX Annual Technical Conference (ATC’12). 53--64.
[91]
Stefan Lankes, Boris Bierbaum, and Thomas Bemmerl. 2010. Affinity-on-next-touch: An extension to the linux kernel for NUMA architectures. Lecture Notes in Computer Science 6067 LNCS, PART 1 (2010), 576--585.
[92]
Richard P. LaRowe, Mark A. Holliday, and Carla Schlatter Ellis. 1992. An analysis of dynamic page placement on a NUMA multiprocessor. ACM SIGMETRICS Performance Evaluation Review 20, 1 (1992), 23--34.
[93]
Richard P. LaRowe Jr. 1991. Page Placement For Non-Uniform Memory Access Time (NUMA) Shared Memory Multiprocessors. Ph.D. Dissertation. Duke University.
[94]
Chris Lattner. 2011. LLVM and clang: Advancing compiler technology. In Free and Open Source Developers European Meeting (FOSDEM’11).
[95]
Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, and John Hennessy. 1992. The DASH prototype: Implementation and performance. In International Symposium on Computer Architecture (ISCA’92). 92--103.
[96]
David Levinthal. 2009. Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors. Technical Report.
[97]
Tong Li, Dan P Baumberger, and Scott Hahn. 2009. Efficient and scalable multiprocessor fair scheduling using distributed weighted round-robin. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’09). 65--74.
[98]
Xu Liu and John Mellor-Crummey. 2014. A tool to analyze the performance of multithreaded programs on NUMA architectures. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’14). 259--272.
[99]
Henrik Löf and Sverker Holmgren. 2005. Affinity-on-next-touch: Increasing the performance of an industrial pde solver on a cc-NUMA System. In International Conference on Supercomputing (ICS’05). 387--392.
[100]
Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. 2005. Pin: Building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’05). 190--200.
[101]
Peter S. Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hallberg, Johan Hogberg, Fredrik Larsson, Andreas Moestedt, and Bengt Werner. 2002. Simics: A full system simulation platform. IEEE Computer 35, 2 (2002), 50--58.
[102]
Zoltan Majo and Thomas R. Gross. 2012. Matching memory access patterns and data placement for NUMA systems. In International Symposium on Code Generation and Optimization (CGO’12). 230--241.
[103]
Zoltan Majo and Thomas R. Gross. 2015. A library for portable and composable data locality optimizations for NUMA systems. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’15). 227--238.
[104]
Jaydeep Marathe and Frank Mueller. 2006. Hardware profile-guided automatic page placement for ccNUMA systems. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’06). 90--99.
[105]
Jaydeep Marathe, Vivek Thakkar, and Frank Mueller. 2010. Feedback-directed page placement for ccNUMA via hardware-generated memory traces. Journal of Parallel and Distributed Computing (JPDC’10) 70, 12 (2010), 1204--1219.
[106]
Michael Marchetti, Leonidas Kontothanassis, Ricardo Bianchini, and Michael L. Scott. 1995. Using simple page placement policies to reduce the cost of cache fills in coherent shared-memory systems. In International Parallel Processing Symposium (IPPS’95). 480--485.
[107]
Artur Mariano, Matthias Diener, Christian Bischof, and Philippe O. A. Navaux. 2016. Analyzing and improving memory access patterns of large irregular applications on NUMA machines. In Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP’16). 382--387.
[108]
Artur Mariano, Thijs Laarhoven, and Christian Bischof. 2015. Parallel (probable) lock-free hashsieve: A practical sieving algorithm for the SVP. In International Conference on Parallel Processing (ICPP). 590--599.
[109]
Collin McCurdy and Jeffrey Vetter. 2010. Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms. In IEEE International Symposium on Performance Analysis of Systems 8 Software (ISPASS’10). 87--96.
[110]
Guillaume Mercier and Jérôme Clet-Ortega. 2009. Towards an efficient process placement policy for MPI applications in multicore environments. In Recent Advances in Parallel Virtual Machine and Message Passing Interface. 104--115.
[111]
Guillaume Mercier and Emmanuel Jeannot. 2011. Improving MPI applications performance on multicore clusters with rank reordering. In European MPI Users’ Group Conference on Recent Advances in the Message Passing Interface (EuroMPI’11). 39--49.
[112]
Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, Constantine D. Polychronopoulos, Jesús Labarta, and Eduard Ayguadé. 2000a. User-level dynamic page migration for multiprogrammed shared-memory multiprocessors. In International Conference on Parallel Processing (ICPP’00). 95--103.
[113]
Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, Constantine D. Polychronopoulos, Jesús Labarta, and Eduard Ayguadé. 2000b. UPMLIB: A runtime system for tuning the memory performance of OpenMP programs on scalable shared-memory multiprocessors. In Languages, Compilers, and Run-Time Systems for Scalable Computers (LCR’00). 85--99.
[114]
Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, Constantine D. Polychronopoulos, Jesús Labarta, and Eduard Ayguadé. 2000c. Is data distribution necessary in OpenMP? In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’00).
[115]
Lisa Noordergraaf and Ruud van der Pas. 1999. Performance experiences on sun’s wildfire prototype. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’99). 1--16.
[116]
Takeshi Ogasawara. 2009. NUMA-aware memory manager with dominant-thread-based copying GC. In ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA’09). 377--390.
[117]
OpenMP Architecture Review Board. 2013. OpenMP Application Program Interface, Version 4.0. (2013).
[118]
Oracle. 2010. Solaris OS Tuning Features. Retrieved 2015-06-16 from http://docs.oracle.com/cd/E18659_01/html/821-1381/aewda.html
[119]
Michael Ott, Tobias Klug, Josef Weidendorfer, and Carsten Trinitis. 2008. autopin -- automated optimization of thread-to-core pinning on multicore systems. In Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG’08).
[120]
François Pellegrini. 1994. Static mapping by dual recursive bipartitioning of process and architecture graphs. In Scalable High-Performance Computing Conference (SHPCC’94). 486--493.
[121]
François Pellegrini. 2010. Scotch and Libscotch 5.1 User’s Guide. Technical Report.
[122]
Gregory F. Pfister, William C. Brantley, David A. George, Steve L. Harvey, Wally J. Kleinfelder, Kevin P. McAuliffe, Evelin S. Melton, V. Alan Norton, and Jodi Weiss. 1985. The IBM research parallel processor prototype (RP3): Introduction and architecture. In International Conference on Parallel Processing (ICPP’85). 764--771.
[123]
Guilherme Piccoli, Henrique N. Santos, Raphael E. Rodrigues, Christiane Pousa, Edson Borin, and Fernando M. Quintão Pereira. 2014. Compiler support for selective page migration in NUMA architectures. In International Conference on Parallel Architectures and Compilation Techniques (PACT’14). 369--380.
[124]
Petar Radojković, Vladimir Čakarević, Miquel Moretó, Javier Verdú, Alex Pajuelo, Francisco J. Cazorla, Mario Nemirovsky, and Mateo Valero. 2012. Optimal task assignment in multithreaded processors: A statistical approach. SIGARCH Computer Architecture News 40, 1 (2012), 235--248.
[125]
Petar Radojković, Vladimir Cakarević, Javier Verdú, Alex Pajuelo, Francisco J. Cazorla, Mario Nemirovsky, and Mateo Valero. 2013. Thread assignment of multithreaded network applications in multicore/multithreaded processors. IEEE Transactions on Parallel and Distributed Systems (TPDS) 24, 12 (2013), 2513--2525.
[126]
Christiane Pousa Ribeiro. 2011. Contributions on Memory Affinity Management for Hierarchical Shared Memory Multi-core Platforms. Ph.D. Dissertation. Univeristy of Grenoble.
[127]
Christiane Pousa Ribeiro, Marcio Castro, Jean-François Méhaut, and Alexandre Carissimi. 2010. Improving memory affinity of geophysics applications on NUMA platforms using Minas. In International Conference on High Performance Computing for Computational Science (VECPAR’10). 279--292.
[128]
Christiane Pousa Ribeiro, Jean-François Méhaut, Alexandre Carissimi, Marcio Castro, and Luiz Gustavo Fernandes. 2009. Memory affinity for hierarchical shared memory multiprocessors. In International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’09). 59--66.
[129]
Eduardo R. Rodrigues, Felipe L. Madruga, Philippe O. A. Navaux, and Jairo Panetta. 2009. Multi-core aware process mapping and its impact on communication overhead of parallel applications. In IEEE Symposium on Computers and Communications (ISCC’09). 811--817.
[130]
John Shalf, Sudip Dosanjh, and John Morrison. 2010. Exascale computing technology challenges. In High Performance Computing for Computational Science (VECPAR’10). 1--25.
[131]
Jaswinder Pal Singh, Truman Joe, Anoop Gupta, and John L. Hennessy. 1993. An empirical comparison of the Kendall square research KSR-1 and stanford DASH multiprocessors. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’93). 214--225.
[132]
Mohsen Soryani, Morteza Analoui, and Ghobad Zarrinchian. 2013. Improving inter-node communications in multi-core clusters using a contention-free process mapping algorithm. Journal of Supercomputing 66, 1 (apr 2013), 488--513.
[133]
Brad Spengler. 2003. PaX: The Guaranteed End of Arbitrary Code Execution. Technical Report.
[134]
David Tam, Reza Azimi, and Michael Stumm. 2007. Thread clustering: Sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In ACM SIGOPS/EuroSys European Conference on Computer Systems. 47--58.
[135]
Christian Terboven, Dieter an Mey, Dirk Schmidl, Henry Jin, and Thomas Reichstein. 2008. Data and thread affinity in OpenMP programs. In Workshop on Memory Access on Future Processors: A Solved Problem? (MAW’08). 377--384.
[136]
The Open MPI project. 2009. Portable Linux Processor Affinity (PLPA). Retrieved 2015-06-08 from https://www.open-mpi.org/projects/plpa/.
[137]
The Open MPI Project. 2013. mpirun(1) man page (version 1.6.4). Retrieved 2016-02-08 from http://www.open-mpi.org/doc/v1.6/man1/mpirun.1.php#sect9.
[138]
Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2007. Building portable thread schedulers for hierarchical multiprocessors: The bubblesched framework. In Euro-Par. 42--51.
[139]
Mustafa M. Tikir and Jeffrey K. Hollingsworth. 2004. Using hardware counters to automatically improve memory performance. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’04).
[140]
Mustafa M. Tikir and Jeffrey K. Hollingsworth. 2008. Hardware monitors for dynamic page migration. Journal of Parallel and Distributed Computing (JPDC) 68, 9 (Sep 2008), 1186--1200.
[141]
Jesper Larsson Träff. 2002. Implementing the MPI process topology mechanism. In ACM/IEEE Conference on Supercomputing (SC’02).
[142]
François Trahay, François Rue, Mathieu Faverge, Yutaka Ishikawa, Raymond Namyst, and Jack Dongarra. 2011. EZTrace: A generic framework for performance analysis. In International Symposium on Cluster, Cloud and Grid Computing (CCGrid’11). 618--619.
[143]
Ruud van der Pas. 2009. Getting OpenMP Up To Speed. Technical Report.
[144]
Goran Velkoski, Sasko Ristov, and Marjan Gusev. 2013. Loosely or tightly coupled affinity for matrix-vector multiplication. In International Convention on Information & Communication Technology Electronics & Microelectronics (MIPRO’13). 228--233.
[145]
Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. 1996b. Operating system support for improving data locality on CC-NUMA compute servers. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’96). 279--289.
[146]
Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. 1996a. OS Support for Improving Data Locality on CC-NUMA Compute Servers. Technical Report.
[147]
Carlos Villavieja, Vasileios Karakostas, Lluis Vilanova, Yoav Etsion, Alex Ramirez, Avi Mendelson, Nacho Navarro, Adrian Cristal, and Osman S. Unsal. 2011. DiDi: Mitigating the performance impact of TLB shootdowns using a shared TLB directory. In International Conference on Parallel Architectures and Compilation Techniques (PACT’11). 340--349.
[148]
Skef Wholey. 1991. Automatic data mapping for distributed-memory parallel computers. In International Conference on Supercomputing (ICS’91). 25--34.
[149]
Chee Siang Wong, Ian Tan, Rosalind Deena Kumari, and Fun Wey. 2008. Towards achieving fairness in the Linux scheduler. ACM SIGOPS Operating Systems Review 42, 5 (Jul 2008), 34--43.
[150]
Yang You, James Demmel, Kenneth Czechowski, Le Song, and Richard Vuduc. 2015. CA-SVM: Communication-avoiding support vector machines on distributed systems. In IEEE International Parallel and Distributed Processing Symposium (IPDPS’15). 847--859.
[151]
Jidong Zhai, Tianwei Sheng, and Jiangzhou He. 2011. Efficiently acquiring communication traces for large-scale parallel applications. IEEE Transactions on Parallel and Distributed Systems (TPDS) 22, 11 (2011), 1862--1870.
[152]
Xing Zhou, Wenguang Chen, and Weimin Zheng. 2009. Cache sharing management for performance fairness in chip multiprocessors. In International Conference on Parallel Architectures and Compilation Techniques (PACT’09). 384--393.
[153]
Sergey Zhuravlev, Juan Carlos Saez, Sergey Blagodurov, Alexandra Fedorova, and Manuel Prieto. 2012. Survey of scheduling techniques for addressing shared resources in multicore processors. ACM Computing Surveys (CSUR) 45, 1 (2012), 1--32.
[154]
Dimitrios Ziakas, Allen Baum, Robert A. Maddox, and Robert J. Safranek. 2010. Intel QuickPath Interconnect - architectural features supporting scalable system architectures. In Symposium on High Performance Interconnects (HOTI’10). 1--6.

Published In

ACM Computing Surveys  Volume 49, Issue 4
December 2017
666 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3022634
Editor: Sartaj Sahni
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 December 2016
Accepted: 01 October 2016
Revised: 01 September 2016
Received: 01 June 2016
Published in CSUR Volume 49, Issue 4

Author Tags

  1. NUMA
  2. Survey
  3. cache memories
  4. communication
  5. data mapping
  6. shared memory
  7. thread mapping

Qualifiers

  • Survey
  • Research
  • Refereed

Funding Sources

  • MCTI/RNP Brazil under the HPC4E project
  • CAPES
