survey

Affinity-Based Thread and Data Mapping in Shared Memory Systems

Authors: Matthias Diener, Eduardo H. M. Cruz, Marco A. Z. Alves, Philippe O. A. Navaux, Israel Koren

ACM Computing Surveys (CSUR), Volume 49, Issue 4

Article No.: 64, Pages 1 - 38

https://doi.org/10.1145/3006385

Published: 05 December 2016

Abstract

Shared memory architectures have recently experienced a large increase in thread-level parallelism, leading to complex memory hierarchies with multiple cache memory levels and memory controllers. These designs exhibit Non-Uniform Memory Access (NUMA) behavior, in which the performance and energy consumption of a memory access depend on where the data is located in the memory hierarchy. Accesses to local caches or memory controllers are generally more efficient than accesses to remote ones. A common way to improve the locality and balance of memory accesses is to map threads to cores and data to memory controllers based on the affinity between threads and data. Such mapping techniques can operate at different hardware and software levels, which affects their complexity, their applicability, and the performance and energy gains they achieve. In this article, we introduce a taxonomy to classify different mapping mechanisms and provide a comprehensive overview of existing solutions.
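To make the idea of affinity-based thread mapping concrete, the following is a minimal sketch and not a method taken from the article. It assumes a symmetric communication matrix `comm` (a hypothetical input in which `comm[i][j]` counts the accesses that threads `i` and `j` make to shared data) and a machine in which cores `2k` and `2k+1` share a cache. The function name `greedy_pair_mapping` and the greedy pairing strategy are illustrative assumptions only.

```python
from itertools import combinations


def greedy_pair_mapping(comm):
    """Map threads to cores so that heavily communicating pairs share a cache.

    Assumption (hypothetical topology): cores 2k and 2k+1 share a cache.
    comm is a symmetric matrix; comm[i][j] is the amount of communication
    between threads i and j.
    """
    n = len(comm)
    # Consider thread pairs from most to least communication.
    pairs = sorted(
        combinations(range(n), 2),
        key=lambda p: comm[p[0]][p[1]],
        reverse=True,
    )
    mapping = {}
    next_core = 0
    for a, b in pairs:
        if a in mapping or b in mapping:
            continue  # one of the two threads is already placed
        mapping[a] = next_core
        mapping[b] = next_core + 1
        next_core += 2
    # Place a leftover thread, which exists only when n is odd.
    for t in range(n):
        if t not in mapping:
            mapping[t] = next_core
            next_core += 1
    return mapping


# Threads 0 and 2 communicate heavily, as do threads 1 and 3.
comm = [
    [0, 1, 10, 1],
    [1, 0, 1, 8],
    [10, 1, 0, 1],
    [1, 8, 1, 0],
]
mapping = greedy_pair_mapping(comm)
```

In this example, threads 0 and 2 end up on one pair of cache-sharing cores and threads 1 and 3 on the other. A real mechanism would also have to account for deeper hierarchies, load balance, and data placement, all of which this sketch ignores.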

References

[1]

L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. 2010. HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience 22, 6 (2010), 685--701.

[2]

Ahmad Anbar, Olivier Serres, Engin Kayraklioglu, Abdel-Hameed A. Badawy, and Tarek El-Ghazawi. 2016. Exploiting hierarchical locality in deep parallel architectures. ACM Transactions on Architecture and Code Optimization (TACO) 13, 2 (2016), 1--25.

Digital Library

[3]

Hideo Aochi, Thomas Ulrich, Ariane Ducellier, Fabrice Dupros, and David Michea. 2013. Finite difference simulations of seismic wave propagation for understanding earthquake physics and predicting ground motions: Advances and challenges. Journal of Physics: Conference Series 454, 1 (Aug 2013), 012010.

[4]

Argonne National Laboratory. 2014. Using the Hydra Process Manager. Retrieved 2015-06-08 from https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager.

[5]

ARM Limited. 2013. big.LITTLE Technology: The Future of Mobile. Technical Report. Retrieved 2016-09-01 from https://www.arm.com/files/pdf/big_LITTLE_Technology_the_Futue_of_Mobile.pdf.

[6]

Manu Awasthi, David W. Nellans, Kshitij Sudan, Rajeev Balasubramonian, and Al Davis. 2010. Handling the problems and opportunities posed by multiple on-chip memory controllers. In International Conference on Parallel Architectures and Compilation Techniques (PACT). 319--330.

Digital Library

[7]

Reza Azimi, David K. Tam, Livio Soares, and Michael Stumm. 2009. Enhancing operating system support for multicore processors by using hardware performance monitoring. ACM SIGOPS Operating Systems Review 43, 2 (Apr 2009), 56--65.

Digital Library

[8]

David H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS parallel benchmarks. International Journal of Supercomputer Applications 5, 3 (1991), 66--73.

Digital Library

[9]

G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz. 2014. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica 23, May (2014), 1--155.

[10]

Nick Barrow-Williams, Christian Fensch, and Simon Moore. 2009. A communication characterisation of splash-2 and parsec. In IEEE International Symposium on Workload Characterization (IISWC’09). 86--97.

Digital Library

[11]

Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient virtual memory for big memory servers. In International Symposium on Computer Architecture (ISCA’13). 237--248.

Digital Library

[12]

David Beniamine, Matthias Diener, Guillaume Huard, and Philippe O. A. Navaux. 2015. TABARNAC: Visualizing and resolving memory access issues on NUMA architectures. In Workshop on Visual Performance Analysis (VPA’15).

Digital Library

[13]

Abhinav Bhatele. 2010. Automating Topology Aware Mapping for Supercomputers. Ph.D. Dissertation. University of Illinois at Urbana-Champaign.

Digital Library

[14]

Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In International Conference on Parallel Architectures and Compilation Techniques (PACT’08). 72--81.

Digital Library

[15]

John Bircsak, Peter Craig, RaeLyn Crowell, Zarka Cvetanovic, Jonathan Harris, C. Alexander Nelson, and Carl D. Offner. 2000. Extending OpenMP for NUMA machines. In ACM/IEEE Conference on Supercomputing (SC’00). 163--181.

Digital Library

[16]

Sergey Blagodurov, Sergey Zhuravlev, Mohammad Dashti, and Alexandra Fedorova. 2011. A case for NUMA-aware contention management on multicore systems. In USENIX Annual Technical Conference (ATC’11). 557--571.

Digital Library

[17]

Robert D. Blumofe and Charles E. Leiserson. 1994. Scheduling multithreaded computations by work stealing. In Symposium on Foundations of Computer Science (FOCS’94). 1--29.

Digital Library

[18]

Jacques E. Boillat and Peter G. Kropf. 1990. A fast distributed mapping algorithm. In Joint International Conference on Vector and Parallel Processing (CONPAR 90 -- VAPP IV). 405--416.

Digital Library

[19]

William J. Bolosky and Michael L. Scott. 1992. Evaluation of multiprocessor memory systems using off-line optimal behavior. Journal of Parallel and Distributed Computing (JPDC) 15, 4 (1992), 382--398.

[20]

William J. Bolosky, Michael L. Scott, Robert P. Fitzgerald, Robert J. Fowler, and Alan L. Cox. 1991. NUMA policies and their relation to memory architecture. ACM SIGARCH Computer Architecture News 19, 2 (1991), 212--221.

Digital Library

[21]

Barbara Brandfass, Thomas Alrutz, and Thomas Gerhold. 2013. Rank reordering for MPI communication optimization. Computers 8 Fluids 80, July (2013), 372--380.

[22]

Timothy Brecht. 1993. On the importance of parallel application placement in NUMA multiprocessors. In Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS). 1--18.

Digital Library

[23]

David Brooks, Margaret Martonosi, John-David Wellman, and Pradip Bose. 2000. Power-performance modeling and tradeoff analysis for a high end microprocessor. In International Workshop on Power-Aware Computer Systems (PACS’00). 126--136.

Digital Library

[24]

François Broquedis, Olivier Aumage, Brice Goglin, Samuel Thibault, Pierre-André Wacrenier, and Raymond Namyst. 2010a. Structuring the execution of OpenMP applications for multicore architectures. In IEEE International Parallel 8 Distributed Processing Symposium (IPDPS’10). 1--10.

[25]

Franois Broquedis, Jerome Clet-Ortega, Stephanie Moreaud, Nathalie Furmento, Brice Goglin, Guillaume Mercier, Samuel Thibault, and Raymond Namyst. 2010b. hwloc: A generic framework for managing hardware affinities in HPC applications. In Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP’10). 180--186.

Digital Library

[26]

François Broquedis, Nathalie Furmento, Brice Goglin, Pierre-André Wacrenier, and Raymond Namyst. 2010c. ForestGOMP: An efficient OpenMP environment for NUMA architectures. International Journal of Parallel Programming 38, 5--6 (2010), 418--439.

[27]

Mats Brorsson. 1989. Performance Impact of Code and Data Placement on the IBM RP3. Technical Report.

[28]

Bryan Buck and Jeffrey K. Hollingsworth. 2000. An API for runtime code patching. International Journal of High Performance Computing Applications (IJHPCA) 14, 4 (2000), 317--329.

Digital Library

[29]

J. Mark Bull and Chris Johnson. 2002. Data distribution, migration and replication on a cc-NUMA architecture. In European Workshop on OpenMP (EWOMP’02). 1--5.

[30]

Anthony Chan, William Gropp, and Ewing Lusk. 1998. User’s Guide for MPE Extensions for MPI Programs. Technical Report.

[31]

Rohit Chandra, Scott Devine, Ben Verghese, Anoop Gupta, and Mendel Rosenblum. 1994. Scheduling and page migration for multiprocessor compute servers. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’94). 12--24.

Digital Library

[32]

Hu Chen, Wenguang Chen, Jian Huang, Bob Robert, and H. Kuhn. 2006. MPIPP: An automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’06). 353--360.

Digital Library

[33]

Zeshan Chishti, Michael D. Powell, and T. N. Vijaykumar. 2005. Optimizing replication, communication, and capacity allocation in CMPs. ACM SIGARCH Computer Architecture News 33, 2 (May 2005), 357--368.

Digital Library

[34]

Theofanis Constantinou, Yiannakis Sazeides, Pierre Michaud, Damien Fetis, and Andre Seznec. 2005. Performance implications of single thread migration on a chip multi-core. ACM SIGARCH Computer Architecture News 33, 4 (2005), 80--91.

Digital Library

[35]

Pat Conway. 2007. The AMD opteron northbridge architecture. IEEE Micro 27, 2 (2007), 10--21.

Digital Library

[36]

Julita Corbalan, Xavier Martorell, and Jesus Labarta. 2003. Evaluation of the memory page migration influence in the system performance: The case of the SGI O2000. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC’03). 121--129.

Digital Library

[37]

Jonathan Corbet. 2012a. AutoNUMA: The Other Approach to NUMA Scheduling. Retrieved 2015-06-08 from http://lwn.net/Articles/488709/

[38]

Jonathan Corbet. 2012b. Toward Better NUMA Scheduling. Retrieved 2015-06-08 from http://lwn.net/Articles/486858/.

[39]

Eduardo H. M. Cruz. 2012. Dynamic Detection of the Communication Pattern in Shared Memory Environments for Thread Mapping. Master’s thesis.

[40]

Eduardo H. M. Cruz, Marco A. Z. Alves, Alexandre Carissimi, Philippe O. A. Navaux, Christiane Pousa Ribeiro, and Jean-François Méhaut. 2011. Using memory access traces to map threads and data on hierarchical multi-core platforms. In IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum. 551--558.

Digital Library

[41]

Eduardo H. M. Cruz, Matthias Diener, Marco A. Z. Alves, and Philippe O. A. Navaux. 2014a. Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols. Journal of Parallel and Distributed Computing (JPDC) 74, 3 (Mar 2014), 2215--2228.

Digital Library

[42]

Eduardo H. M. Cruz, Matthias Diener, Marco A. Z. Alves, Laércio L. Pilla, and Philippe O. A. Navaux. 2014b. Optimizing memory locality using a locality-aware page table. In International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’14). 198--205.

Digital Library

[43]

Eduardo H. M. Cruz, Matthias Diener, and Philippe O. A. Navaux. 2012. Using the translation lookaside buffer to map threads in parallel applications based on shared memory. In IEEE International Parallel 8 Distributed Processing Symposium (IPDPS’12). 532--543.

Digital Library

[44]

Eduardo H. M. Cruz, Matthias Diener, and Philippe O. A. Navaux. 2015a. Communication-aware thread mapping using the translation lookaside buffer. Concurrency Computation: Practice and Experience 22, 6 (2015), 685--701.

[45]

Eduardo H. M. Cruz, Matthias Diener, Laércio L. Pilla, and Philippe O. A. Navaux. 2015b. An efficient algorithm for communication-based task mapping. In International Conference on Parallel, Distributed, and Network-Based Processing (PDP’15). 207--214.

Digital Library

[46]

William J. Dally. 2010. GPU Computing to Exascale and Beyond. Technical Report.

[47]

Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi. 2013. Application-to-core mapping policies to reduce memory system interference in multi-core systems. In International Symposium on High Performance Computer Architecture (HPCA’13). 107--118.

Digital Library

[48]

Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quéma, and Mark Roth. 2013. Traffic management: A holistic approach to memory placement on NUMA systems. In Architectural Support for Programming Languages and Operating Systems (ASPLOS’13). 381--393.

Digital Library

[49]

Karen D. Devine, Erik G. Boman, Robert T. Heaphy, Rob H. Bisseling, and Umit V. Catalyurek. 2006. Parallel hypergraph partitioning for scientific computing. In IEEE International Parallel 8 Distributed Processing Symposium (IPDPS’06). 124--133.

Digital Library

[50]

Matthias Diener. 2010. Evaluating Thread Placement Improvements in Multi-core Architectures. Master’s thesis. Berlin Institute of Technology.

[51]

Matthias Diener. 2015. Automatic Task and Data Mapping in Shared Memory Architectures. Ph.D. Dissertation. Federal University of Rio Grande do Sul.

[52]

Matthias Diener, Eduardo H. M. Cruz, Marco A. Z. Alves, Mohammad S. Alhakeem, and Philippe O. A. Navaux. 2015. Locality and balance for communication-aware thread mapping in multicore systems. In Euro-Par. 196--208.

[53]

Matthias Diener, Eduardo H. M. Cruz, Marco A. Z. Alves, and Philippe O. A. Navaux. 2016. Communication in shared memory: Concepts, definitions, and efficient detection. In Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP’16). 151--158.

[54]

Matthias Diener, Eduardo H. M. Cruz, Marco A. Z. Alves, Philippe O. A. Navaux, Anselm Busse, and Hans-Ulrich Heiss. 2015. Kernel-based thread and data mapping for improved memory affinity. IEEE Transactions on Parallel and Distributed Systems (TPDS) 26, X (2015), 1--14.

[55]

Matthias Diener, Eduardo H. M. Cruz, and Philippe O. A. Navaux. 2013. Communication-based mapping using shared pages. In IEEE International Parallel 8 Distributed Processing Symposium (IPDPS’13). 700--711.

Digital Library

[56]

Matthias Diener, Eduardo H. M. Cruz, and Philippe O. A. Navaux. 2015. Locality vs. balance: Exploring data mapping policies on NUMA systems. In International Conference on Parallel, Distributed, and Network-Based Processing (PDP’15). 9--16.

Digital Library

[57]

Matthias Diener, Eduardo H. M. Cruz, Philippe O. A. Navaux, Anselm Busse, and Hans-Ulrich Heiß. 2014. kMAF: Automatic kernel-level management of thread and data affinity. In International Conference on Parallel Architectures and Compilation Techniques (PACT’14). 277--288.

Digital Library

[58]

Matthias Diener, Eduardo H. M. Cruz, Philippe O. A. Navaux, Anselm Busse, and Hans-Ulrich Heiß. 2015a. Communication-aware process and thread mapping using online communication detection. Parallel Computing 43, March (2015), 43--63.

Digital Library

[59]

Matthias Diener, Eduardo H. M. Cruz, Laércio L. Pilla, Fabrice Dupros, and Philippe O. A. Navaux. 2015b. Characterizing communication and page usage of parallel applications for thread and data mapping. Performance Evaluation 88-89, June (2015), 18--36.

Digital Library

[60]

Matthias Diener, Felipe L. Madruga, Eduardo R. Rodrigues, Marco A. Z. Alves, and Philippe O. A. Navaux. 2010. Evaluating thread placement based on memory access patterns for multi-core processors. In IEEE International Conference on High Performance Computing and Communications (HPCC). 491--496.

Digital Library

[61]

Wei Ding, Yuanrui Zhang, Mahmut Kandemir, Jithendra Srinivas, and Praveen Yedlapalli. 2013. Locality-aware mapping and scheduling for multicores. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO’13). 1--12.

Digital Library

[62]

Ulrich Drepper. 2007. What Every Programmer Should Know About Memory. Technical Report 4. Red Hat, Inc. 114 pages.

[63]

Paul J. Drongowski. 2007. Instruction-Based Sampling: A New Performance Analysis Technique for AMD Family 10h Processors. Technical Report.

[64]

Fabrice Dupros, Hideo Aochi, Ariane Ducellier, Dimitri Komatitsch, and Jean Roman. 2008. Exploiting intensive multithreading for the efficient simulation of 3D seismic wave propagation. In IEEE International Conference on Computational Science and Engineering (CSE’08). 253--260.

Digital Library

[65]

Fabrice Dupros, Christiane Pousa, Alexandre Carissimi, and Jean-François Méhaut. 2010. Parallel simulations of seismic wave propagation on NUMA architectures. In Parallel Computing: From Multicores and GPU’s to Petascale. 67--74.

[66]

Alexandre E. Eichenberger, Christian Terboven, Michael Wong, and Dieter An Mey. 2012. The design of OpenMP thread affinity. Lecture Notes in Computer Science 7312 LNCS (2012), 15--28.

Digital Library

[67]

Steven Frank, Henry Burkhardt III, and James Rothnie. 1993. The KSR1: Bridging the gap between shared memory and MPPs. In IEEE Compcon. 285--294.

[68]

S. R. Freitas, K. M. Longo, M. A. F. Silva Dias, R. Chatfield, P. Silva Dias, P. Artaxo, M. O. Andreae, G. Grell, L. F. Rodrigues, A. Fazenda, and J. Panetta. 2009. The coupled aerosol and tracer transport model to the brazilian developments on the regional atmospheric modeling system (CATT-BRAMS) part 1: Model description and evaluation. Atmospheric Chemistry and Physics 9, 8 (2009), 2843--2861.

[69]

Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. 2004. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Recent Advances in Parallel Virtual Machine and Message Passing Interface (PVMMPI’04). 97--104.

[70]

Fabien Gaud, Baptiste Lepers, Justin Funston, Mohammad Dashti, Alexandra Fedorova, Vivien Quéma, Renaud Lachaize, and Mark Roth. 2015. Challenges of memory management on modern NUMA systems. Commununications of the ACM 58, 12 (2015), 59--66.

Digital Library

[71]

Ilaria Di Gennaro, Alessandro Pellegrini, and Francesco Quaglia. 2016. OS-based NUMA optimization: Tackling the case of truly multi-thread applications with non-partitioned virtual page accesses. In IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGRID’16). 291--300.

Digital Library

[72]

Alfredo Giménez, Todd Gamblin, Barry Rountree, Abhinav Bhatele, Ilir Jusufi, Peer-Timo Bremer, and Bernd Hamann. 2014. Dissecting on-node memory access performance: A semantic approach. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14). 166--176.

Digital Library

[73]

Alfredo Giménez, Ilir Jusufi, Abhinav Bhatele, Todd Gamblin, Martin Schulz, Peer-Timo Bremer, and Bernd Hamann. 2015. MemAxes: Interactive Visual Analysis of Memory Access Data. Technical Report.

[74]

Roland Glantz, Henning Meyerhenke, and Alexander Noe. 2015. Algorithms for mapping parallel processes onto grid and torus architectures. In International Conference on Parallel, Distributed, and Network-Based Processing (PDP’15). 236--243.

Digital Library

[75]

Brice Goglin and Nathalie Furmento. 2009. Enabling high-performance memory migration for multithreaded applications on linux. In International Symposium on Parallel 8 Distributed Processing (IPDPS’09).

Digital Library

[76]

William Gropp. 2002. MPICH2: A new start for MPI implementations. In Recent Advances in Parallel Virtual Machine and Message Passing Interface.

Digital Library

[77]

Erik Hagersten and Michael Koster. 1999. WildFire: A scalable path for SMPs. In International Symposium on High Performance Computer Architecture (HPCA’99). 172.

Digital Library

[78]

Joshua Hursey, Jeffrey M. Squyres, and Terry Dontje. 2011. Locality-aware parallel process mapping for multi-core HPC systems. In IEEE International Conference on Cluster Computing (CLUSTER’11). 527--531.

Digital Library

[79]

Intel. 2012. Using KMP_AFFINITY to create OpenMP thread mapping to OS proc IDs. Retrieved 2015-06-08 from https://software.intel.com/en-us/articles/using-kmp-affinity-to-create-openmp-thread-mapping-to-os-proc-ids.

[80]

Intel. 2013. Intel Trace Analyzer and Collector. Retrieved from http://software.intel.com/en-us/intel-trace-analyzer.

[81]

Satoshi Ito, Kazuya Goto, and Kenji Ono. 2013. Automatically optimized core mapping to subdomains of domain decomposition method on multicore parallel environments. Computers 8 Fluids 80 (jul 2013), 88--93.

[82]

Emmanuel Jeannot and Guillaume Mercier. 2010. Near-optimal placement of MPI processes on hierarchical NUMA architectures. In Euro-Par Parallel Processing. 199--210.

Digital Library

[83]

Emmanuel Jeannot, Guillaume Mercier, and François Tessier. 2014. Process placement in multicore clusters: Algorithmic issues and practical techniques. IEEE Transactions on Parallel and Distributed Systems 25, 4 (Apr 2014), 993--1002.

Digital Library

[84]

H. Jin, M. Frumkin, and J. Yan. 1999. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical Report October. NASA.

[85]

C Karlsson, T Davies, and Zizhong Chen. 2012. Optimizing process-to-core mappings for application level multi-dimensional MPI communications. In IEEE International Conference on Cluster Computing (CLUSTER’12). 486--494.

Digital Library

[86]

George Karypis and Vipin Kumar. 1996. Parallel multilevel graph partitioning. In International Parallel Processing Symposium (IPPS’96). 314--319.

Digital Library

[87]

George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20, 1 (Jan 1998), 359--392.

Digital Library

[88]

Andi Kleen. 2004. An NUMA API for Linux. Technical Report. Retrieved rom http://andikleen.de/numaapi3.pdf.

[89]

Tobias Klug, Michael Ott, Josef Weidendorfer, and Carsten Trinitis. 2008. autopin -- automated optimization of thread-to-core pinning on multicore systems. In Transactions on High-Performance Embedded Architectures and Compilers (HiPEAC). 219--235.

Digital Library

[90]

Renaud Lachaize, Baptiste Lepers, and Vivien Quéma. 2012. MemProf: A memory profiler for NUMA multicore systems. In USENIX Annual Technical Conference (ATC’12). 53--64.

Digital Library

[91]

Stefan Lankes, Boris Bierbaum, and Thomas Bemmerl. 2010. Affinity-on-next-touch: An extension to the linux kernel for NUMA architectures. Lecture Notes in Computer Science 6067 LNCS, PART 1 (2010), 576--585.

Digital Library

[92]

Richard P. LaRowe, Mark A. Holliday, and Carla Schlatter Ellis. 1992. An analysis of dynamic page placement on a NUMA multiprocessor. ACM SIGMETRICS Performance Evaluation Review 20, 1 (1992), 23--34.

Digital Library

[93]

Richard P. LaRowe Jr. 1991. Page Placement For Non-Uniform Memory Access Time (NUMA) Shared Memory Multiprocessors. Ph.D. Dissertation. Duke University.

Digital Library

[94]

Chris Lattner. 2011. LLVM and clang: Advancing compiler technology. In Free and Open Source Developers European Meeting (FOSDEM’11).

[95]

Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, and John Hennessy. 1992. The DASH prototype: Implementation and performance. In International Symposium on Computer Architecture (ISCA’92). 92--103.

Digital Library

[96]

David Levinthal. 2009. Performance Analysis Guide for Intel^® Core™ i7 Processor and Intel^® Xeon™ 5500 processors. Technical Report.

[97]

Tong Li, Dan P Baumberger, and Scott Hahn. 2009. Efficient and scalable multiprocessor fair scheduling using distributed weighted round-robin. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’09). 65--74.

Digital Library

[98]

Xu Liu and John Mellor-Crummey. 2014. A tool to analyze the performance of multithreaded programs on NUMA architectures. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’14). 259--272.

Digital Library

[99]

Henrik Löf and Sverker Holmgren. 2005. Affinity-on-next-touch: Increasing the performance of an industrial pde solver on a cc-NUMA System. In International Conference on Supercomputing (ICS’05). 387--392.

Digital Library

[100]

Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. 2005. Pin: Building customized program analysis tools with dynamic instrumentation. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’05). 190--200.

Digital Library

[101]

Peter S. Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hallberg, Johan Hogberg, Fredrik Larsson, Andreas Moestedt, and Bengt Werner. 2002. Simics: A full system simulation platform. IEEE Computer 35, 2 (2002), 50--58.

Digital Library

[102]

Zoltan Majo and Thomas R. Gross. 2012. Matching memory access patterns and data placement for NUMA systems. In International Symposium on Code Generation and Optimization (CGO’12). 230--241.

Digital Library

[103]

Zoltan Majo and Thomas R. Gross. 2015. A library for portable and composable data locality optimizations for NUMA systems. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’15). 227--238.

Digital Library

[104]

Jaydeep Marathe and Frank Mueller. 2006. Hardware profile-guided automatic page placement for ccNUMA systems. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’06). 90--99.

Digital Library

[105]

Jaydeep Marathe, Vivek Thakkar, and Frank Mueller. 2010. Feedback-directed page placement for ccNUMA via hardware-generated memory traces. Journal of Parallel and Distributed Computing (JPDC’10) 70, 12 (2010), 1204--1219.

Digital Library

[106]

Michael Marchetti, Leonidas Kontothanassis, Ricardo Bianchini, and Michael L. Scott. 1995. Using simple page placement policies to reduce the cost of cache fills in coherent shared-memory systems. In International Parallel Processing Symposium (IPPS’95). 480--485.

Digital Library

[107]

Artur Mariano, Matthias Diener, Christian Bischof, and Philippe O. A. Navaux. 2016. Analyzing and improving memory access patterns of large irregular applications on NUMA machines. In Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP’16). 382--387.

[108]

Artur Mariano, Thijs Laarhoven, and Christian Bischof. 2015. Parallel (probable) lock-free hashsieve: A practical sieving algorithm for the SVP. In International Conference on Parallel Processing (ICPP). 590--599.

Digital Library

[109]

Collin McCurdy and Jeffrey Vetter. 2010. Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms. In IEEE International Symposium on Performance Analysis of Systems 8 Software (ISPASS’10). 87--96.

[110]

Guillaume Mercier and Jérôme Clet-Ortega. 2009. Towards an efficient process placement policy for MPI applications in multicore environments. In Recent Advances in Parallel Virtual Machine and Message Passing Interface. 104--115.

Digital Library

[111]

Guillaume Mercier and Emmanuel Jeannot. 2011. Improving MPI applications performance on multicore clusters with rank reordering. In European MPI Users’ Group Conference on Recent Advances in the Message Passing Interface (EuroMPI’11). 39--49.

Digital Library

[112]

Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, Constantine D. Polychronopoulos, Jesús Labarta, and Eduard Ayguadé. 2000a. User-level dynamic page migration for multiprogrammed shared-memory multiprocessors. In International Conference on Parallel Processing (ICPP’00). 95--103.

Digital Library

[113]

Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, Constantine D. Polychronopoulos, Jesús Labarta, and Eduard Ayguadé. 2000b. UPMLIB: A runtime system for tuning the memory performance of OpenMP programs on scalable shared-memory multiprocessors. In Languages, Compilers, and Run-Time Systems for Scalable Computers (LCR’00). 85--99.

Digital Library

[114]

Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, Constantine D. Polychronopoulos, Jesús Labarta, and Eduard Ayguadé. 2000c. Is data distribution necessary in OpenMP? In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’00).

Digital Library

[115]

Lisa Noordergraaf and Ruud van der Pas. 1999. Performance experiences on sun’s wildfire prototype. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’99). 1--16.

Digital Library

[116]

Takeshi Ogasawara. 2009. NUMA-aware memory manager with dominant-thread-based copying GC. In ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA’09). 377--390.

Digital Library

[117]

OpenMP Architecture Review Board. 2013. OpenMP Application Program Interface, Version 4.0. (2013).

[118]

Oracle. 2010. Solaris OS Tuning Features. Retrieved 2015-06-16 from http://docs.oracle.com/cd/E18659_01/html/821-1381/aewda.html

[119]

Michael Ott, Tobias Klug, Josef Weidendorfer, and Carsten Trinitis. 2008. autopin -- automated optimization of thread-to-core pinning on multicore systems. In Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG’08).

[120]

François Pellegrini. 1994. Static mapping by dual recursive bipartitioning of process and architecture graphs. In Scalable High-Performance Computing Conference (SHPCC’94). 486--493.

[121]

François Pellegrini. 2010. Scotch and Libscotch 5.1 User’s Guide. Technical Report.

[122]

Gregory F. Pfister, William C. Brantley, David A. George, Steve L. Harvey, Wally J. Kleinfelder, Kevin P. McAuliffe, Evelin S. Melton, V. Alan Norton, and Jodi Weiss. 1985. The IBM research parallel processor prototype (RP3): Introduction and architecture. In International Conference on Parallel Processing (ICPP’85). 764--771.

[123]

Guilherme Piccoli, Henrique N. Santos, Raphael E. Rodrigues, Christiane Pousa, Edson Borin, and Fernando M. Quintão Pereira. 2014. Compiler support for selective page migration in NUMA architectures. In International Conference on Parallel Architectures and Compilation Techniques (PACT’14). 369--380.

Digital Library

[124]

Petar Radojković, Vladimir Čakarević, Miquel Moretó, Javier Verdú, Alex Pajuelo, Francisco J. Cazorla, Mario Nemirovsky, and Mateo Valero. 2012. Optimal task assignment in multithreaded processors: A statistical approach. SIGARCH Computer Architecture News 40, 1 (2012), 235--248.

Digital Library

[125]

Petar Radojković, Vladimir Cakarević, Javier Verdú, Alex Pajuelo, Francisco J. Cazorla, Mario Nemirovsky, and Mateo Valero. 2013. Thread assignment of multithreaded network applications in multicore/multithreaded processors. IEEE Transactions on Parallel and Distributed Systems (TPDS) 24, 12 (2013), 2513--2525.

Digital Library

[126]

Christiane Pousa Ribeiro. 2011. Contributions on Memory Affinity Management for Hierarchical Shared Memory Multi-core Platforms. Ph.D. Dissertation. Univeristy of Grenoble.

[127]

Christiane Pousa Ribeiro, Marcio Castro, Jean-François Méhaut, and Alexandre Carissimi. 2010. Improving memory affinity of geophysics applications on NUMA platforms using Minas. In International Conference on High Performance Computing for Computational Science (VECPAR’10). 279--292.

[128]

Christiane Pousa Ribeiro, Jean-François Méhaut, Alexandre Carissimi, Marcio Castro, and Luiz Gustavo Fernandes. 2009. Memory affinity for hierarchical shared memory multiprocessors. In International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’09). 59--66.

Digital Library

[129]

Eduardo R. Rodrigues, Felipe L. Madruga, Philippe O. A. Navaux, and Jairo Panetta. 2009. Multi-core aware process mapping and its impact on communication overhead of parallel applications. In IEEE Symposium on Computers and Communications (ISCC’09). 811--817.

[130]

John Shalf, Sudip Dosanjh, and John Morrison. 2010. Exascale computing technology challenges. In High Performance Computing for Computational Science (VECPAR’10). 1--25.

[131]

Jaswinder Pal Singh, Truman Joe, Anoop Gupta, and John L. Hennessy. 1993. An empirical comparison of the Kendall Square Research KSR-1 and Stanford DASH multiprocessors. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’93). 214--225.

[132]

Mohsen Soryani, Morteza Analoui, and Ghobad Zarrinchian. 2013. Improving inter-node communications in multi-core clusters using a contention-free process mapping algorithm. Journal of Supercomputing 66, 1 (Apr 2013), 488--513.

[133]

Brad Spengler. 2003. PaX: The Guaranteed End of Arbitrary Code Execution. Technical Report.

[134]

David Tam, Reza Azimi, and Michael Stumm. 2007. Thread clustering: Sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In ACM SIGOPS/EuroSys European Conference on Computer Systems. 47--58.

[135]

Christian Terboven, Dieter an Mey, Dirk Schmidl, Henry Jin, and Thomas Reichstein. 2008. Data and thread affinity in OpenMP programs. In Workshop on Memory Access on Future Processors: A Solved Problem? (MAW’08). 377--384.

[136]

The Open MPI Project. 2009. Portable Linux Processor Affinity (PLPA). Retrieved 2015-06-08 from https://www.open-mpi.org/projects/plpa/.

[137]

The Open MPI Project. 2013. mpirun(1) man page (version 1.6.4). Retrieved 2016-02-08 from http://www.open-mpi.org/doc/v1.6/man1/mpirun.1.php#sect9.

[138]

Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2007. Building portable thread schedulers for hierarchical multiprocessors: The bubblesched framework. In Euro-Par. 42--51.

[139]

Mustafa M. Tikir and Jeffrey K. Hollingsworth. 2004. Using hardware counters to automatically improve memory performance. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’04).

[140]

Mustafa M. Tikir and Jeffrey K. Hollingsworth. 2008. Hardware monitors for dynamic page migration. Journal of Parallel and Distributed Computing (JPDC) 68, 9 (Sep 2008), 1186--1200.

[141]

Jesper Larsson Träff. 2002. Implementing the MPI process topology mechanism. In ACM/IEEE Conference on Supercomputing (SC’02).

[142]

François Trahay, François Rue, Mathieu Faverge, Yutaka Ishikawa, Raymond Namyst, and Jack Dongarra. 2011. EZTrace: A generic framework for performance analysis. In International Symposium on Cluster, Cloud and Grid Computing (CCGrid’11). 618--619.

[143]

Ruud van der Pas. 2009. Getting OpenMP Up To Speed. Technical Report.

[144]

Goran Velkoski, Sasko Ristov, and Marjan Gusev. 2013. Loosely or tightly coupled affinity for matrix-vector multiplication. In International Convention on Information & Communication Technology Electronics & Microelectronics (MIPRO’13). 228--233.

[145]

Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. 1996b. Operating system support for improving data locality on CC-NUMA compute servers. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’96). 279--289.

[146]

Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. 1996a. OS Support for Improving Data Locality on CC-NUMA Compute Servers. Technical Report.

[147]

Carlos Villavieja, Vasileios Karakostas, Lluis Vilanova, Yoav Etsion, Alex Ramirez, Avi Mendelson, Nacho Navarro, Adrian Cristal, and Osman S. Unsal. 2011. DiDi: Mitigating the performance impact of TLB shootdowns using a shared TLB directory. In International Conference on Parallel Architectures and Compilation Techniques (PACT’11). 340--349.

[148]

Skef Wholey. 1991. Automatic data mapping for distributed-memory parallel computers. In International Conference on Supercomputing (ICS’91). 25--34.

[149]

Chee Siang Wong, Ian Tan, Rosalind Deena Kumari, and Fun Wey. 2008. Towards achieving fairness in the Linux scheduler. ACM SIGOPS Operating Systems Review 42, 5 (Jul 2008), 34--43.

[150]

Yang You, James Demmel, Kenneth Czechowski, Le Song, and Richard Vuduc. 2015. CA-SVM: Communication-avoiding support vector machines on distributed systems. In IEEE International Parallel and Distributed Processing Symposium (IPDPS’15). 847--859.

[151]

Jidong Zhai, Tianwei Sheng, and Jiangzhou He. 2011. Efficiently acquiring communication traces for large-scale parallel applications. IEEE Transactions on Parallel and Distributed Systems (TPDS) 22, 11 (2011), 1862--1870.

[152]

Xing Zhou, Wenguang Chen, and Weimin Zheng. 2009. Cache sharing management for performance fairness in chip multiprocessors. In International Conference on Parallel Architectures and Compilation Techniques (PACT’09). 384--393.

[153]

Sergey Zhuravlev, Juan Carlos Saez, Sergey Blagodurov, Alexandra Fedorova, and Manuel Prieto. 2012. Survey of scheduling techniques for addressing shared resources in multicore processors. ACM Computing Surveys (CSUR) 45, 1 (2012), 1--32.

[154]

Dimitrios Ziakas, Allen Baum, Robert A. Maddox, and Robert J. Safranek. 2010. Intel QuickPath Interconnect architectural features supporting scalable system architectures. In Symposium on High Performance Interconnects (HOTI’10). 1--6.

Index Terms

Affinity-Based Thread and Data Mapping in Shared Memory Systems
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Main memory
        Process management
        Scheduling

Information & Contributors

Information

Published In

ACM Computing Surveys Volume 49, Issue 4

December 2017

666 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/3022634

Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering / University of Florida / Gainesville, FL 32611

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 December 2016

Accepted: 01 October 2016

Revised: 01 September 2016

Received: 01 June 2016

Published in CSUR Volume 49, Issue 4

Qualifiers

Survey
Research
Refereed

Funding Sources

MCTI/RNP Brazil under the HPC4E project
CAPES
