Scalable SIMD-parallel memory allocation for many-core machines

Huang, Xiaohuang; Rodrigues, Christopher I.; Jones, Stephen; Buck, Ian; Hwu, Wen-mei

doi:10.1007/s11227-011-0680-7

Published: 23 September 2011

Volume 64, pages 1008–1020, (2013)

The Journal of Supercomputing

Abstract

Dynamic memory allocation is an important feature of modern programming systems. However, the cost of memory allocation in massively parallel execution environments such as CUDA has been too high for many types of kernels. This paper presents XMalloc, a high-throughput memory allocation mechanism that dramatically magnifies the allocation throughput of an underlying memory allocator. XMalloc embodies two key techniques: allocation coalescing and buffering using efficient queues. This paper describes these two techniques and presents our implementation of XMalloc as a memory allocator library. The library is designed to be called from kernels executed by massive numbers of threads. Our experimental results based on the NVIDIA G480 GPU show that XMalloc magnifies the allocation throughput of the underlying memory allocator by a factor of 48.
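
To make the idea of allocation coalescing concrete, the following is a minimal host-side sketch in Python, not the authors' implementation. It assumes that the requests of one SIMD group are summed, served by a single call to the underlying allocator, and handed out as offsets computed with an exclusive prefix sum, and that the shared block is released only after every sub-allocation has been freed. The names `CountingAllocator`, `coalesced_malloc`, `coalesced_free`, `HEADER_BYTES`, and `ALIGN` are hypothetical and do not come from the paper or from CUDA.

```python
from itertools import accumulate

HEADER_BYTES = 8   # assumed size of the bookkeeping header per coalesced block
ALIGN = 8          # assumed alignment of each lane's sub-allocation


class CountingAllocator:
    """Stand-in for a slow underlying allocator; counts how often it is called."""

    def __init__(self):
        self.calls = 0
        self.next_addr = 0

    def malloc(self, nbytes):
        self.calls += 1
        addr = self.next_addr
        self.next_addr += nbytes   # simple bump allocation, never reuses memory
        return addr


def round_up(n, align=ALIGN):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def coalesced_malloc(allocator, sizes):
    """Serve all requests of one SIMD group with a single underlying malloc.

    sizes[i] is the byte count requested by lane i (0 means no request).
    Returns (block, addresses); addresses[i] is None for idle lanes.
    """
    padded = [round_up(s) for s in sizes]
    # Exclusive prefix sum: each lane's offset inside the shared block.
    offsets = [0] + list(accumulate(padded))[:-1]
    total = sum(padded)
    if total == 0:
        return None, [None] * len(sizes)
    base = allocator.malloc(HEADER_BYTES + total)   # one call for the whole group
    block = {"base": base, "live": sum(1 for s in sizes if s > 0)}
    addrs = [base + HEADER_BYTES + off if s > 0 else None
             for s, off in zip(sizes, offsets)]
    return block, addrs


def coalesced_free(block, freed_blocks):
    """Release one lane's share; the block is returned once all shares are freed."""
    block["live"] -= 1
    if block["live"] == 0:
        freed_blocks.append(block["base"])


if __name__ == "__main__":
    alloc = CountingAllocator()
    block, addrs = coalesced_malloc(alloc, [16, 0, 5, 32])
    print("underlying malloc calls:", alloc.calls)
    print("per-lane addresses:", addrs)
```

In this sketch, three active lanes cost one call to the underlying allocator instead of three. On a GPU the prefix sum and the choice of the lane that performs the call would be done with warp-level operations, which are omitted here.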

Author information

Authors and Affiliations

University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
Xiaohuang Huang, Christopher I. Rodrigues & Wen-mei Hwu
NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara, CA, 95050, USA
Stephen Jones & Ian Buck

Corresponding author

Correspondence to Christopher I. Rodrigues.

About this article

Cite this article

Huang, X., Rodrigues, C.I., Jones, S. et al. Scalable SIMD-parallel memory allocation for many-core machines. J Supercomput 64, 1008–1020 (2013). https://doi.org/10.1007/s11227-011-0680-7

Issue Date: June 2013
