Scalable SIMD-parallel memory allocation for many-core machines

The Journal of Supercomputing

Abstract

Dynamic memory allocation is an important feature of modern programming systems. However, the cost of memory allocation in massively parallel execution environments such as CUDA has been too high for many types of kernels. This paper presents XMalloc, a high-throughput memory allocation mechanism that dramatically magnifies the allocation throughput of an underlying memory allocator. XMalloc embodies two key techniques: allocation coalescing and buffering using efficient queues. This paper describes these two techniques and presents our implementation of XMalloc as a memory allocator library. The library is designed to be called from kernels executed by massive numbers of threads. Our experimental results based on the NVIDIA G480 GPU show that XMalloc magnifies the allocation throughput of the underlying memory allocator by a factor of 48.
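The allocation coalescing idea named in the abstract can be illustrated with a small host-side simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: `BumpAllocator`, `coalesced_malloc`, and `WARP_SIZE` are hypothetical names introduced here, addresses are plain integers, and alignment, block headers, and freeing are left out. It shows only the core step, in which the requests of one SIMD group are summed so that a single call reaches the underlying allocator and each lane receives an offset into the shared block.

```python
from itertools import accumulate

WARP_SIZE = 32  # assumed SIMD group width, for illustration only


class BumpAllocator:
    """Hypothetical stand-in for the underlying allocator; counts its calls."""

    def __init__(self) -> None:
        self.next_free = 0  # next unused address in a flat address space
        self.calls = 0      # number of times malloc() was invoked

    def malloc(self, size: int) -> int:
        self.calls += 1
        addr = self.next_free
        self.next_free += size
        return addr


def coalesced_malloc(allocator: BumpAllocator, sizes: list[int]) -> list[int]:
    """Serve one SIMD group's requests with a single underlying allocation.

    sizes[i] is the byte count requested by lane i; the result holds one
    address per lane, in lane order.
    """
    if len(sizes) > WARP_SIZE:
        raise ValueError("more requests than lanes in one SIMD group")
    # Running totals: entry i is lane i's offset inside the shared block,
    # and the final entry is the combined size of all requests.
    offsets = [0, *accumulate(sizes)]
    base = allocator.malloc(offsets[-1])  # the only call to the allocator
    return [base + off for off in offsets[:-1]]


allocator = BumpAllocator()
addresses = coalesced_malloc(allocator, [16, 32, 8, 64])
print(addresses, allocator.calls)
```

Four requests reach the underlying allocator as one call for 120 bytes. A real design would also need a way to decide when the shared block can be released, since the lanes free their pieces independently.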

Author information

Correspondence to Christopher I. Rodrigues.

About this article

Cite this article

Huang, X., Rodrigues, C.I., Jones, S. et al. Scalable SIMD-parallel memory allocation for many-core machines. J Supercomput 64, 1008–1020 (2013). https://doi.org/10.1007/s11227-011-0680-7

  • DOI: https://doi.org/10.1007/s11227-011-0680-7
