Abstract
Dynamic memory allocation is an important feature of modern programming systems. However, the cost of memory allocation in massively parallel execution environments such as CUDA has been too high for many types of kernels. This paper presents XMalloc, a high-throughput memory allocation mechanism that dramatically magnifies the allocation throughput of an underlying memory allocator. XMalloc embodies two key techniques: allocation coalescing and buffering using efficient queues. This paper describes these two techniques and presents our implementation of XMalloc as a memory allocator library. The library is designed to be called from kernels executed by massive numbers of threads. Our experimental results on the NVIDIA GTX 480 GPU show that XMalloc magnifies the allocation throughput of the underlying memory allocator by a factor of 48.
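The abstract names allocation coalescing without describing its mechanics. The following is a minimal host-side C++ sketch of one plausible reading of the idea, not the XMalloc implementation: several requests are summed and served by a single call to the underlying allocator, and the combined block is released only when its last piece is freed. All names here (`coalesced_alloc`, `coalesced_free`, `BlockHeader`, `SubHeader`, the counters) are hypothetical, `std::malloc` stands in for the underlying allocator, and an explicit list of sizes stands in for the group of GPU threads that would issue requests together.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Assumption: every piece is padded to the platform's maximum alignment.
constexpr std::size_t kAlign = alignof(std::max_align_t);
constexpr std::size_t round_up(std::size_t n) {
    return (n + kAlign - 1) / kAlign * kAlign;
}

// Hypothetical header at the start of each coalesced block.
struct BlockHeader { std::atomic<int> live; };  // pieces not yet freed
// Hypothetical header in front of each piece, pointing back to its block.
struct SubHeader { BlockHeader* block; };

constexpr std::size_t kBlockHdr = round_up(sizeof(BlockHeader));
constexpr std::size_t kSubHdr = round_up(sizeof(SubHeader));

// Counters that make the number of underlying allocator calls observable.
static int g_underlying_allocs = 0;
static int g_underlying_frees = 0;

// Serve all requests in `sizes` with one call to the underlying allocator.
// Returns one pointer per request (all null if the allocation fails).
std::vector<void*> coalesced_alloc(const std::vector<std::size_t>& sizes) {
    std::vector<void*> out(sizes.size(), nullptr);
    if (sizes.empty()) return out;

    std::size_t total = kBlockHdr;
    for (std::size_t s : sizes) total += kSubHdr + round_up(s);

    char* base = static_cast<char*>(std::malloc(total));
    ++g_underlying_allocs;
    if (!base) return out;

    BlockHeader* block = new (base) BlockHeader{};
    block->live.store(static_cast<int>(sizes.size()));

    // Carve the block into [SubHeader][payload] pieces, one per request.
    char* cursor = base + kBlockHdr;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        new (cursor) SubHeader{block};
        out[i] = cursor + kSubHdr;
        cursor += kSubHdr + round_up(sizes[i]);
    }
    return out;
}

// Free one piece; the underlying block is released with the last piece.
void coalesced_free(void* p) {
    if (!p) return;
    SubHeader* sub =
        reinterpret_cast<SubHeader*>(static_cast<char*>(p) - kSubHdr);
    BlockHeader* block = sub->block;
    assert(block->live.load() > 0);
    if (block->live.fetch_sub(1) == 1) {
        block->~BlockHeader();
        std::free(block);
        ++g_underlying_frees;
    }
}
```

In this sketch, four requests cost one underlying allocation and one underlying free, at the price of a small header per piece. The second technique named in the abstract, buffering with queues, is not modeled here.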