ABSTRACT
Graphics Processing Units (GPUs) have become ideal candidates for developing fine-grained parallel algorithms as the number of processing elements per GPU increases. Alongside the growth in core counts, new memory hierarchies and increased bandwidth allow for significant performance improvements when computation uses favorable memory access patterns.
Merging two sorted arrays is a useful primitive and a basic building block for numerous applications such as joining database queries, merging adjacency lists in graphs, and set intersection. An efficient parallel merging algorithm partitions the sorted input arrays into pairs of non-overlapping sub-arrays that can be merged independently on multiple cores. For optimal performance, the partitioning should itself be done in parallel and should divide the input arrays so that each core receives an equal amount of data to merge.
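The partitioning idea described above can be sketched on the CPU as follows: viewing the merge as a monotone path through the grid of the two inputs, equally spaced cross-diagonals are located by binary search, and each resulting pair of sub-arrays is merged independently. This is a minimal illustrative sketch, not the paper's implementation; the names `mergePathSearch` and `parallelStyleMerge` are assumptions introduced here.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Find the point (i, j), with i + j == diag, where the merge path of a and b
// crosses the given cross-diagonal: a[0..i) and b[0..j) together form the
// first `diag` elements of the merged output. Runs in O(log min(|a|, |b|)).
static std::pair<std::size_t, std::size_t>
mergePathSearch(const std::vector<int>& a, const std::vector<int>& b,
                std::size_t diag) {
    std::size_t lo = diag > b.size() ? diag - b.size() : 0;
    std::size_t hi = std::min(diag, a.size());
    while (lo < hi) {
        std::size_t i = lo + (hi - lo) / 2;  // candidate: take i from a ...
        std::size_t j = diag - i;            // ... and j from b
        if (j > 0 && a[i] < b[j - 1])
            lo = i + 1;                      // too few elements taken from a
        else
            hi = i;
    }
    return {lo, diag - lo};
}

// Split the output into p equal slices along cross-diagonals; each slice is
// produced by an independent merge (on a GPU, one slice per core).
std::vector<int> parallelStyleMerge(const std::vector<int>& a,
                                    const std::vector<int>& b, std::size_t p) {
    std::size_t total = a.size() + b.size();
    std::vector<int> out(total);
    for (std::size_t k = 0; k < p; ++k) {    // each iteration is independent
        std::size_t d0 = k * total / p, d1 = (k + 1) * total / p;
        auto [i0, j0] = mergePathSearch(a, b, d0);
        auto [i1, j1] = mergePathSearch(a, b, d1);
        std::merge(a.begin() + i0, a.begin() + i1,
                   b.begin() + j0, b.begin() + j1, out.begin() + d0);
    }
    return out;
}
```

Because the binary searches touch only O(p log n) elements, the partitioning cost is negligible next to the merges, and every slice contains exactly `(total/p)` output elements, which is what equalizes the work across cores.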
In this paper, we present an algorithm that partitions the workload equally amongst the GPU's Streaming Multiprocessors (SMs). We then show how each SM performs a parallel merge, and how the work is divided so that all of the GPU's Streaming Processors (SPs) are utilized. All stages of the algorithm are parallel, and the algorithm makes good use of the GPU memory hierarchy. This approach achieves an average speedup of 20X and 50X over a sequential merge on the x86 platform for integer and floating-point data, respectively. Our implementation is 10X faster than the fast parallel merge supplied in the CUDA Thrust library.
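The two-level division of work described above can be emulated on the CPU: "blocks" (standing in for SMs) receive equal global slices of the output via diagonal searches over the full inputs, and "threads" (standing in for SPs) receive equal sub-slices via diagonal searches within each block's window. This is a sketch under assumed names (`pathSplit`, `twoLevelMerge` are illustrative, not from the paper), with sequential loops where a GPU would launch blocks and threads.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Locate the intersection of the merge path with cross-diagonal `diag`:
// returns (i, j), i + j == diag, such that a[0..i) and b[0..j) are exactly
// the first `diag` elements of the merged output.
static std::pair<std::size_t, std::size_t>
pathSplit(const int* a, std::size_t n, const int* b, std::size_t m,
          std::size_t diag) {
    std::size_t lo = diag > m ? diag - m : 0;
    std::size_t hi = std::min(diag, n);
    while (lo < hi) {
        std::size_t i = lo + (hi - lo) / 2, j = diag - i;
        if (j > 0 && a[i] < b[j - 1]) lo = i + 1; else hi = i;
    }
    return {lo, diag - lo};
}

// Level 1: blocks get equal global output slices.
// Level 2: threads get equal sub-slices inside their block's window,
// each merged serially (as a single SP would).
std::vector<int> twoLevelMerge(const std::vector<int>& a,
                               const std::vector<int>& b,
                               std::size_t numBlocks, std::size_t numThreads) {
    std::size_t total = a.size() + b.size();
    std::vector<int> out(total);
    for (std::size_t blk = 0; blk < numBlocks; ++blk) {      // per "SM"
        std::size_t g0 = blk * total / numBlocks;
        std::size_t g1 = (blk + 1) * total / numBlocks;
        auto [ai, bi] = pathSplit(a.data(), a.size(), b.data(), b.size(), g0);
        auto [ae, be] = pathSplit(a.data(), a.size(), b.data(), b.size(), g1);
        const int* sa = a.data() + ai; std::size_t sn = ae - ai;
        const int* sb = b.data() + bi; std::size_t sm = be - bi;
        std::size_t sub = g1 - g0;                           // window length
        for (std::size_t t = 0; t < numThreads; ++t) {       // per "SP"
            std::size_t d0 = t * sub / numThreads;
            std::size_t d1 = (t + 1) * sub / numThreads;
            auto [i0, j0] = pathSplit(sa, sn, sb, sm, d0);
            auto [i1, j1] = pathSplit(sa, sn, sb, sm, d1);
            std::size_t i = i0, j = j0, o = g0 + d0;         // serial merge
            while (i < i1 && j < j1)
                out[o++] = (sa[i] <= sb[j]) ? sa[i++] : sb[j++];
            while (i < i1) out[o++] = sa[i++];
            while (j < j1) out[o++] = sb[j++];
        }
    }
    return out;
}
```

On a GPU, the outer loop becomes the grid of thread blocks and the inner loop the threads within a block; the second round of diagonal searches operates on small per-block windows, which is where fast on-chip memory pays off.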
REFERENCES
- S. Chen, J. Qin, Y. Xie, J. Zhao, and P. Heng. An efficient sorting algorithm with CUDA. Journal of the Chinese Institute of Engineers, 32(7):915--921, 2009.
- T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2001.
- N. Deo, A. Jain, and M. Medidi. An optimal parallel algorithm for merging using multiselection. Information Processing Letters, 1994.
- N. K. Govindaraju, N. Raghuvanshi, M. Henson, D. Tuft, and D. Manocha. A cache-efficient sorting algorithm for database and data mining computations using graphics processors. Technical report, 2005.
- J. Hoberock and N. Bell. Thrust: A parallel template library, 2010. Version 1.3.0.
- NVIDIA Corporation. NVIDIA CUDA programming guide, 2011.
- S. Odeh, O. Green, Z. Mwassi, O. Shmueli, and Y. Birk. Merge path - cache-efficient parallel merge and sort. Technical report, CCIT Report No. 802, EE Pub. No. 1759, Electrical Engr. Dept., Technion, Israel, Jan. 2012.
- S. Odeh, O. Green, Z. Mwassi, O. Shmueli, and Y. Birk. Merge path - parallel merging made simple. In Parallel and Distributed Processing Symposium, International, May 2012.
- N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In Parallel and Distributed Processing Symposium, International, pages 1--10, 2009.
- Y. Shiloach and U. Vishkin. Finding the maximum, merging, and sorting in a parallel computation model. Journal of Algorithms, 2:88--102, 1981.
- E. Sintorn and U. Assarsson. Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, 68(10):1381--1388, 2008.
Index Terms
- GPU merge path: a GPU merging algorithm