
GPU-Accelerated Large-Scale Distributed Sorting Coping with Device Memory Capacity



Abstract:

Splitter-based parallel sorting algorithms are known to be highly efficient for distributed sorting due to their low communication complexity. Although GPU accelerators can reduce computation cost in general, their effectiveness in distributed sorting algorithms remains unclear. We investigate the applicability of GPU devices to splitter-based algorithms and extend HykSort, an existing splitter-based algorithm, by offloading its costly computation phases to GPUs. To cope with data volumes exceeding the GPU memory capacity, an out-of-core local sort is used, with a small overhead of about 7.5 percent when the data size is tripled. We evaluate the performance of our implementation on the TSUBAME2.5 supercomputer, which comprises over 4,000 NVIDIA K20x GPUs. Weak scaling analysis shows a 389-fold speedup with 0.25 TB/s throughput when sorting 4 TB of 64-bit integer values on 1,024 nodes compared to running on one node; this is 1.40 times faster than the reference CPU implementation. Detailed analysis, however, reveals that performance is mostly bottlenecked by the CPU-GPU host-to-device bandwidth. With order-of-magnitude bandwidth improvements announced for next-generation GPUs, the performance should improve accordingly, in line with other successful GPU accelerations.
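The splitter-based scheme the abstract refers to can be illustrated with a minimal single-process sketch. This is a hypothetical simulation of the general sample-sort/splitter idea, not the paper's HykSort implementation: each simulated rank sorts locally (the phase the paper offloads to GPUs), global splitters are chosen from sampled keys, and a single all-to-all-style exchange routes each key to the rank owning its range, which is what keeps communication complexity low.

```python
import random

def splitter_sort(partitions):
    """Simulate splitter-based distributed sorting across p ranks.

    Hypothetical sketch: each rank sorts locally, contributes samples,
    and p-1 global splitters partition the key space so rank i receives
    the i-th key range in one all-to-all exchange.
    """
    p = len(partitions)
    # Phase 1: local sort on every "rank" (GPU-offloaded in the paper).
    local = [sorted(part) for part in partitions]
    # Phase 2: each rank samples up to p keys; gather all samples and
    # pick p-1 evenly spaced global splitters from the sorted samples.
    samples = sorted(s for part in local
                     for s in random.sample(part, min(p, len(part))))
    splitters = [samples[(i + 1) * len(samples) // p] for i in range(p - 1)]
    # Phase 3: all-to-all exchange -- route each key to the rank whose
    # key range contains it (count of splitters <= key).
    buckets = [[] for _ in range(p)]
    for part in local:
        for key in part:
            dest = sum(key >= s for s in splitters)
            buckets[dest].append(key)
    # Phase 4: final local sort/merge of the received keys.
    return [sorted(b) for b in buckets]

# Four simulated ranks with 100 random keys each.
data = [[random.randrange(1000) for _ in range(100)] for _ in range(4)]
result = splitter_sort(data)
```

Concatenating the per-rank outputs in rank order yields the globally sorted sequence, because the splitters impose a total order on the key ranges.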
Published in: IEEE Transactions on Big Data ( Volume: 2, Issue: 1, 01 March 2016)
Page(s): 57 - 69
Date of Publication: 05 January 2016

