ABSTRACT
Concurrent data structures play a critical role in the overall performance of GPGPU applications. The stack is a basic data structure with numerous applications in which data is processed in Last In First Out (LIFO) order. Although concurrent stacks are well studied for multi-core CPUs, little research addresses the conversion of CPU stacks into a GPU-friendly form. In this paper, we propose a concurrent search-based GPU stack named Scan Stack. The proposed stack is designed to take advantage of GPU memory access patterns, memory coalescing, and thread structures (i.e., warps) to increase throughput. Our experiments on an NVIDIA RTX 3090 show that the proposed Scan Stack significantly improves throughput and scalability on all benchmarks when the search area is reduced. The greatest improvements appear when elimination is possible, where throughput reaches nearly 39 times that of a non-optimized structure.
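To make the idea of a search-based stack concrete, the following is a minimal host-side C++17 sketch, not the authors' GPU implementation. It assumes one plausible reading of "search-based": values live in a fixed array, a push scans upward from an approximate top hint for an empty slot, and a pop scans downward for an occupied one. The class name `ScanStackSketch`, the sentinel `kEmpty`, and the hint field are hypothetical; warp-level cooperation, coalesced access, and elimination are not shown.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical sentinel marking an empty slot (an assumption, not from the paper).
constexpr std::uint32_t kEmpty = 0xFFFFFFFFu;

// Illustrative array-based stack whose operations search for a usable slot.
class ScanStackSketch {
public:
    explicit ScanStackSketch(std::size_t capacity) : slots_(capacity), hint_(0) {
        for (auto& s : slots_) s.store(kEmpty, std::memory_order_relaxed);
    }

    // Push: scan upward from the hint and claim the first empty slot with CAS.
    bool push(std::uint32_t value) {
        if (value == kEmpty) return false;  // the sentinel cannot be stored
        std::size_t start = hint_.load(std::memory_order_relaxed);
        for (std::size_t i = start; i < slots_.size(); ++i) {
            std::uint32_t expected = kEmpty;
            if (slots_[i].compare_exchange_strong(expected, value,
                                                  std::memory_order_acq_rel)) {
                hint_.store(i + 1, std::memory_order_relaxed);
                return true;
            }
        }
        return false;  // no empty slot found at or above the hint
    }

    // Pop: scan downward from the hint and take the first occupied slot with CAS.
    std::optional<std::uint32_t> pop() {
        std::size_t start = hint_.load(std::memory_order_relaxed);
        if (start > slots_.size()) start = slots_.size();
        for (std::size_t i = start; i-- > 0;) {
            std::uint32_t v = slots_[i].load(std::memory_order_acquire);
            if (v != kEmpty &&
                slots_[i].compare_exchange_strong(v, kEmpty,
                                                  std::memory_order_acq_rel)) {
                hint_.store(i, std::memory_order_relaxed);
                return v;
            }
        }
        return std::nullopt;  // no occupied slot found below the hint
    }

private:
    std::vector<std::atomic<std::uint32_t>> slots_;
    std::atomic<std::size_t> hint_;  // approximate top; only narrows the search
};
```

In this sketch the hint only bounds where each scan starts, which is one way to read "reducing the search area". Because the hint is updated without synchronization against other operations, the sketch preserves LIFO order in single-threaded use but makes no strict ordering guarantee under contention.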