Almost Optimal Column-wise Prefix-sum Computation on the GPU

Tokura, Hiroki; Fujita, Toru; Nakano, Koji; Ito, Yasuaki

doi:10.1007/978-3-319-78054-2_21

Conference paper
First Online: 23 March 2018

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10778)

Abstract

The row-wise and column-wise prefix-sum computation of a matrix has many applications in image processing, such as computing the summed area table and the Euclidean distance map. It is known that the prefix-sums of a 1-dimensional array can be computed efficiently on the GPU. Hence, the row-wise prefix-sums of a matrix can also be computed efficiently on the GPU by executing this prefix-sum algorithm for every row in parallel. However, the same approach does not work well for computing the column-wise prefix-sums, because it performs inefficient strided access to the global memory. The main contribution of this paper is an almost optimal column-wise prefix-sum algorithm on the GPU. Since all elements of the input matrix must be read and the resulting prefix-sums must be written, computing the column-wise prefix-sums cannot be faster than simple matrix duplication in the global memory of the GPU. Quite surprisingly, experimental results using an NVIDIA TITAN X show that our column-wise prefix-sum algorithm runs only 2–6% slower than matrix duplication. Thus, our column-wise prefix-sum algorithm is almost optimal.
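To make the terms in the abstract concrete, here is a minimal sequential sketch in Python of what column-wise prefix-sums and the summed area table compute. It is not the authors' GPU algorithm, and the function names are hypothetical, chosen only for illustration. The column-wise version sweeps the matrix row by row, so each step reads one contiguous row, which is one simple way to avoid walking down a column with a large stride.

```python
from itertools import accumulate


def row_prefix_sums(a):
    """Return b where b[i][j] = a[i][0] + ... + a[i][j]."""
    return [list(accumulate(row)) for row in a]


def column_prefix_sums(a):
    """Return b where b[i][j] = a[0][j] + ... + a[i][j].

    Illustrative only: a row-by-row sweep that adds the running
    totals of the previous rows to the current row.
    """
    out = []
    running = [0] * len(a[0]) if a else []
    for row in a:
        # Each step touches one whole row, element by element.
        running = [r + x for r, x in zip(running, row)]
        out.append(running)
    return out


def summed_area_table(a):
    """Row-wise prefix-sums followed by column-wise prefix-sums."""
    return column_prefix_sums(row_prefix_sums(a))


a = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
print(column_prefix_sums(a))  # [[1, 2, 3], [5, 7, 9], [12, 15, 18]]
print(summed_area_table(a))   # [[1, 3, 6], [5, 12, 21], [12, 27, 45]]
```

The bottom-right entry of the summed area table equals the sum of all matrix elements (45 here), which is a quick sanity check for any implementation.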

Author information

Authors and Affiliations

Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashihiroshima, 739-8527, Japan
Hiroki Tokura, Toru Fujita, Koji Nakano & Yasuaki Ito

Corresponding author

Correspondence to Koji Nakano.

Editor information

Editors and Affiliations

Czestochowa University of Technology, Czestochowa, Poland
Roman Wyrzykowski
University of Tennessee, Knoxville, Tennessee, USA
Jack Dongarra
University of Southern California, Marina Del Rey, California, USA
Ewa Deelman
Czestochowa University of Technology, Czestochowa, Poland
Konrad Karczewski

About this paper

Cite this paper

Tokura, H., Fujita, T., Nakano, K., Ito, Y. (2018). Almost Optimal Column-wise Prefix-sum Computation on the GPU. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2017. Lecture Notes in Computer Science, vol 10778. Springer, Cham. https://doi.org/10.1007/978-3-319-78054-2_21

DOI: https://doi.org/10.1007/978-3-319-78054-2_21
Published: 23 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-78053-5
Online ISBN: 978-3-319-78054-2
eBook Packages: Computer Science, Computer Science (R0)
