An Optimal Parallel Prefix-Sums Algorithm on the Memory Machine Models for GPUs

Nakano, Koji

doi:10.1007/978-3-642-33078-0_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7439)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

The main contribution of this paper is to show optimal algorithms for computing the sum and the prefix-sums on two memory machine models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). The DMM and the UMM are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. These models have three parameters: the number $p$ of threads, the width $w$ of the memory, and the memory access latency $l$. We first show that the sum of $n$ numbers can be computed in $O({n\over w}+{nl\over p}+l\log n)$ time units on the DMM and the UMM. We then show that $\Omega({n\over w}+{nl\over p}+l\log n)$ time units are necessary to compute the sum. Finally, we show an optimal parallel algorithm that computes the prefix-sums of $n$ numbers in $O({n\over w}+{nl\over p}+l\log n)$ time units on the DMM and the UMM.
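To make the quantities in the abstract concrete, the following is a minimal Python sketch and not the paper's algorithm. It assumes the inclusive convention for prefix-sums (output i is the sum of inputs 0 through i) and evaluates the three terms of the bound with all hidden constants dropped and a base-2 logarithm. The function names `prefix_sums` and `bound_terms` and the dictionary keys are hypothetical labels chosen for illustration.

```python
import math
from itertools import accumulate


def prefix_sums(values):
    """Sequential reference: out[i] = values[0] + ... + values[i] (inclusive, assumed)."""
    return list(accumulate(values))


def bound_terms(n, p, w, l):
    """Evaluate the three terms of n/w + n*l/p + l*log(n).

    Constants hidden by the O-notation are dropped and log base 2 is an
    assumption, so the values only indicate which term dominates.
    """
    return {
        "n_over_w": n / w,            # term n/w
        "nl_over_p": n * l / p,       # term nl/p
        "l_log_n": l * math.log2(n),  # term l log n
    }


if __name__ == "__main__":
    print(prefix_sums([3, 1, 4, 1, 5]))
    print(bound_terms(n=1024, p=256, w=32, l=8))
```

With these illustrative parameters the `l_log_n` term is the largest of the three. Increasing `n` while holding `p`, `w`, and `l` fixed eventually makes the two terms that are linear in `n` dominate.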

Author information

Authors and Affiliations

Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima, 739-8527, Japan
Koji Nakano

Editor information

Editors and Affiliations

School of Information Technology, Deakin University, Melbourne Burwood Campus, 221 Burwood Highway, 3125, Burwood, VIC, Australia
Yang Xiang
SEECS, University of Ottawa, 8, King Edward Ave, K1N 6N5, Ottawa, ON, Canada
Ivan Stojmenovic
Department of Intelligent Informatics, Kyushu Sangyo University, 2-3-1 Matsukadai, Higashi-ku, 813-8503, Fukuoka, Japan
Bernady O. Apduhan
School of Information Science and Engineering, Central South University, 410083, Changsha, Hunan Province, P.R. China
Guojun Wang
Department of Information Engineering, Hiroshima University, 1-4-1, Kagamiyama, 739-8527, Higashi-Hiroshima, Japan
Koji Nakano
School of Information Technologies, University of Sydney, Building J12, 2006, Sydney, NSW, Australia
Albert Zomaya

About this paper

Cite this paper

Nakano, K. (2012). An Optimal Parallel Prefix-Sums Algorithm on the Memory Machine Models for GPUs. In: Xiang, Y., Stojmenovic, I., Apduhan, B.O., Wang, G., Nakano, K., Zomaya, A. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2012. Lecture Notes in Computer Science, vol 7439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33078-0_8

DOI: https://doi.org/10.1007/978-3-642-33078-0_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33077-3
Online ISBN: 978-3-642-33078-0
eBook Packages: Computer Science, Computer Science (R0)
