Abstract
This paper presents optimal parallel algorithms for computing the sum and the prefix-sums on two memory machine models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). The DMM and the UMM are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs, respectively. Both models have three parameters: the number p of threads, the width w of the memory, and the memory access latency l. We first show that the sum of n numbers can be computed in \(O({n\over w}+{nl\over p}+l\log n)\) time units on the DMM and the UMM. We then show that \(\Omega({n\over w}+{nl\over p}+l\log n)\) time units are necessary to compute the sum, so the algorithm is optimal. Finally, we show an optimal parallel algorithm that computes the prefix-sums of n numbers in \(O({n\over w}+{nl\over p}+l\log n)\) time units on the DMM and the UMM.
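To make the bound concrete, the following Python sketch evaluates the three terms of \({n\over w}+{nl\over p}+l\log n\) for given parameters and reports which one dominates. It also includes a sequential simulation of prefix-sums by doubling (Hillis-Steele style), which illustrates why about \(\log n\) dependent rounds, each paying the latency l, appear in the bound. This is an illustration only, not the paper's algorithm: the doubling scheme performs \(O(n\log n)\) additions and is not optimal, constant factors are ignored, and the function names `bound_terms`, `dominant_term`, and `prefix_sums_by_doubling` are hypothetical.

```python
import math
from itertools import accumulate


def bound_terms(n, w, p, l):
    """Return the three terms n/w, n*l/p and l*log2(n) of the bound.

    Constant factors hidden by the O-notation are ignored (assumption).
    """
    return (n / w, n * l / p, l * math.log2(n))


def dominant_term(n, w, p, l):
    """Name the largest of the three terms for the given parameters."""
    names = ("bandwidth n/w", "latency nl/p", "reduction l log n")
    terms = bound_terms(n, w, p, l)
    return names[terms.index(max(terms))]


def prefix_sums_by_doubling(a):
    """Inclusive prefix-sums using ceil(log2 n) doubling rounds.

    Each round reads only values produced by the previous round, so it
    models one parallel step. Not the paper's optimal algorithm.
    """
    x = list(a)
    d = 1
    while d < len(x):
        x = [x[i] + x[i - d] if i >= d else x[i] for i in range(len(x))]
        d *= 2
    return x


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(prefix_sums_by_doubling(data))
    print(list(accumulate(data)))  # sequential reference result
    print(dominant_term(n=1024, w=32, p=1024, l=100))
    print(dominant_term(n=2**20, w=32, p=2**20, l=100))
```

With small n and many threads the \(l\log n\) term dominates; as n grows with w fixed, the \({n\over w}\) bandwidth term takes over.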
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nakano, K. (2012). An Optimal Parallel Prefix-Sums Algorithm on the Memory Machine Models for GPUs. In: Xiang, Y., Stojmenovic, I., Apduhan, B.O., Wang, G., Nakano, K., Zomaya, A. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2012. Lecture Notes in Computer Science, vol 7439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33078-0_8
DOI: https://doi.org/10.1007/978-3-642-33078-0_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33077-3
Online ISBN: 978-3-642-33078-0
eBook Packages: Computer Science, Computer Science (R0)