Abstract
We present a parallel scan (prefix sum) algorithm in the Tensor Core Unit (TCU) model of computation. The TCU model assumes that multiplication between two square matrices of constant size s is a basic operation. In the \((s^2,\ell )\)-TCU model, we show that for inputs of size n, the algorithm has depth at most \(2\lfloor \log _s(n)\rfloor \) and runs in \(\mathcal {O}(n(1+\ell /s^2) / p + (s^2 + \ell ) \log _s (n))\) assuming p tensor core units. Equivalently, the algorithm performs \(\mathcal {O}(n/s^2)\) multiplications of square matrices of size s.
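As a minimal illustration of the model's basic operation (not taken from the paper): multiplying a chunk of s values by the upper-triangular all-ones matrix of size s computes its prefix sums, which is why scans reduce to matrix multiplications in the TCU model. The values below are arbitrary examples.

```python
import numpy as np

# One s x s triangular matrix multiply computes the prefix sums
# of a chunk of s values (here s = 4; the values are illustrative).
s = 4
x = np.array([3, 1, 4, 1])
U = np.triu(np.ones((s, s), dtype=int))  # upper-triangular all-ones matrix
print(x @ U)  # -> [3 4 8 9], the prefix sums of x
```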
Notes
1. The first parameter \(s^2\) of the TCU model is squared to avoid writing square roots on the matrix sizes.
2. That said, the cost of memory operations (memory coalescing, bank conflicts, etc.) is crucial for achieving high performance in an actual implementation.
3. Fan-in is the maximum number of inputs an adder can have; similarly, fan-out is the maximum number of outputs.
4. The main goal of the authors is to provide highly optimized kernels; hence, they use the warp/block/grid terminology of the CUDA programming model.
References
Blelloch, G.E.: Prefix sums and their applications. In: Synthesis of Parallel Algorithms, pp. 35–60. Morgan Kaufmann (1990)
Brent, R.P., Kung, H.T.: A regular layout for parallel adders. IEEE Trans. Comput. C-31(3), 260–264 (1982). https://doi.org/10.1109/TC.1982.1675982
Brent, R.P., Kung, H.T.: The chip complexity of binary arithmetic. In: Proceedings of the Symposium on Theory of Computing (STOC), pp. 190–200. ACM (1980). https://doi.org/10.1145/800141.804666
Brent, R.P.: The parallel evaluation of general arithmetic expressions. J. ACM 21(2), 201–206 (1974). https://doi.org/10.1145/321812.321815
Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of International Conference on Knowledge Discovery and Data Mining (KDD), pp. 785–794. ACM (2016). https://doi.org/10.1145/2939672.2939785
Chowdhury, R., Silvestri, F., Vella, F.: A computational model for tensor core units. In: Proceedings of Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 519–521. ACM (2020). https://doi.org/10.1145/3350755.3400252
Chowdhury, R., Silvestri, F., Vella, F.: Algorithm design for tensor units. In: Sousa, L., Roma, N., Tomás, P. (eds.) Euro-Par 2021. LNCS, vol. 12820, pp. 353–367. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-85665-6_22
Dakkak, A., Li, C., Xiong, J., Gelado, I., Hwu, W.M.: Accelerating reduction and scan using tensor core units. In: Proceedings of the ACM International Conference on Supercomputing, ICS 2019, pp. 46–57. ACM (2019). https://doi.org/10.1145/3330345.3331057
Harris, C., et al.: Array programming with NumPy. Nature 585(7825), 357–362 (2020). https://doi.org/10.1038/s41586-020-2649-2
Harris, D.: A taxonomy of parallel prefix networks. In: Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2, pp. 2213–2217 (2003). https://doi.org/10.1109/ACSSC.2003.1292373
Hillis, W.D., Steele, G.L.: Data parallel algorithms. Commun. ACM 29(12), 1170–1183 (1986). https://doi.org/10.1145/7902.7903
Hwu, W.W., Kirk, D.B., El Hajj, I.: Programming Massively Parallel Processors, 4th edn. Morgan Kaufmann (2023). https://doi.org/10.1016/B978-0-323-91231-0.00006-9
Jouppi, N.P., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of International Symposium on Computer Architecture (ISCA), pp. 1–12. ACM (2017). https://doi.org/10.1145/3079856.3080246
Jouppi, N.P., et al.: Ten lessons from three generations shaped Google’s TPUv4i: industrial product. In: Proceedings of International Symposium on Computer Architecture (ISCA), pp. 1–14 (2021). https://doi.org/10.1109/ISCA52012.2021.00010
Kogge, P.M., Stone, H.S.: A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Trans. Comput. C-22(8), 786–793 (1973). https://doi.org/10.1109/TC.1973.5009159
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105 (2012)
Liao, H., et al.: Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: industry track paper. In: Proceedings of International Symposium on High-Performance Computer Architecture (HPCA), pp. 789–801. IEEE (2021). https://doi.org/10.1109/HPCA51647.2021.00071
Liao, H., Tu, J., Xia, J., Zhou, X.: DaVinci: a scalable architecture for neural network computing. In: Hot Chips Symposium on High-Performance Chips (HCS), pp. 1–44 (2019). https://doi.org/10.1109/HOTCHIPS.2019.8875654
NVIDIA: NVIDIA DGX-1 with Tesla V100 system architecture. White paper, NVIDIA Corporation (2017). https://images.nvidia.com/content/pdf/dgx1-v100-system-architecture-whitepaper.pdf
Sklansky, J.: Conditional-sum addition logic. IRE Trans. Electron. Comput. EC-9(2), 226–231 (1960). https://doi.org/10.1109/TEC.1960.5219822
Snir, M.: Depth-size trade-offs for parallel prefix computation. J. Algorithms 7(2), 185–201 (1986). https://doi.org/10.1016/0196-6774(86)90003-9
Zhu, H., Cheng, C.K., Graham, R.: On the construction of zero-deficiency parallel prefix circuits with minimum depth. ACM Trans. Des. Autom. Electron. Syst. 11(2), 387–409 (2006). https://doi.org/10.1145/1142155.1142162
Zimmermann, R.V.: Binary adder architectures for cell-based VLSI and their synthesis. Ph.D. thesis, Swiss Federal Institute of Technology Zurich, Zurich (1997)
A Appendix
We provide a functional end-to-end (but not high-performance) Python implementation of Algorithm 1 using NumPy (v1.24.1) [9]. The implementation demonstrates the memory layout operations required to orchestrate the batched matrix multiplications of Algorithm 1.
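The paper's own implementation is not reproduced here; the following is an independent NumPy sketch of the same idea (batched triangular matrix multiplies for the local scans, a recursive scan of the chunk totals, then a broadcast-add), assuming the input length is a power of s. Function and variable names are our own.

```python
import numpy as np

def matmul_scan(x, s=4):
    """Prefix sum via s x s matrix multiplies; len(x) must be a power of s."""
    n = len(x)
    m = min(n, s)
    U = np.triu(np.ones((m, m), dtype=x.dtype))  # upper-triangular all-ones
    if n <= s:                          # base case: one triangular multiply
        return x @ U
    # local prefix sums on consecutive chunks of size s (batched matmul)
    local = x.reshape(-1, s) @ U
    # recursively scan the chunk totals (last entry of each chunk)
    totals = matmul_scan(local[:, -1].copy(), s)
    # broadcast-add the cumulative total of preceding chunks to each chunk
    local[1:, :] += totals[:-1, None]
    return local.reshape(-1)

x = np.arange(1, 17)                    # n = 16 = s^2 with s = 4
print(matmul_scan(x))                   # matches np.cumsum(x)
```

The reshape/stride bookkeeping here stands in for the memory-layout orchestration that the appendix's implementation performs before each batched multiplication.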
A.1 Correctness of Algorithm 1
In this section, we prove the correctness of Algorithm 1. We reformulate Algorithm 1 using recursion, as stated in Algorithm 2; the recursive formulation enables a proof of correctness by strong induction. Specifically, we prove by induction that MatMulScanRecursive is correct for all inputs whose size is a power of s, for an arbitrary \(s\ge 2\).
In particular, it suffices to show that the Recurse method with inputs \(\boldsymbol{z}\) and s satisfies the following precondition/postcondition relation: given the precondition that the “local” prefix sums are precomputed on all consecutive chunks of size s of \(\boldsymbol{z}\), i.e., on the index ranges \((0,1,\dots , s-1), (s,s+1,\dots , 2s-1), \dots \), Recurse returns the prefix sum of \(\boldsymbol{z}\) (postcondition). Indeed, by the definition of MatMulScanRecursive, Line 2 computes the “local” prefix sums of size s, and Line 3 calls the Recurse method with the precondition satisfied.
Base Case. For inputs of size less than s, the termination criterion of Line 6 is met; the postcondition then follows directly from the precondition, since the whole input forms a single chunk whose “local” prefix sum is its full prefix sum.
Inductive Step. The inductive hypothesis is that Recurse is correct for all input sizes strictly less than n; we show that it is correct for inputs of size n. Given an input \(\boldsymbol{z}\) whose “local” prefix sums are all precomputed, we prove that Recurse returns the prefix sum of \(\boldsymbol{z}\). Line 8 computes the “local” prefix sums on the s-strided subvector \(\boldsymbol{x}[\text {start}::\text {step}]\), and Line 9 computes the prefix sum of \(\boldsymbol{x}[\text {start}::\text {step}]\) by the inductive hypothesis. Then, Line 11 broadcasts and adds the correct prefix sum values of the s-strided subvector of \(\boldsymbol{z}\) to the s indices following each strided position. Hence, the postcondition of Recurse holds.
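To make the broadcast-add step concrete, here is a small hedged example with s = 3 (the chunk values are illustrative, and np.cumsum stands in for the recursive call on the strided subvector):

```python
import numpy as np

s = 3
# z satisfying the precondition: each consecutive chunk of size s holds its
# local prefix sums (underlying values are 1,2,3 | 2,3,4 | 4,4,5)
z = np.array([1, 3, 6, 2, 5, 9, 4, 8, 13])
# stand-in for Recurse: prefix sums of the s-strided subvector z[s-1::s]
totals = np.cumsum(z[s - 1 :: s])        # [6, 15, 28]
out = z.reshape(-1, s).copy()
out[1:, :] += totals[:-1, None]          # add each total to the following chunk
print(out.reshape(-1))                   # [ 1  3  6  8 11 15 19 23 28]
```

The result is the full prefix sum of the underlying values, matching the postcondition.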
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zouzias, A., McColl, W.F. (2023). A Parallel Scan Algorithm in the Tensor Core Unit Model. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_33
DOI: https://doi.org/10.1007/978-3-031-39698-4_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39697-7
Online ISBN: 978-3-031-39698-4