Abstract
We present a parallel scan (prefix sum) algorithm in the Tensor Core Unit (TCU) model of computation. The TCU model assumes that multiplication between two square matrices of constant size s is a basic operation. In the \((s^2,\ell )\)-TCU model, we show that for inputs of size n, the algorithm has depth at most \(2\lfloor \log _s(n)\rfloor \) and runs in \(\mathcal {O}(n(1+\ell /s^2) / p + (s^2 + \ell ) \log _s (n))\) assuming p tensor core units. Equivalently, the algorithm performs \(\mathcal {O}(n/s^2)\) multiplications of square matrices of size s.
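As a minimal illustration of the model's basic operation (not taken from the paper): multiplying a chunk of s values by the upper-triangular all-ones matrix of size s computes its prefix sums, which is why scans reduce to matrix multiplications in the TCU model. The values below are arbitrary examples.

```python
import numpy as np

# One s x s triangular matrix multiply computes the prefix sums
# of a chunk of s values (here s = 4; the values are illustrative).
s = 4
x = np.array([3, 1, 4, 1])
U = np.triu(np.ones((s, s), dtype=int))  # upper-triangular all-ones matrix
print(x @ U)  # -> [3 4 8 9], the prefix sums of x
```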
Notes
1. The first parameter \(s^2\) of the TCU model is squared to avoid writing square roots on the matrix sizes.
2. That said, the cost of memory operations (memory coalescing, bank conflicts, etc.) is crucial for achieving high performance in an actual implementation.
3. Fan-in is the maximum number of inputs an adder can have; similarly, fan-out is the maximum number of outputs.
4. The main goal of the authors is to provide highly optimized kernels; hence, they use the warp/block/grid terminology of the CUDA programming model.
References
Blelloch, G.E.: Prefix sums and their applications. In: Synthesis of Parallel Algorithms, pp. 35–60. Morgan Kaufmann (1990)
Brent, R.P., Kung, H.T.: A regular layout for parallel adders. IEEE Trans. Comput. C-31(3), 260–264 (1982). https://doi.org/10.1109/TC.1982.1675982
Brent, R.P., Kung, H.T.: The chip complexity of binary arithmetic. In: Proceedings of the Symposium on Theory of Computing (STOC), pp. 190–200. ACM (1980). https://doi.org/10.1145/800141.804666
Brent, R.P.: The parallel evaluation of general arithmetic expressions. J. ACM 21(2), 201–206 (1974). https://doi.org/10.1145/321812.321815
Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of International Conference on Knowledge Discovery and Data Mining (KDD), pp. 785–794. ACM (2016). https://doi.org/10.1145/2939672.2939785
Chowdhury, R., Silvestri, F., Vella, F.: A computational model for tensor core units. In: Proceedings of Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 519–521. ACM (2020). https://doi.org/10.1145/3350755.3400252
Chowdhury, R., Silvestri, F., Vella, F.: Algorithm design for tensor units. In: Sousa, L., Roma, N., Tomás, P. (eds.) Euro-Par 2021. LNCS, vol. 12820, pp. 353–367. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-85665-6_22
Dakkak, A., Li, C., Xiong, J., Gelado, I., Hwu, W.M.: Accelerating reduction and scan using tensor core units. In: Proceedings of the ACM International Conference on Supercomputing, ICS 2019, pp. 46–57. ACM (2019). https://doi.org/10.1145/3330345.3331057
Harris, C., et al.: Array programming with NumPy. Nature 585(7825), 357–362 (2020). https://doi.org/10.1038/s41586-020-2649-2
Harris, D.: A taxonomy of parallel prefix networks. In: Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2, pp. 2213–2217 (2003). https://doi.org/10.1109/ACSSC.2003.1292373
Hillis, W.D., Steele, G.L.: Data parallel algorithms. Commun. ACM 29(12), 1170–1183 (1986). https://doi.org/10.1145/7902.7903
Hwu, W.W., Kirk, D.B., El Hajj, I.: Programming Massively Parallel Processors, 4th edn. Morgan Kaufmann (2023). https://doi.org/10.1016/B978-0-323-91231-0.00006-9
Jouppi, N.P., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of International Symposium on Computer Architecture (ISCA), pp. 1–12. ACM (2017). https://doi.org/10.1145/3079856.3080246
Jouppi, N.P., et al.: Ten lessons from three generations shaped Google’s TPUv4i: industrial product. In: Proceedings of International Symposium on Computer Architecture (ISCA), pp. 1–14 (2021). https://doi.org/10.1109/ISCA52012.2021.00010
Kogge, P.M., Stone, H.S.: A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Trans. Comput. C-22(8), 786–793 (1973). https://doi.org/10.1109/TC.1973.5009159
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105 (2012)
Liao, H., et al.: Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: industry track paper. In: Proceedings of International Symposium on High-Performance Computer Architecture (HPCA), pp. 789–801. IEEE (2021). https://doi.org/10.1109/HPCA51647.2021.00071
Liao, H., Tu, J., Xia, J., Zhou, X.: DaVinci: a scalable architecture for neural network computing. In: Hot Chips Symposium on High-Performance Chips (HCS), pp. 1–44 (2019). https://doi.org/10.1109/HOTCHIPS.2019.8875654
NVIDIA: NVIDIA DGX-1 with Tesla V100 system architecture. White paper, NVIDIA Corporation (2017). https://images.nvidia.com/content/pdf/dgx1-v100-system-architecture-whitepaper.pdf
Sklansky, J.: Conditional-sum addition logic. IRE Trans. Electron. Comput. EC-9(2), 226–231 (1960). https://doi.org/10.1109/TEC.1960.5219822
Snir, M.: Depth-size trade-offs for parallel prefix computation. J. Algorithms 7(2), 185–201 (1986). https://doi.org/10.1016/0196-6774(86)90003-9
Zhu, H., Cheng, C.K., Graham, R.: On the construction of zero-deficiency parallel prefix circuits with minimum depth. ACM Trans. Des. Autom. Electron. Syst. 11(2), 387–409 (2006). https://doi.org/10.1145/1142155.1142162
Zimmermann, R.V.: Binary adder architectures for cell-based VLSI and their synthesis. Ph.D. thesis, Swiss Federal Institute of Technology Zurich, Zurich (1997)
A Appendix
We provide a functional end-to-end (but not high-performance) Python implementation of Algorithm 1 using NumPy (v1.24.1) [9]. The implementation demonstrates the memory layout operations required to orchestrate the batched matrix multiplications of Algorithm 1.
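The paper's own implementation is not reproduced here; the following is an independent NumPy sketch of the same idea (batched triangular matrix multiplies for the local scans, a recursive scan of the chunk totals, then a broadcast-add), assuming the input length is a power of s. Function and variable names are our own.

```python
import numpy as np

def matmul_scan(x, s=4):
    """Prefix sum via s x s matrix multiplies; len(x) must be a power of s."""
    n = len(x)
    m = min(n, s)
    U = np.triu(np.ones((m, m), dtype=x.dtype))  # upper-triangular all-ones
    if n <= s:                          # base case: one triangular multiply
        return x @ U
    # local prefix sums on consecutive chunks of size s (batched matmul)
    local = x.reshape(-1, s) @ U
    # recursively scan the chunk totals (last entry of each chunk)
    totals = matmul_scan(local[:, -1].copy(), s)
    # broadcast-add the cumulative total of preceding chunks to each chunk
    local[1:, :] += totals[:-1, None]
    return local.reshape(-1)

x = np.arange(1, 17)                    # n = 16 = s^2 with s = 4
print(matmul_scan(x))                   # matches np.cumsum(x)
```

The reshape/stride bookkeeping here stands in for the memory-layout orchestration that the appendix's implementation performs before each batched multiplication.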
A.1 Correctness of Algorithm 1
In this section, we prove the correctness of Algorithm 1. We reformulate Algorithm 1 using recursion, as stated in Algorithm 2; the recursive formulation enables a proof of correctness by strong induction. Specifically, we prove by induction that MatMulScanRecursive is correct for all inputs whose size is a power of s, for an arbitrary \(s\ge 2\).
In particular, it suffices to show that the Recurse method with inputs \(\boldsymbol{z}\) and s satisfies the following precondition/postcondition relation: given the precondition that the “local” prefix sums are precomputed on all consecutive chunks of size s of \(\boldsymbol{z}\), i.e., on the index ranges \((0,1,\dots , s-1), (s,s+1,\dots , 2s-1), \dots \), Recurse returns the prefix sum of \(\boldsymbol{z}\) (postcondition). Indeed, by the definition of MatMulScanRecursive, Line 2 computes the “local” prefix sums of size s, and Line 3 calls the Recurse method with the precondition satisfied.
Base Case. For inputs of size less than s, the termination criterion of Line 6 is met; the postcondition then follows directly from the precondition, since the whole input forms a single chunk whose “local” prefix sum is its full prefix sum.
Inductive Step. The inductive hypothesis is that Recurse is correct for all input sizes strictly less than n; we show that it is correct for inputs of size n. Given an input \(\boldsymbol{z}\) whose “local” prefix sums are all precomputed, we prove that Recurse returns the prefix sum of \(\boldsymbol{z}\). Line 8 computes the “local” prefix sums on the s-strided subvector \(\boldsymbol{x}[\text {start}::\text {step}]\), and Line 9 computes the prefix sum of \(\boldsymbol{x}[\text {start}::\text {step}]\) by the inductive hypothesis. Then, Line 11 broadcasts and adds the correct prefix sum values of the s-strided subvector of \(\boldsymbol{z}\) to the s indices following each strided position. Hence, the postcondition of Recurse holds.
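To make the broadcast-add step concrete, here is a small hedged example with s = 3 (the chunk values are illustrative, and np.cumsum stands in for the recursive call on the strided subvector):

```python
import numpy as np

s = 3
# z satisfying the precondition: each consecutive chunk of size s holds its
# local prefix sums (underlying values are 1,2,3 | 2,3,4 | 4,4,5)
z = np.array([1, 3, 6, 2, 5, 9, 4, 8, 13])
# stand-in for Recurse: prefix sums of the s-strided subvector z[s-1::s]
totals = np.cumsum(z[s - 1 :: s])        # [6, 15, 28]
out = z.reshape(-1, s).copy()
out[1:, :] += totals[:-1, None]          # add each total to the following chunk
print(out.reshape(-1))                   # [ 1  3  6  8 11 15 19 23 28]
```

The result is the full prefix sum of the underlying values, matching the postcondition.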
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zouzias, A., McColl, W.F. (2023). A Parallel Scan Algorithm in the Tensor Core Unit Model. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_33
DOI: https://doi.org/10.1007/978-3-031-39698-4_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39697-7
Online ISBN: 978-3-031-39698-4