
A Parallel Scan Algorithm in the Tensor Core Unit Model

  • Conference paper
Euro-Par 2023: Parallel Processing (Euro-Par 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14100)


Abstract

We present a parallel scan (prefix sum) algorithm in the Tensor Core Unit (TCU) model of computation. The TCU model assumes that multiplication between two square matrices of constant size s is a basic operation. In the \((s^2,\ell )\)-TCU model, we show that for inputs of size n, the algorithm has depth at most \(2\lfloor \log _s(n)\rfloor \) and runs in time \(\mathcal {O}(n(1+\ell /s^2) / p + (s^2 + \ell ) \log _s (n))\) assuming p tensor core units. Equivalently, the algorithm performs \(\mathcal {O}(n/s^2)\) multiplications of square matrices of size s.


Notes

  1.

    The first parameter of the TCU model is written as \(s^2\) to avoid square roots in the expressions for the matrix sizes.

  2.

    That said, the cost of memory operations (memory coalescing, bank conflicts, etc.) is crucial to achieving high performance in an actual implementation.

  3.

    Fan-in is the maximum number of inputs an adder can have. Similarly, fan-out is the maximum number of outputs.

  4.

    The authors' main goal is to provide highly optimized kernels; hence, they use the warp/block/grid terminology of the CUDA programming model.

References

  1. Blelloch, G.E.: Prefix sums and their applications. In: Synthesis of Parallel Algorithms, pp. 35–60. Morgan Kaufmann (1990)


  2. Brent, R.P., Kung, H.T.: A regular layout for parallel adders. IEEE Trans. Comput. C-31(3), 260–264 (1982). https://doi.org/10.1109/TC.1982.1675982

  3. Brent, R.P., Kung, H.T.: The chip complexity of binary arithmetic. In: Proceedings of the Symposium on Theory of Computing (STOC), pp. 190–200. ACM (1980). https://doi.org/10.1145/800141.804666

  4. Brent, R.P.: The parallel evaluation of general arithmetic expressions. J. ACM 21(2), 201–206 (1974). https://doi.org/10.1145/321812.321815


  5. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of International Conference on Knowledge Discovery and Data Mining (KDD), pp. 785–794. ACM (2016). https://doi.org/10.1145/2939672.2939785

  6. Chowdhury, R., Silvestri, F., Vella, F.: A computational model for tensor core units. In: Proceedings of Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 519–521. ACM (2020). https://doi.org/10.1145/3350755.3400252

  7. Chowdhury, R., Silvestri, F., Vella, F.: Algorithm design for tensor units. In: Sousa, L., Roma, N., Tomás, P. (eds.) Euro-Par 2021. LNCS, vol. 12820, pp. 353–367. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-85665-6_22


  8. Dakkak, A., Li, C., Xiong, J., Gelado, I., Hwu, W.M.: Accelerating reduction and scan using tensor core units. In: Proceedings of the ACM International Conference on Supercomputing, ICS 2019, pp. 46–57. ACM (2019). https://doi.org/10.1145/3330345.3331057

  9. Harris, C., et al.: Array programming with NumPy. Nature 585(7825), 357–362 (2020). https://doi.org/10.1038/s41586-020-2649-2


  10. Harris, D.: A taxonomy of parallel prefix networks. In: Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2, pp. 2213–2217 (2003). https://doi.org/10.1109/ACSSC.2003.1292373

  11. Hillis, W.D., Steele, G.L.: Data parallel algorithms. Commun. ACM 29(12), 1170–1183 (1986). https://doi.org/10.1145/7902.7903


  12. Hwu, W.W., Kirk, D.B., El Hajj, I.: Programming Massively Parallel Processors, 4th edn. Morgan Kaufmann (2023). https://doi.org/10.1016/B978-0-323-91231-0.00006-9

  13. Jouppi, N.P., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of International Symposium on Computer Architecture (ISCA), pp. 1–12. ACM (2017). https://doi.org/10.1145/3079856.3080246

  14. Jouppi, N.P., et al.: Ten lessons from three generations shaped Google’s TPUv4i: industrial product. In: Proceedings of International Symposium on Computer Architecture (ISCA), pp. 1–14 (2021). https://doi.org/10.1109/ISCA52012.2021.00010

  15. Kogge, P.M., Stone, H.S.: A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Trans. Comput. C-22(8), 786–793 (1973). https://doi.org/10.1109/TC.1973.5009159

  16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105 (2012)


  17. Liao, H., et al.: Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: industry track paper. In: Proceedings of International Symposium on High-Performance Computer Architecture (HPCA), pp. 789–801. IEEE (2021). https://doi.org/10.1109/HPCA51647.2021.00071

  18. Liao, H., Tu, J., Xia, J., Zhou, X.: DaVinci: a scalable architecture for neural network computing. In: Hot Chips Symposium on High-Performance Chips (HCS), pp. 1–44 (2019). https://doi.org/10.1109/HOTCHIPS.2019.8875654

  19. NVIDIA Authors: NVIDIA DGX-1 with Tesla V100 system architecture. Technical report, NVIDIA Corporation (2017). https://images.nvidia.com/content/pdf/dgx1-v100-system-architecture-whitepaper.pdf

  20. Sklansky, J.: Conditional-sum addition logic. IRE Trans. Electron. Comput. EC-9(2), 226–231 (1960). https://doi.org/10.1109/TEC.1960.5219822

  21. Snir, M.: Depth-size trade-offs for parallel prefix computation. J. Algorithms 7(2), 185–201 (1986). https://doi.org/10.1016/0196-6774(86)90003-9


  22. Zhu, H., Cheng, C.K., Graham, R.: On the construction of zero-deficiency parallel prefix circuits with minimum depth. ACM Trans. Des. Autom. Electron. Syst. 11(2), 387–409 (2006). https://doi.org/10.1145/1142155.1142162


  23. Zimmermann, R.V.: Binary adder architectures for cell-based VLSI and their synthesis. Ph.D. thesis, Swiss Federal Institute of Technology Zurich, Zurich (1997)



Author information

Correspondence to Anastasios Zouzias.


A Appendix

We provide a functional end-to-end (but not high-performance) Python implementation of Algorithm 1 using NumPy (v1.24.1) [9]. The implementation demonstrates the memory layout operations required to orchestrate the batched matrix multiplications of Algorithm 1.

[Listing: end-to-end Python/NumPy implementation of Algorithm 1 (rendered as a figure in the original)]
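The listing itself appears only as an image on the publisher's page and is not reproduced here. As a stand-in, the following is a minimal NumPy sketch (ours, not the authors' listing) of the central layout trick the appendix refers to: the input vector is reshaped into a batch of length-s chunks, and a single batched multiplication with an s-by-s lower-triangular all-ones matrix yields the chunk-local prefix sums. The helper name local_scan and the assumption that the input length is a multiple of s are ours.

import numpy as np

def local_scan(z: np.ndarray, s: int) -> np.ndarray:
    """Chunk-local prefix sums via one batched matrix multiplication.

    Assumes z.size is a multiple of s. Each consecutive chunk of size s is
    multiplied by an s-by-s lower-triangular all-ones matrix, which replaces
    the chunk by its prefix sums.
    """
    L = np.tril(np.ones((s, s)))      # L @ a gives the prefix sums of a
    batch = z.reshape(-1, s)          # memory layout: one chunk per row
    return (batch @ L.T).reshape(-1)  # row-wise prefix sums, flattened back

# Example: local_scan(np.array([1., 2., 3., 4.]), 2) -> array([1., 3., 3., 7.])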

A.1 Correctness of Algorithm 1

In this section, we prove the correctness of Algorithm 1. We reformulate Algorithm 1 using recursion, as stated in Algorithm 2. The recursive formulation enables us to prove correctness using strong induction. Indeed, we prove by induction that MatMulScanRecursive is correct for all input sizes that are powers of s, for an arbitrary \(s\ge 2\).

[Listing: Algorithm 2, MatMulScanRecursive, the recursive reformulation of Algorithm 1 (rendered as a figure in the original)]
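Algorithm 2 is likewise rendered only as an image. To make the line references in the proof below easier to follow, here is a hedged Python sketch of the recursive formulation, reusing the local_scan helper sketched above. The function names and the assumption that the input length is a power of s (and at least s) are ours, and the "Line" comments point to the statements the proof refers to rather than to the authors' exact pseudocode.

def matmul_scan_recursive(x: np.ndarray, s: int) -> np.ndarray:
    """Prefix sums of x, assuming x.size is a power of s and at least s (s >= 2)."""
    z = local_scan(x, s)   # "Line 2": chunk-local prefix sums of size s
    return recurse(z, s)   # "Line 3": the precondition of Recurse now holds

def recurse(z: np.ndarray, s: int) -> np.ndarray:
    """Precondition: every consecutive chunk of size s of z is locally scanned.
    Postcondition: returns the full prefix sums of the underlying input."""
    n = z.size
    if n <= s:                      # "Line 6": a single chunk, already fully scanned
        return z
    ends = z[s - 1::s]              # s-strided subvector of chunk totals
    ends = local_scan(ends, s)      # "Line 8": local scans on the strided subvector
    ends = recurse(ends, s)         # "Line 9": full prefix sums of the chunk totals
    # "Line 11": broadcast-add the prefix sum through chunk k-1 to every entry of chunk k
    offsets = np.concatenate(([0.0], ends[:-1]))
    return (z.reshape(-1, s) + offsets[:, None]).reshape(-1)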

In particular, it suffices to show that the Recurse method with input \(\boldsymbol{z}\) and s satisfies the following precondition/postcondition relation: given the precondition that on all consecutive chunks of size s of \(\boldsymbol{z}\), i.e., \((0,1,\dots , s-1), (s,s+1,\dots , 2s-1), \dots \), the “local” prefix sums of each chunk are precomputed, Recurse returns the prefix sum of \(\boldsymbol{z}\) (postcondition). Indeed, by the definition of MatMulScanRecursive, Line 2 computes the “local” prefix sums of size s and, in Line 3, the Recurse method is called with its precondition satisfied.

Base Case. For inputs of size less than s, the termination criterion of Line 6 is met; the postcondition then follows directly from the precondition, since an input of size less than s consists of a single chunk whose local prefix sum is already its full prefix sum.

Inductive Step. The inductive hypothesis is that Recurse is correct for input sizes strictly less than n. We show that Recurse is correct for inputs of size n. Indeed, given an input \(\boldsymbol{z}\) whose “local” prefix sums are all precomputed, we prove that Recurse with input \(\boldsymbol{z}\) returns the prefix sum of \(\boldsymbol{z}\). Now, Line 8 computes the “local” prefix sums on the s-strided subvector \(\boldsymbol{x}[\text {start}::\text {step}]\). The prefix sum of \(\boldsymbol{x}[\text {start}::\text {step}]\) is then computed on Line 9 by the inductive hypothesis. Finally, Line 11 broadcasts each (now correct) prefix sum value of the s-strided subvector of \(\boldsymbol{z}\) and adds it to the s entries that follow the corresponding strided position. Hence, the postcondition of Recurse holds.
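The induction above can also be sanity-checked numerically. Using the hypothetical helpers sketched earlier, the following snippet (ours) compares the result against NumPy's built-in cumulative sum for a few input lengths that are powers of s:

# Sanity check of the correctness claim for input lengths that are powers of s.
rng = np.random.default_rng(0)
for s in (2, 3, 4):
    for k in (1, 2, 3):
        x = rng.integers(0, 10, size=s ** k).astype(float)
        assert np.allclose(matmul_scan_recursive(x, s), np.cumsum(x))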


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zouzias, A., McColl, W.F. (2023). A Parallel Scan Algorithm in the Tensor Core Unit Model. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_33

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-39698-4_33


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-39697-7

  • Online ISBN: 978-3-031-39698-4

