
AMT: asynchronous in-place matrix transpose mechanism for Sunway many-core processor

Published in: The Journal of Supercomputing

Abstract

Matrix multiplication is widely used across application domains. When the input matrices and the product differ in memory format, a matrix transpose is required, and its efficiency has a non-negligible impact on overall performance. However, the state-of-the-art software solution and its optimizations suffer from low efficiency because they frequently interfere with the main pipeline and cannot overlap matrix transpose with matrix multiplication. To address this issue, we propose AMT, an asynchronous, in-place matrix transpose mechanism based on the C2R algorithm, to perform matrix transpose efficiently. AMT performs the transpose in an asynchronous processing module and provides two customized asynchronous matrix transpose instructions to drive it. We implement the logic design of AMT in RTL and verify its correctness. Simulation results show that AMT achieves an average speedup of 1.27x (up to 1.48x) over a state-of-the-art software baseline and reaches 95.4% of the performance of an ideal method. Overhead analysis shows that AMT incurs only small area and power overheads.
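AMT itself is a hardware mechanism, but the in-place rectangular transpose problem it targets can be sketched in software. The sketch below is not the paper's C2R decomposition or its asynchronous instructions; it is the classic cycle-following formulation that C2R-style in-place transposition builds on, with illustrative names. For an m x n row-major matrix stored in a flat array of length mn, the element at flat index i must move to index (i*m) mod (mn-1), and the transpose is performed by following the cycles of this permutation with O(1) extra data storage per cycle:

```python
def transpose_inplace(a, m, n):
    """In-place transpose of an m x n row-major matrix stored flat in list a.

    After the call, a holds the n x m transpose in row-major order.
    Illustrative sketch: uses a visited bitmap for clarity; real in-place
    variants avoid it by identifying cycle leaders arithmetically.
    """
    size = m * n
    if size <= 2:
        return a
    visited = [False] * size  # indices 0 and size-1 are fixed points
    for start in range(1, size - 1):
        if visited[start]:
            continue
        i = start
        tmp = a[i]
        while True:
            j = (i * m) % (size - 1)  # destination index after transpose
            a[j], tmp = tmp, a[j]     # drop the carried value, pick up the next
            visited[j] = True
            i = j
            if i == start:
                break
    return a

# 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] -> 3 x 2 [[1, 4], [2, 5], [3, 6]]
transpose_inplace([1, 2, 3, 4, 5, 6], 2, 3)  # -> [1, 4, 2, 5, 3, 6]
```

The cycle structure makes the memory access pattern irregular and data dependent, which is one reason a software transpose stalls the main pipeline and why offloading it to an asynchronous hardware module, as AMT does, can pay off.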




Author information

Corresponding author

Correspondence to Zhengbo Chen.



Cite this article

Chen, Z., Wang, D., Yu, Q. et al. AMT: asynchronous in-place matrix transpose mechanism for Sunway many-core processor. J Supercomput 78, 9456–9474 (2022). https://doi.org/10.1007/s11227-021-04282-6
