Abstract
Matrix multiplication is widely used across many application domains. When the input matrices and the product are stored in different memory formats, a matrix transpose is required, and its efficiency has a non-negligible impact on overall performance. However, the state-of-the-art software solution and its optimizations suffer from low efficiency: they frequently interfere with the main pipeline and cannot overlap matrix transpose with matrix multiplication. To address this issue, we propose AMT, an asynchronous, in-place matrix transpose mechanism based on the C2R algorithm. AMT performs the transpose in an asynchronous processing module and introduces two customized asynchronous matrix transpose instructions to drive it. We implement the logic design of AMT in RTL and verify its correctness. Simulation results show that AMT achieves an average speedup of 1.27x (up to 1.48x) over a state-of-the-art software baseline and reaches 95.4% of the performance of an ideal method. Overhead analysis shows that AMT incurs only small area and power overheads.
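To make the problem concrete, the following is a minimal sketch of in-place transposition of a rectangular matrix stored in a flat row-major buffer, using the classic cycle-following permutation (element at flat index k moves to index (k*m) mod (mn-1)). This illustrates the general technique AMT accelerates; it is not the authors' C2R decomposition or their hardware design, and the function name is our own.

```python
def transpose_inplace(a, m, n):
    """In-place transpose of an m x n row-major matrix stored flat in list a.

    The transpose permutation maps flat index k to (k * m) mod (m*n - 1),
    with indices 0 and m*n - 1 as fixed points. We follow each permutation
    cycle, carrying one element at a time, so no second buffer is needed.
    """
    size = m * n
    if size <= 1:
        return a
    visited = [False] * size  # bookkeeping only; real in-place codes use bit tricks
    for start in range(1, size - 1):
        if visited[start]:
            continue
        k = start
        val = a[k]
        while True:
            dest = (k * m) % (size - 1)   # destination of the element held in val
            a[dest], val = val, a[dest]   # place val, pick up the displaced element
            visited[dest] = True
            k = dest
            if dest == start:             # cycle closed; displaced element was already placed
                break
    return a

# Example: 2x3 matrix [[0,1,2],[3,4,5]] becomes 3x2 [[0,3],[1,4],[2,5]]
transpose_inplace([0, 1, 2, 3, 4, 5], 2, 3)  # -> [0, 3, 1, 4, 2, 5]
```

The auxiliary `visited` array makes the sketch easy to follow; production in-place transposes (including C2R-style decompositions) avoid this O(mn) bookkeeping, which is one source of the software inefficiency the paper targets.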
Cite this article
Chen, Z., Wang, D., Yu, Q. et al. AMT: asynchronous in-place matrix transpose mechanism for sunway many-core processor. J Supercomput 78, 9456–9474 (2022). https://doi.org/10.1007/s11227-021-04282-6