Abstract
Matrix multiplication is widely used across many application domains. When the input matrices and the product are stored in different memory formats, a matrix transpose is required, and its efficiency has a non-negligible impact on overall performance. However, the state-of-the-art software solution and its optimizations suffer from low efficiency: they frequently interfere with the main pipeline and cannot overlap matrix transpose with matrix multiplication. To address this issue, we propose AMT, an asynchronous, in-place matrix transpose mechanism based on the C2R algorithm. AMT performs the transpose in an asynchronous processing module and introduces two customized asynchronous matrix transpose instructions to drive it. We implement the logic design of AMT in RTL and verify its correctness. Simulation results show that AMT achieves an average speedup of 1.27x (up to 1.48x) over a state-of-the-art software baseline and reaches 95.4% of the performance of an ideal method. Overhead analysis shows that AMT incurs only small area and power overheads.
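To make the problem concrete, the following is a minimal sketch of in-place transposition of a rectangular matrix stored in a flat row-major buffer, using the classic cycle-following permutation (element at flat index k moves to index (k*m) mod (mn-1)). This illustrates the general technique AMT accelerates; it is not the authors' C2R decomposition or their hardware design, and the function name is our own.

```python
def transpose_inplace(a, m, n):
    """In-place transpose of an m x n row-major matrix stored flat in list a.

    The transpose permutation maps flat index k to (k * m) mod (m*n - 1),
    with indices 0 and m*n - 1 as fixed points. We follow each permutation
    cycle, carrying one element at a time, so no second buffer is needed.
    """
    size = m * n
    if size <= 1:
        return a
    visited = [False] * size  # bookkeeping only; real in-place codes use bit tricks
    for start in range(1, size - 1):
        if visited[start]:
            continue
        k = start
        val = a[k]
        while True:
            dest = (k * m) % (size - 1)   # destination of the element held in val
            a[dest], val = val, a[dest]   # place val, pick up the displaced element
            visited[dest] = True
            k = dest
            if dest == start:             # cycle closed; displaced element was already placed
                break
    return a

# Example: 2x3 matrix [[0,1,2],[3,4,5]] becomes 3x2 [[0,3],[1,4],[2,5]]
transpose_inplace([0, 1, 2, 3, 4, 5], 2, 3)  # -> [0, 3, 1, 4, 2, 5]
```

The auxiliary `visited` array makes the sketch easy to follow; production in-place transposes (including C2R-style decompositions) avoid this O(mn) bookkeeping, which is one source of the software inefficiency the paper targets.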
Cite this article
Chen, Z., Wang, D., Yu, Q. et al. AMT: asynchronous in-place matrix transpose mechanism for sunway many-core processor. J Supercomput 78, 9456–9474 (2022). https://doi.org/10.1007/s11227-021-04282-6