QR-PULP: Streamlining QR Decomposition for RISC-V Parallel Ultra-Low-Power Platforms

Authors:

Amirhossein Kiamarzi,

Giuseppe Tagliavini

CF '24: Proceedings of the 21st ACM International Conference on Computing Frontiers

Pages 147 - 154

https://doi.org/10.1145/3649153.3649210

Published: 02 July 2024

Abstract

QR decomposition is a numerical method used in many applications, from the High-Performance Computing (HPC) domain to embedded systems. This broad spectrum of applications has drawn academic and commercial attention, leading to the development of many software libraries and domain-specific hardware solutions. In the Internet of Things (IoT) domain, multicore Parallel Ultra-Low-Power (PULP) architectures are emerging as energy-efficient alternatives, outperforming conventional single-core devices by coupling parallel processing with near-threshold computing. To the best of the authors' knowledge, our study introduces the first parallelized and optimized implementation of three distinct QR decomposition methods (Givens rotations, Gram-Schmidt process, and Householder transformation) on GAP9, a commercial embodiment of the PULP architecture. Parallel execution on the 8-core cluster leads to a reduction in the total number of cycles by 241% for Givens rotations, 470% for Gram-Schmidt, and 567% for Householder, compared to the GAP9 1-core scenario, while the three methods consume only 0.013 mJ, 0.012 mJ, and 0.216 mJ, respectively. Compared to traditional single-core ARM-based architectures, we achieve 8×, 24×, and 30× better performance and 36×, 35×, and 30× better energy efficiency, paving the way for broad adoption of complex linear algebra tasks in the IoT domain.
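
For readers unfamiliar with the methods named in the abstract, the following is a minimal sequential sketch of one of them, QR decomposition by Givens rotations, in plain Python. It is illustrative only and is not the authors' parallel GAP9 implementation. The function names `givens_qr` and `matmul` are hypothetical helpers introduced for this example.

```python
import math


def matmul(x, y):
    """Multiply two list-of-lists matrices (hypothetical helper for checking results)."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def givens_qr(a):
    """Return (q, r) such that a = q * r, with q orthogonal and r upper triangular."""
    m, n = len(a), len(a[0])
    r = [[float(v) for v in row] for row in a]   # working copy that becomes R
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]  # accumulates Q
    for j in range(n):
        # Zero the entries below the diagonal of column j, from the bottom up.
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            if y == 0.0:
                continue                          # entry is already zero
            h = math.hypot(x, y)
            c, s = x / h, y / h                   # cosine and sine of the rotation
            # Rotate rows i-1 and i of R so that r[i][j] becomes zero.
            for k in range(n):
                t1, t2 = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * t1 + s * t2
                r[i][k] = -s * t1 + c * t2
            # Apply the transposed rotation to columns i-1 and i of Q.
            for k in range(m):
                t1, t2 = q[k][i - 1], q[k][i]
                q[k][i - 1] = c * t1 + s * t2
                q[k][i] = -s * t1 + c * t2
    return q, r


A = [[12.0, -51.0, 4.0], [6.0, 167.0, -68.0], [-4.0, 24.0, -41.0]]
Q, R = givens_qr(A)
```

Each rotation touches only two rows of the working matrix, so rotations acting on disjoint row pairs do not interfere with one another. This property is what makes the method a candidate for parallel execution, although the sketch above runs strictly sequentially.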

Published In

CF '24: Proceedings of the 21st ACM International Conference on Computing Frontiers

May 2024

345 pages

ISBN: 9798400705977

DOI: 10.1145/3649153

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Italian Ministry of University and Research (MUR)

Conference

CF '24

Sponsor:

SIGMICRO

CF '24: 21st ACM International Conference on Computing Frontiers

May 7 - 9, 2024

Ischia, Italy

Acceptance Rates

CF '24 Paper Acceptance Rate 33 of 105 submissions, 31%;

Overall Acceptance Rate 273 of 785 submissions, 35%

Bibliometrics

Article Metrics

Total Citations: 0
Total Downloads: 251

Downloads (Last 12 months): 251
Downloads (Last 6 weeks): 46

Reflects downloads up to 11 Feb 2025
