research-article
Open access

QR-PULP: Streamlining QR Decomposition for RISC-V Parallel Ultra-Low-Power Platforms

Published: 02 July 2024

Abstract

QR decomposition is a numerical method used in applications ranging from the High-Performance Computing (HPC) domain to embedded systems. This broad spectrum of applications has drawn academic and commercial attention, leading to many software libraries and domain-specific hardware solutions. In the Internet of Things (IoT) domain, multicore Parallel Ultra-Low-Power (PULP) architectures are emerging as energy-efficient alternatives, outperforming conventional single-core devices by coupling parallel processing with near-threshold computing. To the best of our knowledge, this study introduces the first parallelized and optimized implementation of three distinct QR decomposition methods (Givens rotations, the Gram-Schmidt process, and the Householder transformation) on GAP9, a commercial embodiment of the PULP architecture. Parallel execution on the 8-core cluster leads to a reduction in the total number of cycles by 241% for Givens rotations, 470% for Gram-Schmidt, and 567% for Householder, compared to the GAP9 single-core scenario, while the three methods consume only 0.013 mJ, 0.012 mJ, and 0.216 mJ, respectively. Compared to traditional single-core ARM-based architectures, we achieve 8×, 24×, and 30× better performance and 36×, 35×, and 30× better energy efficiency, paving the way for broad adoption of complex linear algebra tasks in the IoT domain.
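As a rough illustration of one of the three methods named in the abstract, the sketch below factors a small matrix with the modified Gram-Schmidt process. It is a sequential, pure-Python sketch written for this page, not the authors' GAP9 implementation: the function names `qr_gram_schmidt` and `matmul` are hypothetical, the input is assumed to be a full-column-rank matrix given as a list of rows, and no PULP-specific parallelism or fixed-precision arithmetic is modeled.

```python
import math


def qr_gram_schmidt(a):
    """Modified Gram-Schmidt QR of an m x n matrix (list of rows), m >= n.

    Returns (q, r) with q of size m x n (orthonormal columns) and r of size
    n x n (upper triangular), so that a is approximately q times r.
    Hypothetical helper for illustration only.
    """
    m, n = len(a), len(a[0])
    # Store A column by column; v[j] is overwritten with the j-th column of Q.
    v = [[float(a[i][j]) for i in range(m)] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Normalize the current column to obtain q_j.
        r[j][j] = math.sqrt(sum(x * x for x in v[j]))
        if r[j][j] == 0.0:
            raise ValueError("matrix is rank deficient")
        v[j] = [x / r[j][j] for x in v[j]]
        # Remove the q_j component from every remaining column.
        # Each iteration over k touches a different column, so these
        # updates are independent of one another.
        for k in range(j + 1, n):
            r[j][k] = sum(qx * vx for qx, vx in zip(v[j], v[k]))
            v[k] = [vx - r[j][k] * qx for qx, vx in zip(v[j], v[k])]
    # Transpose the column list back into a row-major m x n matrix.
    q = [[v[j][i] for j in range(n)] for i in range(m)]
    return q, r


def matmul(x, y):
    """Plain triple-loop matrix product, used only to check the result."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]
```

Calling `q, r = qr_gram_schmidt(a)` and comparing `matmul(q, r)` against `a` entry by entry, within a small tolerance, is a quick sanity check. The inner loop over the remaining columns is one natural place to split work across cores, although how the paper actually partitions the computation is not stated in the abstract.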

Published In

CF '24: Proceedings of the 21st ACM International Conference on Computing Frontiers
May 2024
345 pages
ISBN:9798400705977
DOI:10.1145/3649153
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. QR decomposition
  2. parallel algorithms
  3. ultra-low-power computing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Italian Ministry of University and Research (MUR)

Conference

CF '24
Acceptance Rates

CF '24 Paper Acceptance Rate 33 of 105 submissions, 31%;
Overall Acceptance Rate 273 of 785 submissions, 35%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 251
  • Downloads (Last 12 months): 251
  • Downloads (Last 6 weeks): 46

Reflects downloads up to 11 Feb 2025
