ABSTRACT
In recent years, hardware architectures optimized for general matrix multiplication (GEMM) have been studied extensively to deliver better performance and efficiency for deep neural networks. With the trend toward batched, low-precision data (e.g., the FP8 format considered in this work), we observe growing untapped potential for value reuse. We propose a novel computing paradigm, value-level parallelism, in which unique products are computed only once and different inputs subscribe to (select) their products via temporal coding. Our architecture, Carat, employs value-level parallelism to transform multiplication into accumulation, performing GEMMs with efficient multiplier-free hardware. Experiments show that, on average, Carat improves iso-area throughput and energy efficiency by 1.02× and 1.06× over a systolic array, and by 3.2× and 4.3× when scaled up to multiple nodes.
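The core idea of value-level parallelism can be illustrated in software: with low-precision operands such as FP8, the set of distinct operand values is small, so many scalar products in a batched GEMM are duplicates. The sketch below is a hypothetical software analogue (not the paper's hardware design, which uses temporal coding rather than a lookup table): each unique operand pair is multiplied once, and every output that needs it "subscribes" to the cached product.

```python
def gemm_value_reuse(A, B):
    """Multiply A (m x k) by B (k x n), computing each unique scalar
    product only once. A software sketch of value-level parallelism;
    the function name and dict-based cache are illustrative assumptions,
    not Carat's actual mechanism."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    product_cache = {}  # (a, b) -> a*b, computed once per unique value pair
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for t in range(k):
                key = (A[i][t], B[t][j])
                if key not in product_cache:
                    product_cache[key] = A[i][t] * B[t][j]
                acc += product_cache[key]  # subscribe to the shared product
            C[i][j] = acc
    return C
```

With 8-bit operands there are at most 256 × 256 unique pairs, so for large batched GEMMs the number of multiplications is bounded by the value space rather than the matrix dimensions; the remaining work is pure accumulation, which is the property Carat exploits to go multiplier-free.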
Carat: Unlocking Value-Level Parallelism for Multiplier-Free GEMMs