Convolution engine: balancing efficiency and flexibility in specialized computing

Authors:

Wajahat Qadeer,

Preethi Venkatesan,

Christos Kozyrakis,

Mark Horowitz

Communications of the ACM, Volume 58, Issue 4

Pages 85 - 93

https://doi.org/10.1145/2735841

Published: 23 March 2015

Abstract

General-purpose processors, while tremendously versatile, pay a huge cost for their flexibility by wasting over 99% of the energy in programmability overheads. We observe that reducing this waste requires tuning data storage and compute structures and their connectivity to the data-flow and data-locality patterns in the algorithms. Hence, by backing off from full programmability and instead targeting key data-flow patterns used in a domain, we can create efficient engines that can be programmed and reused across a wide range of applications within that domain.

We present the Convolution Engine (CE), a programmable processor specialized for the convolution-like data-flow prevalent in computational photography, computer vision, and video processing. The CE achieves energy efficiency by capturing data-reuse patterns, eliminating data transfer overheads, and enabling a large number of operations per memory access. We demonstrate that the CE is within a factor of 2-3× of the energy and area efficiency of custom units optimized for a single kernel. The CE improves energy and area efficiency by 8-15× over data-parallel Single Instruction Multiple Data (SIMD) engines for most image processing applications.
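To make "a large number of operations per memory access" concrete, here is a minimal Python sketch of a 1D convolution-like kernel computed two ways. It is only an illustration of the data-reuse idea, not the CE's actual hardware or instruction set; the function names and the load counters are hypothetical and introduced here for the example.

```python
from collections import deque


def conv1d_naive(signal, taps):
    """Re-read every input under the window for each output (no reuse)."""
    k = len(taps)
    loads = 0  # hypothetical counter: simulated memory reads
    out = []
    for i in range(len(signal) - k + 1):
        acc = 0
        for j in range(k):
            acc += signal[i + j] * taps[j]
            loads += 1  # one memory read per multiply-accumulate
        out.append(acc)
    return out, loads


def conv1d_shift_register(signal, taps):
    """Read each input once into a small shift register and reuse it."""
    k = len(taps)
    window = deque(maxlen=k)  # stands in for a k-entry shift register
    loads = 0
    out = []
    for x in signal:
        window.append(x)  # one memory read per input; oldest entry drops out
        loads += 1
        if len(window) == k:
            out.append(sum(w * t for w, t in zip(window, taps)))
    return out, loads


if __name__ == "__main__":
    signal = list(range(16))
    taps = [1, 2, 1]
    out_a, loads_a = conv1d_naive(signal, taps)
    out_b, loads_b = conv1d_shift_register(signal, taps)
    macs = len(out_a) * len(taps)
    print("same result:", out_a == out_b)
    print("MACs per load, naive:", macs / loads_a)
    print("MACs per load, shift register:", macs / loads_b)
```

Both versions perform the same multiply-accumulates, but the second reads each input once, so the ratio of operations to memory accesses grows with the number of taps. This is the kind of reuse pattern the abstract refers to, shown under the simplifying assumption of a 1D window.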

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
2. Hardware
  1. Very large scale integration design
    1. VLSI system specification and constraints

Published In

Communications of the ACM Volume 58, Issue 4

April 2015

86 pages

ISSN:0001-0782

EISSN:1557-7317

DOI:10.1145/2749359

Editor:
Moshe Y. Vardi
Association for Computing Machinery, New York, NY

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Qualifiers

Research-article
Research
Refereed
