DOI: 10.1145/1854273.1854309
research-article

A programmable parallel accelerator for learning and classification

Published: 11 September 2010

Abstract

For learning and classification workloads that operate on large amounts of unstructured data with stringent performance constraints, general purpose processor performance scales poorly with data size. In this paper, we present a programmable accelerator for this workload domain. To architect the accelerator, we profile five representative workloads, and find that their computationally intensive portions can be formulated as matrix or vector operations generating large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding max/min and aggregation. The proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses in-memory processing where on-chip memory blocks perform the secondary reduction operations. By doing so, the intermediate data are dynamically processed and never stored or sent off-chip. Second, MAPLE uses banked off-chip memory, and organizes its PEs into independent groups each with its own off-chip memory bank. These two features together allow MAPLE to scale its performance with data size. This paper describes the MAPLE architecture, explores its design space with a simulator, and illustrates how to automatically map application kernels to the hardware. We also implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5-10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz.
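The computation pattern the abstract describes, a primary matrix or vector operation whose large intermediate output is immediately consumed by a secondary reduction such as max-finding, can be sketched in plain Python. This is a hedged illustration of the pattern only, not the authors' code; the `classify` function and its model data are invented for this example.

```python
# Sketch of MAPLE's compute pattern: a primary matrix-vector operation
# whose intermediate scores are consumed on the fly by a secondary
# reduction (a running max here), so the full score vector is never
# stored, mirroring the paper's in-memory reduction idea.

def classify(model_rows, query):
    """Return the index of the model row with the largest dot product
    against `query`, without materializing the full score vector."""
    best_idx, best_score = -1, float("-inf")
    for i, row in enumerate(model_rows):
        # Primary operation: one PE's share of the matrix-vector product.
        score = sum(a * b for a, b in zip(row, query))
        # Secondary operation: the reduction consumes each intermediate
        # score immediately instead of buffering it off-chip.
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

model = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(classify(model, [0.9, 0.1]))  # prints 0
```

In the hardware described by the abstract, the loop body would be spread across independent PE groups, each streaming its share of `model_rows` from its own off-chip memory bank, with the running max performed inside the on-chip memory blocks.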



      Published In

      PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques
      September 2010
      596 pages
      ISBN:9781450301787
      DOI:10.1145/1854273
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. accelerator-based systems
      2. heterogeneous computing
      3. machine learning
      4. parallel computing

      Qualifiers

      • Research-article

      Conference

      PACT '10
      Sponsor:
      • IFIP WG 10.3
      • IEEE CS TCPP
      • SIGARCH
      • IEEE CS TCAA

      Acceptance Rates

      Overall Acceptance Rate 121 of 471 submissions, 26%


Cited By

• (2023) CNNFlow: Memory-driven Data Flow Optimization for Convolutional Neural Networks. ACM Transactions on Design Automation of Electronic Systems 28(3):1-36, Mar 2023. DOI: 10.1145/3577017
• (2022) Multilayer Perceptron Training Accelerator Using Systolic Array. IEICE Transactions on Information and Systems E105.D(12):2048-2056, Dec 2022. DOI: 10.1587/transinf.2022PAP0003
• (2022) An Efficient Configurable Hardware Accelerator Design for CNN on Low Memory 32-Bit Edge Device. 2022 IEEE International Symposium on Smart Electronic Systems (iSES), pp. 112-117, Dec 2022. DOI: 10.1109/iSES54909.2022.00033
• (2022) Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey. IEEE Access 10:131788-131828, 2022. DOI: 10.1109/ACCESS.2022.3229767
• (2022) Efficient Machine Learning Execution with Near-Data Processing. Microprocessors & Microsystems 90(C), Apr 2022. DOI: 10.1016/j.micpro.2022.104435
• (2022) Double MAC Supported CNN Accelerator. Innovations in Signal Processing and Embedded Systems, pp. 53-64, Sep 2022. DOI: 10.1007/978-981-19-1669-4_6
• (2021) Wild Animal Information Collection Based on Depthwise Separable Convolution in Software Defined IoT Networks. Electronics 10(17):2091, Aug 2021. DOI: 10.3390/electronics10172091
• (2021) Research and Implementation of a Fabric Printing Detection System Based on a Field Programmable Gate Array and Deep Neural Network. Textile Research Journal 92(7-8):1060-1078, Sep 2021. DOI: 10.1177/00405175211048156
• (2021) Fast Algorithms for Quaternion-Valued Convolutional Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 32(1):457-462, Jan 2021. DOI: 10.1109/TNNLS.2020.2979682
• (2021) Machine Learning Migration for Efficient Near-Data Processing. 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 212-219, Mar 2021. DOI: 10.1109/PDP52278.2021.00041
