DOI: 10.1145/1854273.1854309
research-article

A programmable parallel accelerator for learning and classification

Published: 11 September 2010

Abstract

For learning and classification workloads that operate on large amounts of unstructured data with stringent performance constraints, general purpose processor performance scales poorly with data size. In this paper, we present a programmable accelerator for this workload domain. To architect the accelerator, we profile five representative workloads, and find that their computationally intensive portions can be formulated as matrix or vector operations generating large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding max/min and aggregation. The proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses in-memory processing where on-chip memory blocks perform the secondary reduction operations. By doing so, the intermediate data are dynamically processed and never stored or sent off-chip. Second, MAPLE uses banked off-chip memory, and organizes its PEs into independent groups each with its own off-chip memory bank. These two features together allow MAPLE to scale its performance with data size. This paper describes the MAPLE architecture, explores its design space with a simulator, and illustrates how to automatically map application kernels to the hardware. We also implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5-10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz.
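The computation pattern the abstract describes, a primary matrix or vector operation whose large intermediate output is immediately consumed by a secondary reduction such as max-finding, can be sketched in plain Python. This is a hedged illustration of the pattern only, not the authors' code; the `classify` function and its model data are invented for this example.

```python
# Sketch of MAPLE's compute pattern: a primary matrix-vector operation
# whose intermediate scores are consumed on the fly by a secondary
# reduction (a running max here), so the full score vector is never
# stored, mirroring the paper's in-memory reduction idea.

def classify(model_rows, query):
    """Return the index of the model row with the largest dot product
    against `query`, without materializing the full score vector."""
    best_idx, best_score = -1, float("-inf")
    for i, row in enumerate(model_rows):
        # Primary operation: one PE's share of the matrix-vector product.
        score = sum(a * b for a, b in zip(row, query))
        # Secondary operation: the reduction consumes each intermediate
        # score immediately instead of buffering it off-chip.
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

model = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(classify(model, [0.9, 0.1]))  # prints 0
```

In the hardware described by the abstract, the loop body would be spread across independent PE groups, each streaming its share of `model_rows` from its own off-chip memory bank, with the running max performed inside the on-chip memory blocks.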



      Published In

      PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques
      September 2010
      596 pages
      ISBN:9781450301787
      DOI:10.1145/1854273
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. accelerator-based systems
      2. heterogeneous computing
      3. machine learning
      4. parallel computing

      Qualifiers

      • Research-article

      Conference

      PACT '10
      Sponsor:
      • IFIP WG 10.3
      • IEEE CS TCPP
      • SIGARCH
      • IEEE CS TCAA

      Acceptance Rates

      Overall Acceptance Rate 121 of 471 submissions, 26%


Cited By

• (2023) CNNFlow: Memory-driven Data Flow Optimization for Convolutional Neural Networks. ACM Transactions on Design Automation of Electronic Systems 28(3):1-36, Mar 2023. DOI: 10.1145/3577017
• (2022) Multilayer Perceptron Training Accelerator Using Systolic Array. IEICE Transactions on Information and Systems E105.D(12):2048-2056, Dec 2022. DOI: 10.1587/transinf.2022PAP0003
• (2022) An Efficient Configurable Hardware Accelerator Design for CNN on Low Memory 32-Bit Edge Device. 2022 IEEE International Symposium on Smart Electronic Systems (iSES), pp. 112-117, Dec 2022. DOI: 10.1109/iSES54909.2022.00033
• (2022) Efficient Hardware Architectures for Accelerating Deep Neural Networks: Survey. IEEE Access 10:131788-131828, 2022. DOI: 10.1109/ACCESS.2022.3229767
• (2022) Efficient Machine Learning Execution with Near-Data Processing. Microprocessors & Microsystems 90(C), Apr 2022. DOI: 10.1016/j.micpro.2022.104435
• (2022) Double MAC Supported CNN Accelerator. Innovations in Signal Processing and Embedded Systems, pp. 53-64, Sep 2022. DOI: 10.1007/978-981-19-1669-4_6
• (2021) Wild Animal Information Collection Based on Depthwise Separable Convolution in Software Defined IoT Networks. Electronics 10(17):2091, Aug 2021. DOI: 10.3390/electronics10172091
• (2021) Research and Implementation of a Fabric Printing Detection System Based on a Field Programmable Gate Array and Deep Neural Network. Textile Research Journal 92(7-8):1060-1078, Sep 2021. DOI: 10.1177/00405175211048156
• (2021) Fast Algorithms for Quaternion-Valued Convolutional Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 32(1):457-462, Jan 2021. DOI: 10.1109/TNNLS.2020.2979682
• (2021) Machine Learning Migration for Efficient Near-Data Processing. 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 212-219, Mar 2021. DOI: 10.1109/PDP52278.2021.00041
