Elsevier

Integration

Volume 60, January 2018, Pages 213-223
Customizable embedded processor array for multimedia applications

https://doi.org/10.1016/j.vlsi.2017.09.009

Highlights

  • Our proposed architecture CPAMA is a highly configurable processor array.

  • CPAMA was designed considering specifications of image/video processing algorithms.

  • We implemented four different algorithms on an FPGA and in 90 nm CMOS technology.

  • Since CPAMA is a reusable architecture, using it decreases time-to-market of low cost, low power consumer electronics.

Abstract

We propose a Customizable Embedded Processor Array for Multimedia Applications (CPAMA). The architecture can be used as a standalone image/video processing chip in consumer electronics. Its building blocks are all designed for low power and small area, which makes it a good candidate for low-cost consumer electronics. Our contribution is the design of a configurable embedded multimedia processor array that takes the nature of image/video processing applications into account in every basic block of the architecture. Because it is configurable and can connect to other devices, it may be used in a wide range of applications. The architecture is implemented purely in VHDL and does not depend on any technology or design software. We have implemented it for different applications on a Xilinx Virtex-5 device and as a number of Application Specific Integrated Circuits (ASICs) in 90 nm CMOS technology. Experimental case studies show that CPAMA achieves better or comparable results relative to existing similar architectures in terms of performance and energy consumption. Our studies show that the throughput of CPAMA is 0.3x–2.4x better than that of ADRES, and its energy consumption is 31–50% lower. On the other hand, in one configuration of the IDCT application, CPAMA provides 56% less throughput and consumes 55% more energy than ADRES.

Introduction

Computing hardware design methodology has evolved significantly over the years. As chips get larger and the complexity of each design increases, flexibility and short time-to-market, in the form of reprogrammable/reconfigurable chips and systems, grow in importance [1]. Several Multi-Processor Systems-on-Chip (MPSoC) and Coarse-Grained Reconfigurable Architectures (CGRA) have been proposed in recent years [2], [3], [4]. CGRAs may be preferred for several reasons, such as speed, area, power, or IP reusability [3]. Furthermore, compared with Field Programmable Gate Arrays (FPGA), CGRAs have a shorter reconfiguration time. CGRAs are suitable for systems that require intensive computation. By adjusting the number and structure of the processing elements on a CGRA, we can obtain an architecture that meets the requirements of the computation.

Image/video processing is an area where algorithms require intensive computation at high performance. Handling this kind of computation usually requires custom hardware [5]. With today's technology, almost every portable device tends to have a camera, e.g. glasses, watches, and smartphones. Each device has its own configuration and mostly requires different features. Designing dedicated image processing hardware for every device is time consuming and not economically feasible. In most devices, image processing tasks are handled by Systems-on-Chip (SoC) with DSP or GPU cores. A designer who chooses a commercial SoC has to accept what the chip offers in terms of speed and power dissipation. Such architectures may include redundant parts that are never used, and this redundancy leads to extra chip area and power dissipation. On the other hand, implementing an image processing task on a CGRA yields efficient results in terms of area, power dissipation, or speed compared with commercial SoCs [6]. The time-to-market of an image/video processing system implemented on customizable cores such as CGRAs is shorter than that of a custom Application Specific Integrated Circuit (ASIC) [7]. Besides, such systems are easy to adapt to later alterations. Consequently, CGRAs are suitable for the image/video processing tasks of low-power, low-cost consumer electronics.

In this paper, we introduce a Customizable Embedded Processor Array for Multimedia Applications (CPAMA). CPAMA consists of a processor array for intensive computation, and a host processor for control and coordination with other devices. Our configurable architecture is designed by considering the nature and requirements of image processing algorithms:

  • CPAMA processes a multimedia application in sequences of image blocks. Hence, we design a configurable processor array which concurrently processes all pixels in an image block.

  • Each processor of CPAMA can also be configured according to the position of a pixel in an image block depending on the application.
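
The two design points above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the CPAMA implementation: the function names, the block sizes, and the border/interior rule are all hypothetical, and the hardware would apply the per-position operations concurrently, whereas this model simply iterates.

```python
from typing import Callable, List

Block = List[List[int]]
PixelOp = Callable[[int], int]


def split_into_blocks(image: Block, size: int) -> List[Block]:
    """Cut an image into size x size blocks, in row-major block order."""
    rows, cols = len(image), len(image[0])
    if rows % size or cols % size:
        raise ValueError("image dimensions must be multiples of the block size")
    return [
        [row[c:c + size] for row in image[r:r + size]]
        for r in range(0, rows, size)
        for c in range(0, cols, size)
    ]


def make_array(size: int, config: Callable[[int, int], PixelOp]) -> List[List[PixelOp]]:
    """Build one operation per array position; config(row, col) selects it."""
    return [[config(r, c) for c in range(size)] for r in range(size)]


def process_block(array: List[List[PixelOp]], block: Block) -> Block:
    """Apply the operation at each array position to the pixel at the same position.

    A processor array would do this for all pixels at once; this software
    model visits the positions one after another.
    """
    size = len(block)
    return [[array[r][c](block[r][c]) for c in range(size)] for r in range(size)]


def border_aware_config(size: int) -> Callable[[int, int], PixelOp]:
    """Hypothetical position-dependent rule: pass border pixels, invert interior ones."""
    def config(r: int, c: int) -> PixelOp:
        on_border = r in (0, size - 1) or c in (0, size - 1)
        return (lambda p: p) if on_border else (lambda p: 255 - p)
    return config
```

For example, `process_block(make_array(4, border_aware_config(4)), block)` leaves the outer ring of a 4x4 block unchanged and inverts the four interior pixels, which mirrors the idea of configuring a processor by the position of its pixel in the block.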

This architecture can be used in domains that require intensive computation, such as image/video processing and scientific computations that can be mapped onto a two-dimensional (2D) processor array.

This paper is organized as follows. In Section 2, we review related architectures in the literature and show how they differ from the proposed CPAMA. In Section 3, we explain the basic concepts that we refer to in the CPAMA design. In Section 4, we present the configurable hardware architecture of CPAMA in detail. In Section 5, we present our case study implementations and compare them with existing similar architectures. Finally, in Section 6, we give our remarks on the CPAMA architecture and conclude the paper.

Section snippets

Related works

Mei et al. [3] proposed a template-based CGRA called Architecture for Dynamically Reconfigurable Embedded System (ADRES). Coarse-grained reconfiguration refers to reconfiguration of relatively high-level modules, not of logic blocks or Look-Up Tables (LUTs) as in an FPGA. A design tool, the Dynamically Reconfigurable Embedded System Compiler (DRESC) [8], is used to generate the design for this architecture. Propagating data, in other words performing iterations, is implemented in a stream

Basic concepts of CPAMA

CPAMA is designed to be highly generic and flexible. In every development cycle of CPAMA, the requirements and characteristics of image processing applications have been considered. The register files of the processors, the data-path design, the instruction set of the processors, the communication among the processors, and the FIFO structures were all designed with the image processing domain in mind. CPAMA has a template-based configurable structure. Like any template structure, CPAMA has both fixed and configurable

Hardware design

The hardware side of CPAMA consists of a 2D grid network structure, as shown in Fig. 4. Given the nature of image processing, there is a strong similarity between a 2D signal (an image) and a 2D mesh NoC. Therefore, we preferred this type of network in CPAMA.

One processor is connected to each node. Data communication among processors is handled by routers. The image is delivered through the network by FIFOs or routers. The FIFOs are placed in the processors and deliver the data in one (vertical) direction.
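
As a rough software analogy of this organization, the sketch below models the neighbor relation of a 2D mesh and a column of per-processor FIFO cells that pass pixels in the vertical direction. It is illustrative only: the class and function names are hypothetical, and the top-to-bottom shift order and one-cell-per-node depth are assumptions that the text above does not specify.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def mesh_neighbors(row: int, col: int, rows: int, cols: int) -> Dict[str, Tuple[int, int]]:
    """Return the in-bounds neighbors of a node in a 2D mesh (no wraparound)."""
    candidates = {
        "north": (row - 1, col),
        "south": (row + 1, col),
        "west": (row, col - 1),
        "east": (row, col + 1),
    }
    return {d: (r, c) for d, (r, c) in candidates.items() if 0 <= r < rows and 0 <= c < cols}


class VerticalFifoColumn:
    """One mesh column: a pixel pushed at the top shifts earlier pixels one node down.

    Assumption: data enters at the top node and moves downward one node per push.
    """

    def __init__(self, depth: int) -> None:
        self.cells: deque = deque([None] * depth, maxlen=depth)

    def push(self, pixel: int) -> None:
        # With maxlen set, appendleft drops the oldest value off the bottom end.
        self.cells.appendleft(pixel)

    def snapshot(self) -> List[Optional[int]]:
        return list(self.cells)


def load_block(block: List[List[int]]) -> List[List[Optional[int]]]:
    """Stream a block into the array row by row using only vertical FIFO shifts."""
    rows, cols = len(block), len(block[0])
    columns = [VerticalFifoColumn(rows) for _ in range(cols)]
    # Feed the bottom row first so that it ends up at the bottom of the array.
    for row in reversed(block):
        for col, pixel in enumerate(row):
            columns[col].push(pixel)
    return [[columns[c].snapshot()[r] for c in range(cols)] for r in range(rows)]
```

After `rows` pushes per column, every node holds the pixel at its own position, so `load_block(block)` reproduces `block`. Horizontal exchange between processors would go through the routers and is not modeled here.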

Case studies

We have evaluated the performance of CPAMA by implementing four different algorithms: dot product, TIFF to gray-level image transformation (TIFF2BW) [40], Inverse Discrete Cosine Transform (IDCT), and block matching.
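
For readers unfamiliar with these kernels, plain reference versions of three of them are sketched below. They are not the CPAMA mappings. The gray-level weights are the ITU-R BT.601 luma coefficients, and the block-matching cost is the sum of absolute differences (SAD) with an exhaustive search; both are assumptions chosen for illustration, since the text above does not state which weights or cost function the benchmarks use. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Block = List[List[int]]


def dot_product(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def rgb_to_gray(pixel: Tuple[int, int, int]) -> int:
    """Gray level of one RGB pixel using BT.601 luma weights (an assumed choice)."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def sad(block_a: Block, block_b: Block) -> int:
    """Sum of absolute differences between two equally sized blocks."""
    return sum(
        abs(p - q)
        for row_a, row_b in zip(block_a, block_b)
        for p, q in zip(row_a, row_b)
    )


def best_match(reference: Block, frame: Block, size: int) -> Tuple[int, int, int]:
    """Exhaustively search the frame for the size x size block closest to reference.

    Returns (cost, row, col) of the first candidate with the lowest SAD.
    """
    best = None
    for r in range(len(frame) - size + 1):
        for c in range(len(frame[0]) - size + 1):
            candidate = [row[c:c + size] for row in frame[r:r + size]]
            cost = sad(reference, candidate)
            if best is None or cost < best[0]:
                best = (cost, r, c)
    if best is None:
        raise ValueError("frame is smaller than the block size")
    return best
```

All three kernels apply the same small operation to every pixel or pixel pair, which is the kind of regularity that maps naturally onto a 2D processor array.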

Conclusion

Our proposed architecture, CPAMA, is a highly configurable processor array targeted at low-power, low-cost image/video processing devices. In comparison with ADRES, CPAMA has shown better performance in the TIFF2BW application and comparable performance in the IDCT application in terms of energy consumption, throughput, and area. We believe this is because it occupies only the hardware necessary for a given application. This is achieved by considering the image processing nature in every development cycle

Acknowledgment

The authors would like to thank Mr. Gökhan Işık for his recommendations on ASIC implementation, and Dr. Salih Bayar for his help on partial reconfiguration techniques in FPGAs.

References (47)

  • S. Bayar et al.

    A dynamically reconfigurable communication architecture for multicore embedded systems

    J. Syst. Archit.

    (2012)
  • D. Macmillen et al.

    An industrial view of electronic design automation

    IEEE Trans. Comput.-Aided Des. Integr. Circ. Syst.

    (2000)
  • B. De Sutter et al.

    Coarse-grained reconfigurable array architectures

  • B. Mei, S. Vernalde, D. Verkest, H. De Man, R. Lauwereins, Adres: An architecture with tightly coupled vliw processor...
  • D. Gohringer et al.

    Adaptive multiprocessor system-on-chip architecture: new degrees of freedom in system design and runtime support

  • S. Pedre et al.

    Accelerating embedded image processing for real time: a case study

    J. Real-Time Image Process.

    (2013)
  • M. Hartmann et al.

    Still image processing on coarse-grained reconfigurable array architectures

    J. Signal Process. Syst.

    (2010)
  • B. Stabernack et al.

    A system on a chip architecture of an H.264/AVC coprocessor for DVB-H and DMB applications

    IEEE Trans. Consum. Electron.

    (2007)
  • B. Mei et al.

    Adres&dresc: Architecture and compiler for coarse-grain reconfigurable processors

  • A. Marshall, T. Stansfield, I. Kostarnov, J. Vuillemin, B. Hutchings, A reconfigurable arithmetic array for multimedia...
  • C. Jang, J. Kim, J. Lee, H.-S. Kim, D.-H. Yoo, S. Kim, H.-S. Kim, S. Ryu, An instruction-scheduling-aware data...
  • N.R. Miniskar, R.R. Patil, R.N. Gadde, Y.C.R. Cho, S. Kim, S.H. Lee, Intra mode power saving methodology for cgra-based...
  • H. Eichel

    Customising a processor architecture for multimedia applications

    Electron. Syst. Softw.

    (2003)
  • J.-C. Chu, C.-W. Huang, H.-C. Chen, K.-P. Lu, M.-S. Lee, J.-I. Guo, T.-F. Chen, Design of customized functional units...
  • C.S. Bassoy, H. Manteuffel, F. Mayer-Lindenberg, Sharf: An fpga-based customizable processor architecture, in: 2009...
  • K. Masselos, F. Catthoor, C. E. Goutis, H. DeMan, Low power mapping of video processing applications on vliw multimedia...
  • K. Sanghai, R. Gentile, Multi-core programming frameworks for embedded multimedia applications, 2017....
  • M. Rashid, L. Apvrille, R. Pacalet, Application specific processors for multimedia applications, in: 2008 11th IEEE...
  • Synopsys, 2017....
  • D. Göhringer, J. Becker, High performance reconfigurable multi-processor-based computing on fpgas, in: Parallel...
  • M. Tukel, M. Yalcin, A new architecture for cellular neural network on reconfigurable hardware with an advance memory...
  • N. Yildiz et al.

    Architecture of a fully pipelined real-time cellular neural network emulator

    IEEE Trans. Circuits Syst. I: Reg. Pap.

    (2015)
  • S. Malki et al.

    A CNN-specific integrated processor

    EURASIP J. Adv. Signal Process.

    (2009)

    1

    Anka Microelectronic Systems.
