Tools for code optimization and system evaluation of the image processing system PAPRICA-3

https://doi.org/10.1016/S1383-7621(98)00021-6

Abstract

This paper presents the complex environment that was built to ease the prototyping of real-time applications on the PAPRICA-3 massively parallel system. Applications are developed in C++ using high-level data types, and the corresponding assembly code is automatically created by a code generator. A stochastic code optimizer takes the assembly code and improves it according to a genetic approach; because of the high computational power this approach requires, the stochastic code optimizer was implemented with MPI and runs in parallel on a cluster of workstations. The availability of this environment made it possible to test the performance of the system and to tune it according to some target applications before the actual development of the hardware. For this purpose a system-level simulator was also built to determine the number of clock cycles required to run a specific segment of code. The whole environment has been used to validate possible solutions for the hardware system and to develop, test, and tune several real-time image processing applications. The hardware system is now completely defined.

Introduction

Real-time image processing applications require powerful engines. Many general-purpose processors can nowadays deliver sufficient computational power, but when specific requirements (such as low power consumption, small physical size, and low production cost) must be met, a special-purpose system with a close match between the architecture and the application becomes mandatory.

Many special-purpose architectures have been conceived, designed, and implemented [9], but the design of new systems has not always been accompanied by the development of proper environments to ease the prototyping and tuning of applications. In general, complex hardware systems require either specialized programming skills or highly optimizing compilers and environments: in the former case the user must be aware of all internal details of the machine to be able to exploit its sophisticated characteristics, while in the latter the user programs in high-level languages, leaving to an optimizing compiler or code generator the hard task of writing efficient code. This task becomes even harder when the goal of the project is to reach real-time performance on a custom system.

With the continuously growing possibilities offered by hardware today, concepts such as pipelining, branch prediction, dynamic scheduling, and superscalar architectures are no longer restricted to a small elite of specialists but are spreading into standard, general-purpose environments; even low-cost special-purpose processors are designed around these techniques [11]. Thus, hand optimization of code is not only difficult and time-consuming, but generally produces code that is less efficient than what automatic tools can achieve.

This work presents the programming environment that was developed to ease the prototyping of real-time applications on the PAPRICA-3 system, a special-purpose coprocessor for real-time image processing tasks, developed in cooperation with the Polytechnic of Turin, Italy. To exploit the natural parallelism of low-level image processing, PAPRICA-3 is composed of a large number of processors working in SIMD fashion. Moreover, the internal structure of each processor features a complex pipeline that makes it possible to take advantage of the intrinsic parallelism of a generic program [24]. Thus, both spatial parallelism and instruction-level parallelism [21] are used to boost performance, and both must be taken into account when writing assembly code.

The PAPRICA-3 programming environment has thus been structured to hide these two forms of parallelism. Spatial parallelism, namely the distribution of the data over the set of processing elements, is automatically handled by a high-level language, which allows the user to describe the application using abstract data types; a code generator then builds the corresponding assembly program using a set of parameterized libraries. Conversely, the exploitation of instruction-level parallelism, namely the simultaneous execution of portions of each single assembly instruction, is handled by a code optimizer, which modifies the assembly code produced by the previous tool in order to minimize the number of clock cycles required to run the code.
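As a rough illustration of how abstract data types can drive a single-pass code generator, the following C++ sketch lets overloaded operators on an image type emit one line of pseudo-assembly each. It is only a sketch under stated assumptions: the names `CodeGen`, `Image`, and `generateExample`, the register naming, and the mnemonics `AND`, `OR`, and `NOT` are all hypothetical and are not taken from the PAPRICA-3 class library or instruction set.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical code generator: every name below (Image, CodeGen, the
// mnemonics AND/OR/NOT) is an illustrative assumption, not PAPRICA-3's
// actual class library or instruction set.
class CodeGen {
public:
    // Reserve a fresh virtual register and return its index.
    int newRegister() { return nextRegister_++; }

    // Append one line of pseudo-assembly to the generated program.
    void emit(const std::string& line) { program_.push_back(line); }

    const std::vector<std::string>& program() const { return program_; }

private:
    int nextRegister_ = 0;
    std::vector<std::string> program_;
};

// Abstract data type seen by the user: a whole binary image.
// It holds no pixels; it only names the register that will hold them.
class Image {
public:
    explicit Image(CodeGen& gen) : gen_(&gen), reg_(gen.newRegister()) {}

    int reg() const { return reg_; }

    // Each overloaded operator emits one instruction in a single pass.
    friend Image operator&(const Image& a, const Image& b) {
        return binary("AND", a, b);
    }
    friend Image operator|(const Image& a, const Image& b) {
        return binary("OR", a, b);
    }
    friend Image operator~(const Image& a) {
        Image result(*a.gen_);
        a.gen_->emit("NOT r" + std::to_string(result.reg_) +
                     ", r" + std::to_string(a.reg_));
        return result;
    }

private:
    static Image binary(const char* op, const Image& a, const Image& b) {
        assert(a.gen_ == b.gen_ && "operands must share one generator");
        Image result(*a.gen_);
        a.gen_->emit(std::string(op) + " r" + std::to_string(result.reg_) +
                     ", r" + std::to_string(a.reg_) +
                     ", r" + std::to_string(b.reg_));
        return result;
    }

    CodeGen* gen_;
    int reg_;
};

// Illustrative helper: build "(a & ~b) | c" and return the emitted code.
std::vector<std::string> generateExample() {
    CodeGen gen;
    Image a(gen), b(gen), c(gen);   // occupy registers r0, r1, r2
    Image out = (a & ~b) | c;
    (void)out;
    return gen.program();
}
```

Calling `generateExample()` yields three lines of pseudo-assembly, one per operator, in the order in which the sub-expressions are evaluated. Because the code is emitted while the expression is being evaluated, the generator never revisits earlier instructions, which is why a separate optimization step on the resulting assembly is useful.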

Pipelining techniques were first addressed at least 25 years ago [20], and since then a great variety of algorithms has been studied and implemented to improve code efficiency. Almost all of them rely on deterministic approaches, which lead to NP-complete problems [10]. In this work, by contrast, a stochastic approach is considered: unlike other optimization processes, a “population” of modified versions of the original program evolves according to a genetic methodology. This approach makes it possible to achieve higher optimization levels, but it requires more computational resources (in terms of both memory size and CPU time). To meet these requirements, the optimizer has been implemented on a cluster of workstations using standard MPI (Message Passing Interface [19]) libraries.
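The idea of evolving a population of reordered programs can be sketched in a few lines of C++. This is a minimal, serial illustration and not the optimizer described in the paper: the three-address `Instr` format, the one-stall cost model in `cycles`, the adjacent-swap mutation, and the simple keep-the-better-half selection are all assumptions made for the example, and no MPI distribution or crossover is shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical three-address instruction: dst = op(src1, src2).
struct Instr { int dst, src1, src2; };
using Program = std::vector<Instr>;

// True if b reads or overwrites something a writes, or overwrites
// something a reads (RAW, WAW, WAR): such a pair must keep its order.
bool dependent(const Instr& a, const Instr& b) {
    return b.src1 == a.dst || b.src2 == a.dst ||   // read after write
           b.dst == a.dst ||                       // write after write
           b.dst == a.src1 || b.dst == a.src2;     // write after read
}

// Assumed cost model: one cycle per instruction, plus one stall cycle
// when an instruction reads the result of its immediate predecessor.
int cycles(const Program& p) {
    int total = 0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        ++total;
        if (i > 0 && (p[i].src1 == p[i - 1].dst || p[i].src2 == p[i - 1].dst))
            ++total;
    }
    return total;
}

// Mutation: swap one random pair of adjacent, independent instructions,
// so every individual stays semantically equivalent to the original.
void mutate(Program& p, std::mt19937& rng) {
    if (p.size() < 2) return;
    std::uniform_int_distribution<std::size_t> pick(0, p.size() - 2);
    std::size_t i = pick(rng);
    if (!dependent(p[i], p[i + 1])) std::swap(p[i], p[i + 1]);
}

// Evolve a population of reorderings; return the cheapest one found.
Program optimize(const Program& original, int populationSize,
                 int generations, unsigned seed) {
    assert(populationSize >= 2);
    std::mt19937 rng(seed);
    std::vector<Program> population(
        static_cast<std::size_t>(populationSize), original);
    auto cheaper = [](const Program& a, const Program& b) {
        return cycles(a) < cycles(b);
    };
    for (int g = 0; g < generations; ++g) {
        // Selection: the better half survives unchanged ...
        std::sort(population.begin(), population.end(), cheaper);
        std::size_t half = population.size() / 2;
        // ... and the worse half is replaced by mutated survivors.
        for (std::size_t i = half; i < population.size(); ++i) {
            population[i] = population[i - half];
            mutate(population[i], rng);
        }
    }
    return *std::min_element(population.begin(), population.end(), cheaper);
}
```

Since the best individual of each generation is never mutated, the returned program never costs more cycles than the input under this cost model. Evaluating `cycles` for every individual in every generation dominates the running time, which suggests why distributing the population over several workstations is attractive.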

This work is organized as follows: Section 2 briefly introduces the PAPRICA-3 system; Section 3 presents the programming environment and the C++ classes that model the new data types, and introduces the code optimization tool; Sections 4 and 5 describe how the stochastic optimizer works and how performance is evaluated in the absence of the real hardware; Section 6 describes the implementation of each logical part of the environment; Section 7 presents a case study and describes possible future extensions of the environment; Section 8 ends the paper with some concluding remarks.

Brief overview of the PAPRICA-3 system

PAPRICA-3 [7] is a massively parallel system dedicated to running real-time image processing tasks. The system specifications are now completely defined, and the ICs are currently under fabrication; the hardware system, a PCI board that will be connected to a standard PC, will be available in early 1998.

The programming environment

The intrinsic complexity of assembly language, especially when combined with the SIMD computational model, results in low programming efficiency; a high-level programming language, together with a software tool that generates the corresponding assembly code, is therefore needed to ease the development of software applications. This section gives an overview of the approach, describes the classes available to the user, and discusses the optimization problems that must be faced.

Assembly code optimization

Since the code generator produces the assembly code in a single pass, a backward optimization is not possible. Thus a tool for the optimization of the assembly code has been developed.

The efficient use of a pipelined processor relies mainly on a specific ordering of the program instructions (known as instruction scheduling), aimed at maximizing the number of elementary operations executed simultaneously by the different pipeline stages. A lot of different approaches have been

Performance evaluation

Due to the complex internal structure of each PE, the execution time of a given program cannot be predicted [13], [14] without a simulation or a real run. Since the hardware system is not yet available, a tool for the performance evaluation of a general pipelined processor was developed: PiPE, Pipeline Performance Evaluator. The development of a simulator of a general pipelined processor made it possible to dynamically modify the internal architecture of the system and to test and evaluate

Implementation

This section describes the implementation issues that have been addressed in the development of the whole environment.

For portability, the entire code has been written using standard programming languages and tools, such as ANSI C, C++, LEX, YACC, and MPI libraries, and runs both on UNIX-based machines (SunOS and Linux) and in DOS-based environments (MS-DOS and Windows).

An example

A number of different programs has been developed, tested, and optimized using the PAPRICA-3 operating environment; nevertheless, hereinafter we will refer to a specific example: an image processing algorithm for the detection of road markings in images acquired from a moving vehicle [5].

In this case the execution time plays a basic role, and thus code optimization becomes mandatory. The algorithm is based on a morphological filter which enhances the road markings, and on an adaptive threshold

Conclusions

In this paper the operating environment of the PAPRICA-3 system has been described.

The programming environment, based on C++, has proven to be very effective, since the user does not have to learn a new language, but programming becomes an easy task thanks to the use of overloaded operators acting on new C++ classes. Moreover since the system-level simulator delivers remarkable performance, an extremely efficient prototyping, testing, and tuning of applications becomes possible. On the other

Acknowledgements

This work was partially supported by the Italian CNR in the framework of the Progetto Finalizzato Trasporti 2.

References (24)

  • A. Aho, R. Sethi, J. Ullman, Compilers: Principles, Techniques, and Tools, Addison–Wesley, Reading, MA,...
  • S.J. Beaty, S. Colcord, P.H. Sweany. Using genetic algorithms to fine-tune instruction-scheduling heuristics, in:...
  • G. Booch, Object-Oriented Analysis and Design-with Applications, Benjamin/Cummings, Menlo Park, CA,...
  • M.J. Bourke, III, P.H. Sweany, S.J. Beaty, Extending list scheduling to consider execution frequency, in: Proceedings...
  • A. Broggi, The evolution of a massively parallel vision system for real-time automotive image processing, in:...
  • A. Broggi, Global communications on a linear array architecture, Journal of Parallel Algorithms and Applications (1997)
  • A. Broggi, G. Conte, G. Burzio, L. Lavagno, F. Gregoretti, C. Sansoè, L.M. Reyneri, PAPRICA-3: A real-time...
  • A. Broggi et al., The evolution of the PAPRICA system (Special Issue on Massively Parallel Computing), Integrated Computer-Aided Engineering Journal (1997)
  • A. Broggi, F. Gregoretti, in: A. Broggi, F. Gregoretti (Eds.), Editorial: Special-Issue on Special-Purpose...
  • D.J. DeWitt, A machine independent approach to the production of optimal horizontal microcode, Ph.D. Thesis, Dept. of...
  • F. Gregoretti, F. Intini, L. Lavagno, R. Passerone, L.M. Reyneri, Design and implementation of the control structure of...
  • R.M. Haralick et al., Image analysis using mathematical morphology, IEEE Transactions on Pattern Analysis and Machine Intelligence (1987)

Alberto Broggi has been Assistant Professor at the Dipartimento di Ingegneria dell'Informazione of the University of Parma since 1994. He received both the Dr.Ing. (Master) degree in Electronic Engineering (1990) and the PhD degree in Information Technology (1994) from the same University. His research interests include real-time computer vision algorithms for the navigation of unmanned vehicles, and the development of low-cost computer systems to be used on autonomous robots. He is the coordinator of the ARGO project, aimed at the design, development, and testing of the ARGO autonomous prototype vehicle of the University of Parma. He is the author of more than 100 refereed publications in international journals, book chapters, and conference proceedings. He is the Editor of the Newsletter and a member of the Executive Committee of the IEEE Technical Committee on Complexity in Computing, and a member of the Editorial Board of Real-Time Imaging Journal and Engineering Applications of Artificial Intelligence Journal. He has guest edited a number of special issues of international journals on Machine Vision (IEEE Intelligent Systems, Image and Vision Computing, Real-Time Imaging, Engineering Applications of Artificial Intelligence) and has been invited to organize several minitracks and special sessions in international conferences (IEEE Intl Conf on Intelligent Transportation Systems, IEEE Symp on Intelligent Vehicles, IEEE Intl Conf on Algorithms And Architectures for Parallel Processing, SPIE Aerosense, Intl Symp on Automotive Technology and Automation, Hawaii Intl Conf on System Sciences). He has served on the Program Committee and as Publicity Chair and Tutorial Chair of many major conferences.

Massimo Bertozzi was born in Parma in 1966. In 1994 he received the Dr.Ing. degree in Electronic Engineering from the University of Parma discussing a Master Thesis about the implementation of Simulation of Petri Nets on the CM-2 Massive Parallel Architecture. From November 1994 to October 1997 he was a PhD student in information technology at the Dipartimento di Ingegneria dell'Informazione of the University of Parma where he chaired the local IEEE student branch. During this period his research interests focused mainly on the application of image processing to real-time systems and to vehicle guidance, on the optimization of machine code at assembly level, and on parallel and distributed computing. Since November 1997 he has held a permanent position at the Dipartimento di Ingegneria dell'Informazione.
