# 3D Optoelectronic Fix Point Unit and Its Advantages Processing 3D Data<sup>\*</sup>

B. Kasche, D. Fey, T. Höhn, and W. Erhard

Friedrich Schiller University Jena, Faculty of Mathematics and Computer Science, Department for Computer Architecture and Communication, Ernst Abbe Platz 1-4, 07743 Jena, Germany. Phone: +49 3641 946373, Fax: +49 3641 946372, kasche@informatik.uni-jena.de

**Abstract.** In this paper we present the design of a three-dimensional optoelectronic hardware approach for realizing a fix point processing unit. To this end we outline the main ideas of the underlying low level algorithms. We introduce several design concepts and evaluate them with regard to the highest throughput. Finally, we focus on an application of our 3D approach, namely an algorithm for volume rendering of medical image sets.

## 1 Introduction

Optics is expected to become one of the most important components of computing hardware in the near future. This expectation is motivated by the problems that arise when purely electronic hardware is used for data processing with a high demand on communication.

Physics imposes several consequences on purely electronic chip fabrication. The so-called MOORE's law [1] states that the transistor density doubles every 12 to 18 months. It follows that the number of transistors that can be integrated into one chip increases steadily while the transistor switching time decreases. Thus the chip clock rate rises and more logic can be integrated within the same silicon area.

RENT's rule [2] states that the number of pins necessary for input and output grows as a power of the number of logic gates, and hence of transistors, on the chip. The resulting pin limitation problem is the imbalance between the quadratic growth of the relative chip area and the merely linear growth of the number of pins of a chip.

In summary, with purely electronic means the large number of fast communication channels is hard to manage. Optics is expected to help overcome these problems.

To this end, optics and electronics have to form a synergetic union, and the chosen algorithms have to be well adapted to the hardware. This is our way to achieve high performance computing.

<sup>\*</sup> This project is supported by a grant of the DFG (Deutsche Forschungsgemeinschaft).

P. Amestoy et al. (Eds.): Euro-Par'99, LNCS 1685, pp. 1005-1012, 1999.

<sup>©</sup> Springer-Verlag Berlin Heidelberg 1999

In our approach we have designed an arithmetic logical unit. It comprises an integer unit and a fix point unit, which together allow standard functions to be calculated for data given in a floating point representation. By using optics we can employ the third dimension for data processing.

In this paper we focus on the fix point unit. We have developed several approaches based on the so-called *BICDIC* (<u>bit</u> <u>c</u>ompletion <u>digital</u> <u>c</u>omputer) and *CORDIC* (<u>co</u>ordinate <u>r</u>otation <u>digital</u> <u>c</u>omputer) algorithms, which belong to the class of shift-and-add algorithms. We developed these low level algorithms further and were able to condense 8 of them into one uniform structure.

Out of all developed concepts we want to determine the most efficient processing method. The concepts differ in the kind of data processing: we used a bit serial method, a bit parallel method and a method using a redundant number representation. The redundancy makes it possible to add any two numbers in constant time, independent of the word length.

The synergetic design of optics, electronics and, last but not least, the low level algorithms has to be extended to the application algorithms as well. We call these high level algorithms. As an example we introduce algorithms that are needed for 3D medical image processing, and we show that our approach is able to process three-dimensional data sets.

## 2 Synergy of Optics and Electronics

So-called *Smart Pixel* elements can lead to a solution of the communication problems that arise with purely electronic hardware. A Smart Pixel combines optical and electronic signal processing: it has an optical input and an optical output for communication, and an electronic processing unit like that of common VLSI chips. Figure 1 illustrates this setup.

However, the electronic processing unit is not as complex as commercially available chips. A theoretical study [3] shows that there is a certain small size of such a Smart Pixel that ensures the highest efficiency. In general the following holds: the smaller a Smart Pixel is, the more efficient the communication.

Thus the main task in designing an optoelectronic chip with Smart Pixels is to find the best ratio between the chip area needed for electronic processing and the chip area needed for the optical receivers and transmitters. In the following sections we estimate which hardware approaches calculate standard functions most efficiently.

## 3 Arithmetic Logical Unit

Our aim was an arithmetic logical unit (ALU) realized with an optoelectronic approach in order to overcome the communication problems mentioned above. In this paper we focus on the fix point unit, which is used in our ALU in conjunction with an integer unit. There are a memory and an input/output unit as well. All of them are partially controlled by the control unit.

All units of each node consist of several processing elements. A certain number of processing elements form one pipeline, which implements the function of the respective unit. Thus parallel processing is realized not only by using several nodes but also by using several pipelines within each unit. To achieve fully synchronized processing of different data, each pipeline must have the same timing behaviour. Consequently we obtain a SIMD-like structure, in which the single instruction is a floating point instruction. That is, each pipeline may calculate a different standard function, but within the same time window. In this sense we have a weaker form of SIMD structure.

## 4 Algorithm

In our arithmetic logical unit we have to calculate standard functions, for which several methods exist. Since we aim at a Smart Pixel based approach, we want to use only simple operations. Ultimately we are looking for an algorithm that is well matched to the hardware and vice versa.

Approaches based on table look-up, power series or restoring algorithms consume too much area and are not as uniform as a Smart Pixel based approach requires.

While looking for further algorithms to realize standard functions we came across the so-called *BICDIC and CORDIC algorithms*. These require only simple operations.

All these algorithms are iterative procedures. We start with a triple  $(x_0, y_0, z_0)$ . In each iteration a specific transformation rule maps the current triple to its successor triple.

If the CORDIC [4] algorithms are used, one obtains a more uniform structure from the beginning, but some functions require additional procedures [5]. We were able to condense 8 different functions into one uniform scheme, so that the same hardware can always be used with only slight modifications. We can calculate the logarithm, exponential, square root, multiplication and division as well as the sine, cosine and arctangent functions.

All developed and adapted algorithms for the 8 standard functions can be found in the literature [6].
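To illustrate how such an iteration on a triple $(x_i, y_i, z_i)$ works, the following minimal Python sketch implements the textbook circular CORDIC in rotation mode for sine and cosine. It is not the adapted scheme of [6]; the fixed-point format `FRAC_BITS`, the iteration count `ITERATIONS` and all function names are assumptions chosen for illustration. Each iteration uses only a shift, an addition or subtraction, and a sign-based selection, plus one table value.

```python
import math

FRAC_BITS = 30    # assumed fixed-point format: value = integer / 2**FRAC_BITS
ITERATIONS = 30   # assumed number of iterations (one pipeline stage each)

# Table values arctan(2**-i), stored in the assumed fixed-point format.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * (1 << FRAC_BITS))
              for i in range(ITERATIONS)]

# The iterations stretch the vector by a constant gain; starting with
# x0 = 1/gain removes the need for a final multiplication.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))
X_START = round((1.0 / GAIN) * (1 << FRAC_BITS))


def cordic_sin_cos(angle):
    """Return (sin, cos) of angle in radians for |angle| <= pi/2."""
    x, y = X_START, 0
    z = round(angle * (1 << FRAC_BITS))
    for i in range(ITERATIONS):
        if z >= 0:
            # SELECT: rotate by +arctan(2**-i); SHIFT and ADD only.
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:
            # SELECT: rotate by -arctan(2**-i).
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    scale = float(1 << FRAC_BITS)
    return y / scale, x / scale
```

For an argument such as `cordic_sin_cos(0.5)` the returned pair agrees with the library sine and cosine up to roughly the resolution of the assumed format. The loop body is identical in every iteration, which is the property that makes this class of algorithms attractive for a uniform pipeline of simple processing elements.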

Well adapted means that we use only the simple operations SHIFT, ADD and SELECT. Thus we have optimal starting conditions for designing a Smart Pixel based approach. We have pursued two different ways.

First we investigated a so-called multi chip version and evaluated its performance. This was followed by a single chip investigation.

## 5 Multi Chip Approach

We started by designing a multi chip version in order to estimate the performance of a system that may require sophisticated technological equipment to be built. This means we would have to design 4 different chips, each with a different functionality, and stack them in a 3D setup. Figure 2 illustrates the composition.





Fig. 1. Setup of a Smart Pixel

Fig. 2. Setup of the 3d multi chip approach

A pipeline is built by connecting all modules adjacent to each other. The largest module determines the overall extent of a pipeline. For instance, if one adder covers the whole chip area of the add module, only one pipeline can be realized.

We assumed that the overall extent of one pipeline is determined by the largest pipeline stage, i.e. by the adder. Therefore we considered different approaches that use different methods to perform the addition.

To determine the best ratio between computing time and chip area we designed a *bit serial*, a *bit parallel* and a *bit redundant* method. For each method we determined the required chip area and the computing performance with the aim of maximizing the throughput.
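The constant-time addition offered by a redundant number representation can be illustrated with the carry-save form, in which a value is kept as a pair of words whose sum is the represented number. This is one common redundant representation and is used here as an assumption; the sketch does not claim to reproduce the representation used in the bit redundant method of this paper. The names `carry_save_add` and `resolve` are illustrative.

```python
def carry_save_add(s, c, a):
    """Add integer a to the redundant pair (s, c), which represents s + c.

    Every output bit depends only on the three input bits at the same
    position, so no carry ripples through the word. In hardware the delay
    is that of a single full adder, independent of the word length.
    """
    new_s = s ^ c ^ a                              # bitwise sum without carries
    new_c = ((s & c) | (s & a) | (c & a)) << 1     # saved carries, one position up
    return new_s, new_c


def resolve(s, c):
    """Convert the redundant pair into an ordinary integer.

    This final step needs a carry-propagating addition and is done once,
    after all constant-time additions are finished.
    """
    return s + c


# Accumulate several operands in redundant form, resolve at the end.
s, c = 0, 0
for operand in (13, 250, 7, 1024):
    s, c = carry_save_add(s, c, operand)
total = resolve(s, c)
```

The price of the constant addition time is the doubled storage per number and the extra logic per bit, which is why the redundant method needs more chip area per pipeline than the other two methods.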

The throughput  $\Theta$  is determined by the number of pipelines working in parallel (#Pipes.Chip), the number of steps (s) and the clock period ( $\Delta t$ ), i.e. the inverse of the chip clock rate. The number of pipelines in turn is given by the ratio of the whole chip area to the area occupied by one complete pipeline.

$$\Theta = \frac{\# \text{Pipes.Chip}}{\Delta t \cdot s} = \frac{\frac{A_{\text{Chip}}}{A_{\text{pipeline}}}}{\Delta t \cdot s} \tag{1}$$

The performance was determined with respect to the given technological parameters.
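Formula (1) can be evaluated directly. The following Python sketch does so; the function and parameter names are illustrative, $\Delta t$ is interpreted as the clock period in seconds, and the numbers in the usage line are hypothetical placeholders, not the technological parameters of our design.

```python
def throughput(chip_area, pipeline_area, clock_period, steps):
    """Evaluate formula (1): finished results per second for one chip.

    chip_area and pipeline_area must share one unit (for example mm^2),
    clock_period is given in seconds, steps is the number of steps s.
    """
    if pipeline_area <= 0 or clock_period <= 0 or steps <= 0:
        raise ValueError("pipeline_area, clock_period and steps must be positive")
    pipelines = chip_area / pipeline_area    # #Pipes.Chip in formula (1)
    return pipelines / (clock_period * steps)


# Hypothetical placeholder values, chosen only to show the call.
example = throughput(chip_area=100.0, pipeline_area=2.0,
                     clock_period=1e-8, steps=5)
```

The formula makes the trade-off explicit: a method with a smaller pipeline area fits more pipelines onto the chip, which can compensate for a larger number of steps per result.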

The throughput depends on the applied word length. With a word length of 32 bit, a maximum of about 35 giga operations per second could be performed. Giga operations per second means  $10^9$  finished calculations per second out of the 8 realized standard functions.

At the moment it is quite difficult to build even one optoelectronic chip, let alone four. It can hardly be justified to build a large system just to show technical feasibility in principle. That is why we investigated a single chip approach as well. Here we cannot expect as much performance as from the multi chip approach, but it may actually lead to a realizable optoelectronic chip.

## 6 Single Chip Approaches

As for the multi chip approach, we have realized 3 different processing methods.

In all approaches one chip realizes the functionality and another one provides the table values. The first chip holds all the pipelines, each of which corresponds to one calculation. Because different methods are employed for the addition, the pipelines require different chip areas. Each iteration is calculated by the corresponding pipeline stage. Each time one iteration is finished, i.e. one calculation step is done, the result is handed over to the neighbouring stage in the north.

For bit serial processing, one processing element realizes the whole functionality of one iteration. That is why we expected this method to allow the largest number of pipelines next to each other on the chip.
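Why a bit serial processing element can be so small is illustrated by the following Python sketch of a bit serial addition, which is a textbook scheme and not the VHDL description used in this work. A single full adder and one stored carry bit are reused for every bit position, so the area is independent of the word length while the number of clock steps grows with it. The function name and the unsigned `width`-bit format are assumptions.

```python
def bit_serial_add(a, b, width):
    """Add two unsigned width-bit integers, one bit per clock step.

    Returns the width-bit sum and the final carry. The loop body models
    the single full adder, the variable carry models the carry flip-flop.
    """
    carry = 0
    result = 0
    for step in range(width):               # least significant bit first
        bit_a = (a >> step) & 1
        bit_b = (b >> step) & 1
        sum_bit = bit_a ^ bit_b ^ carry
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))
        result |= sum_bit << step
    return result, carry
```

With a word length of 8 bit, adding 200 and 100 overflows the word: the call `bit_serial_add(200, 100, 8)` returns the low 8 bits of the sum together with a set carry.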

The behaviour of the 3 approaches mentioned above was described in a hardware description language (VHDL)<sup>1</sup>, synthesized into a gate layout and finally into a transistor layout. The transistor layout is based on a  $0.8\mu m$  CMOS process of the AMS company [7]. Thus we could determine the number of necessary transistors and the chip area as well as the critical path length, from which the maximum chip clock rate follows. All of these parameters were taken into consideration when we evaluated the performance.

## 7 Performance Evaluation of the Single Chip Approaches

Our aim was maximizing the throughput when calculating function values of standard functions. So we determined the throughput using formula (1) again.

We have found the bit serial approach to be the most efficient one with respect to our aim of maximized throughput. What does this superiority mean? We could see that the gain in throughput achieved by parallel processing is outweighed by the larger area that the processing elements then require.

Since the bit serial approach turned out to be the best one, we have worked out the setup of the future hardware solution in more detail. The two necessary chips communicate optically. The left one provides the necessary table values and the right one calculates the results of each iteration. The results are obtained at the top. With 60 pipelines we get 60 results in each clock cycle.

Knowing all the logical and technological parameters we were able to determine the real performance of the two chips working in a bit serial way. The performance is shown in Fig. 3. Our system outperforms existing signal processors, but not a supercomputer, which was not our aim. We should also mention that the digital signal processors used for comparison are not designed for maximized throughput. Such a chip has to calculate a single function value quickly as well, which becomes a disadvantage if function values have to be calculated in large numbers. If we had to design a chip that calculates single function values as fast as possible, we would have chosen the bit redundant approach, which is the fastest one in terms of a single calculation.

<sup>1</sup> This was done in cooperation with the University of Erlangen.



Fig. 3. Real performance of the bit serial chip approach in comparison to other existing fix point units

## 8 Application

As we have seen in the last section, we outperform existing systems, even with the single chip approach, when there is a high demand for calculating function values. We have therefore continued our investigations in the field of 3D image processing. Here we assign to each voxel (the volume counterpart of a pixel) one calculation unit, i.e. one pipeline of the approaches described above. Thus the computing task matches the structure of our future hardware. Consequently the neighbours of each voxel can be accessed fairly easily within a one-step or a two-step communication, see Fig. 4.

One application for our hardware could be volume rendering or artificial lighting. Here we have the volume data set and some light sources. There is a starting point, called  $x_0$ , and a given direction  $\boldsymbol{\omega}$  with the scaling  $s \in [0, 1]$  (see Fig. 5).

Fig. 4. The accessible neighbours using one or two steps to communicate

Fig. 5. General setup of a volume rendering scheme

The light intensity at one voxel within the 3D data set depends on the initial light intensity at position zero in the direction  $\boldsymbol{\omega}$ ,  $I(0, \boldsymbol{\omega})$ , and on the sum of all the light coming from the source,  $J(s', \boldsymbol{\omega}')$ , and from all the points in between. This light is determined by the extinction  $(\kappa)$ , i.e. absorption, emission and scattering, the optical depth  $(\tau)$  and the optical density  $(\varrho)$  [8], so that finally (2) holds.

$$I(s,\boldsymbol{\omega}) = I(0,\boldsymbol{\omega}) \cdot e^{-\tau(0,s)} + \int_{0}^{s} J(s',\boldsymbol{\omega}') \cdot \kappa \cdot \varrho(s') \cdot e^{-\tau(s',s)} \, ds' \tag{2}$$

One can see that the radiance is integrated and weighted by the exponential function. The scattering is determined using the sine and cosine functions, and finally a multiplication is required. The whole algorithm was described in an abstract description language and simulated on a MASPAR [7] multiprocessor system. We found that each Smart Pixel processing element needs 8 registers and uses six of the eight elementary functions mentioned in Sect. 4.
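A discrete evaluation of (2) along one ray can be sketched as follows in Python. The sketch assumes that the optical depth is $\tau(a,b)=\int_a^b \kappa\,\varrho(t)\,dt$, which is the usual definition but is not stated explicitly above, and it uses a simple first-order step rule. Scattering, and hence the sine and cosine evaluations, is left out. The function name and the sampling of $J$ and $\varrho$ at equidistant points are illustrative assumptions.

```python
import math


def integrate_ray(initial_intensity, source, density, kappa, step):
    """Approximate I(s, omega) of equation (2) by marching along one ray.

    source[k] and density[k] are samples of J and rho at the k-th position
    on the ray, kappa is the extinction coefficient and step is the
    distance between two samples.
    """
    intensity = initial_intensity
    for j, rho in zip(source, density):
        d_tau = kappa * rho * step          # optical depth of this segment
        # Attenuate the light collected so far, then add the contribution
        # of the current segment (first-order approximation).
        intensity = intensity * math.exp(-d_tau) + j * d_tau
    return intensity
```

Unrolling the loop gives the initial intensity damped by the total optical depth plus every source sample damped by the optical depth between its position and the end of the ray, which is the structure of (2). Per sample the sketch needs one exponential, a few multiplications and one addition, so each voxel's update maps onto the elementary functions of a single pipeline.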



Fig. 6. Screenshot of the X11 application processing volume data

In order to verify the correctness of the algorithm manipulating 3D data sets we have designed a simulation tool. Fig. 6 shows a screenshot of the X11 application.

Further applications are 3D image rotation and image correlation. There the required amount of calculation is so large that common systems still need more computing time than real time processing would allow.

## 9 Summary and Outlook

The major problems in current VLSI design are the restrictions on both the number of available pins and the off-chip communication speed. The current approach of increasing the integration density of VLSI chips perpetuates these problems and even aggravates them. For physical reasons it is very difficult to achieve off-chip communication at speeds in the same range as on-chip communication. Optoelectronic 3D circuits based on Smart Pixel technologies offer a solution in principle for the problems mentioned above.

In this paper we presented a multi chip approach based on Smart Pixel technology as well as a single chip solution. We determined the necessary chip areas by describing the hardware in a hardware description language and synthesizing the descriptions into transistor layouts. Using this information we were able to determine the performance in terms of throughput. Our system outperforms existing systems, but still has less computing power than supercomputers. On the other hand, it does not need the high hardware expenditure of common supercomputers.

The most striking result was the superiority of the bit serial approach over the bit parallel method using a conditional sum adder, and even over a method that adds two numbers using a redundant number representation.

We outlined the theory of the BICDIC and CORDIC algorithms because an adapted form of these algorithms was applied in our approaches.

We concluded the paper by presenting a volume rendering application in the field of medical image processing.

## References

- [1] G.E. Moore. Some Personal Perspectives on Research in the Semiconductor Industry. In A. Rosenbloom, S. Richard, and J.W. Spencer, editors, *Engines of Innovation*, pages 165–174. Harvard Business School Press, 1996.
- [2] D. K. Ferry, L. A. Akers, and E. W. Greeneich. Ultra Large Scale Integrated Microelectronics. Prentice Hall, Englewood Cliffs, New Jersey, 1988.
- [3] D. Fey and W. Erhard. Algorithms for High–Performance Computing with Smart Pixels. In G.A. Lampropoulos et al., editors, *Applications of Photonic Technology*, pages 97–100, New York, 1995. Plenum Press.
- [4] J. Walther. A unified algorithm for elementary functions. In *Joint Computer Conference Proc.*, volume 38, 1971.
- [5] Jean Duprat and Jean-Michel Muller. The CORDIC Algorithm: New Results for Fast VLSI Implementation. *IEEE Transactions on Computers*, 42(2):168–177, February 1993.
- [6] D. Fey, B. Kasche, C. Burkert, and O. Tschäche. A specification for a reconfigurable optoelectronic VLSI processor suitable for digital signal processing. *Applied Optics*, 37(2):284–295, January 1998.
- [7] The mention of brand names in this paper is for information purposes only and does not constitute an endorsement of the product by the authors or their institutions.
- [8] S. Chandrasekhar. Radiative Transfer. Oxford University Press, Dover, N.Y., 1960.