A novel implementation of radix-4 floating-point division/square-root using comparison multiples

doi:10.1016/j.compeleceng.2008.04.013

Computers & Electrical Engineering

Volume 36, Issue 5, September 2010, Pages 850-863

Abstract

A new implementation of minimally redundant radix-4 floating-point SRT division/square-root (div/sqrt), with the recurrence in signed-digit format, is introduced. The implementation is based on the comparison multiples idea. In the proposed approach, the magnitude of the quotient (root) digit is calculated by comparing the truncated partial remainder with two limited-precision multiples of the divisor (partial root). The digit sign is determined from the polarity of the truncated partial remainder. A timing evaluation using logic synthesis (Synopsys DC with the Artisan 0.18 μm typical library) shows a latency of 2.5 ns for the recurrence of the proposed div/sqrt. This is less than that of the conventional implementation.
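As a rough illustration of the selection rule described above, the following Python sketch runs a minimally redundant radix-4 recurrence in which the digit magnitude comes from two comparisons and the digit sign from the sign of the shifted partial remainder. It is not the paper's circuit: it uses exact rational arithmetic instead of truncated operands, and the thresholds d/2 and 3d/2, the initialisation w[0] = x/4, and the names `select_digit` and `srt4_divide` are assumptions chosen for illustration.

```python
from fractions import Fraction


def select_digit(shifted_w, d):
    """Pick a digit in {-2, ..., 2} from the shifted partial remainder 4*w.

    Assumption: the two comparison multiples are d/2 and 3d/2, which lie
    inside the overlap regions of the radix-4 selection intervals
    (redundancy factor 2/3).
    """
    magnitude_input = abs(shifted_w)
    # Two independent comparisons; their sum is the digit magnitude.
    at_least_half = magnitude_input >= d / 2
    at_least_three_halves = magnitude_input >= 3 * d / 2
    magnitude = int(at_least_half) + int(at_least_three_halves)
    # The sign follows the polarity of the shifted partial remainder.
    return -magnitude if shifted_w < 0 else magnitude


def srt4_divide(x, d, iterations):
    """Approximate x/d for 1/2 <= d < 1 and 0 <= x < 1.

    Returns (quotient, final_remainder, digits).
    """
    x, d = Fraction(x), Fraction(d)
    w = x / 4  # assumed initialisation so that |w[0]| <= 2d/3
    q = Fraction(0)
    digits = []
    for j in range(1, iterations + 1):
        shifted = 4 * w
        digit = select_digit(shifted, d)
        w = shifted - digit * d  # w[j] = 4*w[j-1] - q_j*d
        q += Fraction(digit, 4 ** j)
        digits.append(digit)
    # Undo the initial scaling by 1/4.
    return 4 * q, w, digits


if __name__ == "__main__":
    quotient, remainder, digits = srt4_divide(Fraction(3, 4), Fraction(5, 8), 10)
    print(float(quotient), digits)
```

In this sketch the partial remainder stays within 2d/3 in magnitude at every step, so after n iterations the returned quotient differs from x/d by at most (8/3)·4^(-n). A hardware version would replace the exact comparisons with comparisons on truncated operands, which is where the paper's limited-precision multiples come in.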

Introduction

To achieve high performance in carrying out massive mathematical computations, almost all recent microprocessors and digital signal processors perform all four fundamental arithmetic operations in hardware, namely addition, subtraction, multiplication and division [1]. A study of processor architectures and implementations reveals that, of the four operations, division is not performed as fast as addition, subtraction and multiplication [2].

In 1994, Intel lost $475 million due to an error in the division part of the Pentium microprocessor’s floating-point unit [3], [4]. This fiasco highlights that the algorithms, architectures and realisations proposed for division are still immature and require more investigation and attention, especially when designing modern high-performance processors.

The general objective of this work is to develop an algorithm for more robust radix-4 floating-point (FP) SRT division with less chance of error in implementation. The FP algorithm is then modified so that it can also be used for FP sqrt. In this work, a more efficient implementation of radix-4 FP SRT div/sqrt is achieved by increasing the parallelism among the functional units. Employing more efficient functional modules with higher concurrency among them yields a faster circuit for radix-4 FP SRT div/sqrt. Delay estimations carried out using the method of logical effort [5] and logic synthesis show a considerable decrease in execution time with respect to conventional implementations.

Background

Some surveys [6], [2] show that most VLSI implementations of FP division are based on the digit-recurrence division algorithms known as SRT (the SRT division algorithm was introduced independently by Sweeney [7], Robertson [8] and Tocher [9]). SRT division is an iterative algorithm with linear convergence toward the quotient. In this algorithm, the quotient digit selection (QDS) function calculates a fixed number of quotient bits in every iteration. The speed of an FP SRT divider is mainly determined by

Comparison multiples QDS

Although the latest techniques for implementing the QDS function, which help increase the speed of FP SRT division, are well studied in the literature, the comparison multiples method [10] is still not seriously considered by designers. For example, Ercegovac and Lang [10] claim:

Since the resulting implementation is still complicated because of the need for the multiples (of the divisor) and the comparisons, we develop the following alternative, which has more flexibility and results in a

Radix-4 SRT division recurrence

Traditionally, the recurrence of radix-r FP SRT division is based on the scheme shown in Fig. 2. However, this conventional structure can be altered to achieve less delay. For radix-r FP SRT division, Nikmehr [28] introduces a recurrence that is optimised for the QDS function displayed in Fig. 4. In dividers that use the QDS function shown in Fig. 3, a factor generator precalculates ±d and ±2d, and the quotient digit $q_{j+1}$, which is determined by the QDS function, selects the

Comparison multiple sqrt

In Section 5, SRT FP sqrt and the issues related to its implementation are discussed. Because the two operations share the same characteristics, $s = \sqrt{x}$ can be computed with digit recurrence algorithms in the same way as the division $q = \frac{x}{d}$ [10]. This implies that, with minor changes, the comparison multiple technique developed for SRT FP division can be used for implementing SRT FP sqrt. To keep the discussion concise, this section covers only the changes that need to be applied to Sections 3 Comparison multiples QDS, 4 Radix-4

Combined division and sqrt

The IEEE 754 standard requires designers to implement both division and sqrt in the FP units of microprocessors [11]. Given the number of similarities between the recurrences of these two operations, it is common to implement a combined circuit that performs both. Examples of such implementations can be found in [29], [30].

Comparing the specifications of FP SRT division and sqrt reveals that, to match the recurrences, it is sufficient to modify the sqrt partial remainder (PR) such that $w_{\text{new}}[j] = \frac{w_{\text{old}}[j]}{2}$.
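One way to see why halving the sqrt PR is enough is to write both recurrences side by side. The snippet does not state them, so the forms below are an assumption: they are the standard radix-$r$ digit-recurrence forms found in the literature such as [10], with $S[j]$ denoting the partial root after $j$ digits and $s_{j+1}$ the next root digit.

```latex
\begin{align*}
\text{division:}\quad
  & w[j+1] = r\,w[j] - q_{j+1}\,d \\
\text{sqrt:}\quad
  & w[j+1] = r\,w[j] - 2\,S[j]\,s_{j+1} - s_{j+1}^{2}\,r^{-(j+1)} \\
\text{sqrt with } w'[j] = \tfrac{w[j]}{2}:\quad
  & w'[j+1] = r\,w'[j] - s_{j+1}\left(S[j] + \tfrac{s_{j+1}}{2}\,r^{-(j+1)}\right)
\end{align*}
```

Under these assumed forms, the halved sqrt recurrence has the same shape as the division recurrence, with the bracketed term built from the partial root taking the place of the divisor $d$.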

Timing analysis

A fast and easy-to-use delay model, called logical effort, was introduced by Sutherland et al. [5]. This model is accurate enough not only to predict whether circuit a is faster than circuit b but also to give an approximation of the absolute delay of a circuit. A survey of reports involving delay estimation shows that this method is very popular among recent researchers as well as circuit designers [31], [32], [33].
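For readers unfamiliar with the model, the sketch below shows the basic logical effort calculation for a single path: the minimum delay is N·F^(1/N) + P, where F is the product of the logical, branching and electrical efforts and P is the sum of the parasitic delays. The function name `path_delay` and the example gate values (textbook figures for an inverter and a 2-input NAND, with the inverter parasitic delay taken as 1) are illustrative assumptions and are not taken from the paper's own analysis.

```python
def path_delay(stages, electrical_effort, branching_effort=1.0):
    """Estimate the minimum delay of a path, in units of tau.

    stages: list of (logical_effort, parasitic_delay) pairs, one per gate.
    electrical_effort: load capacitance divided by input capacitance of the path.
    branching_effort: product of the branching efforts along the path.
    """
    n = len(stages)
    path_logical_effort = 1.0
    path_parasitic = 0.0
    for logical_effort, parasitic in stages:
        path_logical_effort *= logical_effort
        path_parasitic += parasitic
    # Path effort F = G * B * H.
    path_effort = path_logical_effort * branching_effort * electrical_effort
    # Delay is minimised when every stage bears the same effort F**(1/N).
    return n * path_effort ** (1.0 / n) + path_parasitic


# Assumed textbook values: (logical effort, parasitic delay).
INVERTER = (1.0, 1.0)
NAND2 = (4.0 / 3.0, 2.0)

if __name__ == "__main__":
    # A single inverter driving four identical inverters.
    print(path_delay([INVERTER], electrical_effort=4.0))
    # A NAND2 followed by an inverter, driving 12 times the input capacitance.
    print(path_delay([NAND2, INVERTER], electrical_effort=12.0))
```

The result is expressed in units of the process-dependent delay τ, so converting it to nanoseconds requires a value of τ for the target technology.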

Using the method of logical effort, the critical path delay of the proposed

Conclusion

A new realisation of the QDS function based on the new comparison multiples approach is proposed. The QDS function is then used for implementing a minimally redundant radix-4 FP SRT division. In this method, which is described both mathematically and architecturally, the quotient digit is calculated directly in sign-and-magnitude format instead of being searched for in a lookup table. Using the new representation for the quotient digits, the fan-out of some components on the critical

References (35)

M.D. Ercegovac et al. Module to perform multiplication, division, and square root in systolic arrays for matrix computations. J Parallel Distr Comput (1991)
S.F. Oberman et al. Division algorithms and implementations. IEEE Trans Comput (1997)
P. Soderquist et al. Area and performance tradeoffs in floating-point divide and square-root implementations. ACM Comput Surv (CSUR) (1996)
R.E. Bryant. Bit-level analysis of an SRT divider circuit
C.B. Moler. A tale of two numbers. SIAM News (1995)
I.E. Sutherland et al. Logical effort: designing fast CMOS circuits (1999)
Oberman SF, Quach N, Flynn MJ. The design and implementation of a high-performance floating-point divider. Technical...
Cocke J, Sweeney DW. High speed arithmetic in a parallel device. Technical report, IBM Corp.; February...
J.E. Robertson. A new class of digital division methods. IRE Trans Electron Comput (1958)
K.D. Tocher. Techniques of multiplication and division for automatic binary computers. Q J Mech Appl Math (1958)
M.D. Ercegovac et al. Division and square root: digit-recurrence algorithms and implementations (1994)
IEEE. Std 754-1985 IEEE standard for binary floating-point arithmetic, Standards Committee of The IEEE Computer...
B. Parhami. Computer arithmetic: algorithms and hardware designs (2000)
N. Burgess et al. Choices of operand truncation in the SRT division algorithm. IEEE Trans Comput (1995)
Oberman SF, Flynn MJ. Measuring the complexity of SRT tables. Technical report CSL-TR-95-679, Computer Systems...
H. Rueß et al. Modular verification of SRT division
Antelo E, Lang T, Montuschi P, Nannarelli A. Fast radix-4 retimed division with selection by comparisons. In:...
