A novel implementation of radix-4 floating-point division/square-root using comparison multiples

https://doi.org/10.1016/j.compeleceng.2008.04.013

Abstract

A new implementation of minimally redundant radix-4 floating-point SRT division/square-root (div/sqrt), with the recurrence in signed-digit format, is introduced. The implementation is based on the comparison multiples idea. In the proposed approach, the magnitude of the quotient (root) digit is calculated by comparing the truncated partial remainder with two limited-precision multiples of the divisor (partial root). The digit sign is determined from the polarity of the truncated partial remainder. A timing evaluation using logic synthesis (Synopsys DC with the Artisan 0.18 μm typical library) shows a latency of 2.5 ns for the recurrence of the proposed div/sqrt, which is less than that of the conventional implementation.

Introduction

To achieve high performance in massive mathematical computations, almost all recent microprocessors and digital signal processors perform all four fundamental arithmetic operations in hardware, namely addition, subtraction, multiplication and division [1]. A study of processor architectures and implementations reveals that, of the four operations, division is not performed as fast as addition, subtraction and multiplication [2].

In 1994, Intel lost $475 million due to an error in the division part of the Pentium microprocessor’s floating-point unit [3], [4]. This fiasco highlights that the algorithms, architectures and realisations proposed for division are still immature and require more investigation and attention, especially when designing modern high-performance processors.

The general objective of this work is to develop an algorithm for a more robust radix-4 floating-point (FP) SRT division with less chance of error in implementation. The FP algorithm is then modified to be used for FP sqrt as well. In this work, increasing the parallelism among the functional units yields a more efficient implementation of radix-4 FP SRT div/sqrt. By employing more efficient functional modules with higher concurrency among them, a faster circuit for radix-4 FP SRT div/sqrt is obtained. The delay estimations carried out using the method of logical effort [5] and logic synthesis show a considerable decrease in execution time with respect to conventional implementations.

Section snippets

Background

Some surveys [2], [6] show that most VLSI implementations of FP division are based on the digit-recurrence division algorithms known as SRT (the SRT division algorithm was introduced independently by Sweeney [7], Robertson [8] and Tocher [9]). SRT division is an iterative algorithm with linear convergence toward the quotient. In this algorithm, the quotient digit selection (QDS) function calculates a fixed number of quotient bits in every iteration. The speed of an FP SRT divider is mainly determined by

Comparison multiples QDS

Although the latest techniques for implementing the QDS function to increase the speed of FP SRT division are well studied in the literature, the comparison multiples method [10] is still not seriously considered by designers. For example, Ercegovac and Lang [10] claim:

Since the resulting implementation is still complicated because of the need for the multiples (of the divisor) and the comparisons, we develop the following alternative, which has more flexibility and results in a

Radix-4 SRT division recurrence

Traditionally, the recurrence of radix-r FP SRT division is based on the scheme shown in Fig. 2. However, this conventional structure can be altered to achieve less delay. For radix-r FP SRT division, Nikmehr [28] introduces a recurrence that is optimised for the QDS function displayed in Fig. 4. In dividers that use the QDS function shown in Fig. 3, a factor generator precalculates ±d and ±2d, and the quotient digit $q_{j+1}$, which is determined by the QDS function, selects the
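The selection-by-comparison idea and the radix-4 recurrence can be illustrated with a minimal software sketch. It is not the circuit proposed in the paper: it uses exact rational arithmetic instead of truncated signed-digit values, and the two comparison multiples d/2 and 3d/2 are an assumption chosen because they lie inside the overlap regions of the minimally redundant digit set {-2, ..., 2}. The names `select_digit` and `srt4_divide` are hypothetical.

```python
from fractions import Fraction


def select_digit(y, d):
    """Choose a digit in {-2..2} for the shifted residual y = 4*w.

    Magnitude: two comparisons against the assumed multiples d/2 and 3d/2.
    Sign: polarity of y.
    """
    mag = abs(y)
    if 2 * mag >= 3 * d:      # |y| >= 3d/2
        m = 2
    elif 2 * mag >= d:        # |y| >= d/2
        m = 1
    else:
        m = 0
    return -m if y < 0 else m


def srt4_divide(x, d, iterations):
    """Radix-4 digit recurrence w[j+1] = 4*w[j] - q[j+1]*d.

    Assumes 0 < x < (8/3)*d so that the initial residual w[0] = x/4
    satisfies |w[0]| <= (2/3)*d. Returns (quotient estimate, final
    residual, list of digits).
    """
    x, d = Fraction(x), Fraction(d)
    w = x / 4
    q = Fraction(0)
    digits = []
    for j in range(1, iterations + 1):
        y = 4 * w                      # shifted residual
        digit = select_digit(y, d)
        w = y - digit * d              # residual update
        q += Fraction(digit, 4 ** j)   # accumulate signed digit
        digits.append(digit)
    return 4 * q, w, digits            # undo the initial scaling by 1/4


if __name__ == "__main__":
    quotient, residual, digits = srt4_divide(Fraction(3, 4), Fraction(5, 8), 8)
    print(digits, float(quotient), residual)
```

With these comparison constants the residual stays within (2/3)d in every iteration, so after n iterations the estimate differs from x/d by at most (8/3)·4^(-n). A hardware implementation would compare only a few most significant bits, which is where the limited-precision multiples come in.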

Comparison multiple sqrt

In Section 5, SRT FP sqrt and the issues related to its implementation are discussed. Due to sharing the same characteristics, $s=\sqrt{x}$ can be represented using digit recurrence algorithms as in division $q=x/d$ [10]. This implies that with minor changes, the comparison multiple technique developed for SRT FP division can be used for implementing SRT FP sqrt. To keep the discussion concise, this section covers only the changes that need to be applied to Sections 3 Comparison multiples QDS, 4 Radix-4

Combined division and sqrt

The IEEE 754 standard requires designers to implement both division and sqrt in the FP units of microprocessors [11]. Given the similarities between the recurrences of these two operations, it is natural to implement a combined circuit that performs both. Examples of such implementations can be found in [29], [30].

Comparing the specifications of FP SRT division and sqrt reveals that, to match the recurrences, it is sufficient to modify the sqrt PR such that $w_{\text{new}}[j] = w_{\text{old}}[j]/2$.
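The effect of scaling the sqrt residual can be seen from the standard textbook forms of the two recurrences. The following is a sketch under that assumption: the symbols $S[j]$ (partial root after $j$ digits) and $s_{j+1}$ (next root digit) are not taken from the text above, and the paper's own notation may differ.

```latex
% Division (radix r):
w[j+1] = r\,w[j] - q_{j+1}\,d

% Square root (radix r), standard form:
w[j+1] = r\,w[j] - 2\,S[j]\,s_{j+1} - s_{j+1}^{2}\,r^{-(j+1)}

% Substituting w'[j] = w[j]/2 into the square-root recurrence:
w'[j+1] = r\,w'[j] - s_{j+1}\left( S[j] + \tfrac{1}{2}\,s_{j+1}\,r^{-(j+1)} \right)
```

In the scaled form the partial root $S[j]$ takes the place of the divisor $d$, apart from a correction term that shrinks with $r^{-(j+1)}$, which is why the same selection hardware can plausibly serve both operations.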

Timing analysis

A fast and easy-to-use delay model, called logical effort, was introduced by Sutherland et al. [5]. This model is accurate enough not only to predict whether circuit a is faster than circuit b but also to approximate the absolute delay of a circuit. Reports involving delay estimation show that this method is very popular among recent researchers as well as circuit designers [31], [32], [33].
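As a brief illustration of how the logical effort model is applied, the sketch below computes the minimum delay of a gate path as D = N·F^(1/N) + P, where F = G·B·H is the path effort and P is the sum of parasitic delays, all in units of the process delay τ. The gate parameters are the usual textbook values for static CMOS; the function name `min_path_delay` and the example paths are hypothetical and are not taken from the paper's critical path.

```python
from math import prod

# (logical effort g, parasitic delay p) per gate type, normalised so that
# an inverter has g = 1 and p = 1.
GATES = {
    "inv": (1.0, 1.0),
    "nand2": (4.0 / 3.0, 2.0),
    "nor2": (5.0 / 3.0, 2.0),
}


def min_path_delay(stages, electrical_effort, branching=1.0):
    """Minimum path delay in units of tau: D = N * F**(1/N) + P.

    stages            -- gate type names along the path
    electrical_effort -- H = C_load / C_in of the whole path
    branching         -- B, product of branching efforts along the path
    """
    n = len(stages)
    path_logical_effort = prod(GATES[s][0] for s in stages)   # G
    path_parasitic = sum(GATES[s][1] for s in stages)         # P
    path_effort = path_logical_effort * branching * electrical_effort  # F
    return n * path_effort ** (1.0 / n) + path_parasitic


if __name__ == "__main__":
    # Compare two hypothetical paths driving the same load (H = 16).
    print(min_path_delay(["nand2", "nor2", "inv"], 16))
    print(min_path_delay(["nand2", "inv"], 16))
```

Multiplying the result by τ for a given process gives an absolute delay estimate, which is how the model supports both relative and absolute comparisons.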

Using the method of logical effort, the critical path delay of the proposed

Conclusion

A new realisation of the QDS function based on the comparison multiples approach is proposed. The QDS function is then used to implement a minimally redundant radix-4 FP SRT division. In this method, which is described both mathematically and architecturally, the quotient digit is calculated directly in sign-and-magnitude format instead of being looked up in a table. Using the new representation for the quotient digits, the fan-out of some components on the critical

References (35)

  • M.D. Ercegovac et al. Module to perform multiplication, division, and square root in systolic arrays for matrix computations. J Parallel Distr Comput (1991)
  • S.F. Oberman et al. Division algorithms and implementations. IEEE Trans Comput (1997)
  • P. Soderquist et al. Area and performance tradeoffs in floating-point divide and square-root implementations. ACM Comput Surv (CSUR) (1996)
  • R.E. Bryant. Bit-level analysis of an SRT divider circuit
  • C.B. Moler. A tale of two numbers. SIAM News (1995)
  • I.E. Sutherland et al. Logical effort: designing fast CMOS circuits (1999)
  • Oberman SF, Quach N, Flynn MJ. The design and implementation of a high-performance floating-point divider. Technical...
  • Cocke J, Sweeney DW. High speed arithmetic in a parallel device. Technical report, IBM Corp.; February...
  • J.E. Robertson. A new class of digital division methods. IRE Trans Electron Comput (1958)
  • K.D. Tocher. Techniques of multiplication and division for automatic binary computers. Q J Mech Appl Math (1958)
  • M.D. Ercegovac et al. Division and square root: digit-recurrence algorithms and implementations (1994)
  • IEEE. Std 754-1985 IEEE standard for binary floating-point arithmetic, Standards Committee of The IEEE Computer...
  • B. Parhami. Computer arithmetic: algorithms and hardware designs (2000)
  • N. Burgess et al. Choices of operand truncation in the SRT division algorithm. IEEE Trans Comput (1995)
  • Oberman SF, Flynn MJ. Measuring the complexity of SRT tables. Technical report CSL-TR-95-679, Computer Systems...
  • H. Rueß et al. Modular verification of SRT division
  • Antelo E, Lang T, Montuschi P, Nannarelli A. Fast radix-4 retimed division with selection by comparisons. In:...