Received 25 February 2020; revised 16 March 2020; accepted 17 March 2020. Date of publication 20 March 2020; date of current version 13 April 2020. Digital Object Identifier 10.1109/OJCOMS.2020.2982355

## Computational Power Evaluation for Energy-Constrained Wireless Communications Systems

MARYAM TARIQ<sup>1</sup> (Student Member, IEEE), ARAFAT AL-DWEIK<sup>1,2</sup> (Senior Member, IEEE), BAKER MOHAMMAD<sup>1</sup> (Senior Member, IEEE), HANI SALEH<sup>1</sup> (Senior Member, IEEE), AND THANOS STOURAITIS<sup>1</sup> (Fellow, IEEE)

> <sup>1</sup>Department of Electrical Engineering and Computer Science, Khalifa University, Abu Dhabi, UAE <sup>2</sup>Department of Electrical and Computer Engineering, Western University, London, ON N6A 3K7, Canada

> > CORRESPONDING AUTHOR: A. AL-DWEIK (e-mail: dweik@fulbrightmail.org)

This work was supported in part by the KU Center for Cyber Physical Systems (C2PS) under Grant C2PS-T2. This work was published in part at the International Conference on Electrical and Computing Technologies and Applications, Ras Al Khaimah, UAE, Nov. 2017, pp. 1–6, doi: 10.1109/ICECTA.2017.8251965.

**ABSTRACT** Estimating the power consumption and computational complexity of various digital signal processing (DSP) algorithms used in wireless communications systems is critical for assessing the feasibility of implementing such algorithms in hardware, and for designing energy-constrained communications systems. Therefore, this paper presents a novel approach, based on practical system measurements using field programmable gate array (FPGA) and application-specific integrated circuit (ASIC) platforms, to evaluate the power consumption and the associated computational complexity of the most common mathematical operations performed within various DSP algorithms. Using the proposed approach, a new metric is developed for mapping the computational complexity to the computational power consumed by the mathematical operations in wireless transceivers. This allows combining the commonly used computational complexity metrics that are typically computed for each mathematical operation separately. Consequently, a single unified metric can be used to describe the entire algorithm. Therefore, the comparison and trade-offs between different algorithms become easier and more informative. The developed approach is used to evaluate the computational power of several DSP algorithms used in wireless communications systems, and to perform thorough computational complexity comparisons. The obtained results reveal that computational complexity comparisons using different mathematical operations can be highly misleading in several scenarios. The power consumption evaluation of the considered DSP algorithms shows that some algorithms may require prohibitively high power, which makes such algorithms unsuitable for power-constrained wireless communications systems. The results also show that the proposed methodology can be adopted for various hardware implementations; however, some calibration might be required based on the adopted platform.

**INDEX TERMS** Power, complexity, power optimization, computational power, computational complexity, FPGA, ASIC, PAPR, CFO, channel estimation.

## I. INTRODUCTION

## A. OVERVIEW

ENHANCING the battery lifetime through power optimization is currently one of the critical challenges for several wireless communications systems. Although massive research efforts have been devoted to developing power-efficient communications systems, the total power consumption per user and per unit time is drastically increasing [1]. Such an increase is due to the proliferation of power-hungry high-data rate applications, such as video communications and online gaming. For example, the data traffic per smartphone per month in North America has



FIGURE 1. A general classification of the power budget of a wireless transceiver.

increased from 3.6 GB in 2015 to 5.1 GB in 2016, and it is expected to exceed 25 GB in 2022, an increase of about 5-fold [2]. Therefore, the demand for power-efficient systems will remain high.

The power consumed by a communications device is typically shared between the transmitter and receiver, each of which has two main subsystems, an analog and a digital subsystem [3]. At the transmitter side, the analog part consumes most of the power because it includes the power-hungry power amplifier (PA) [3], in addition to other components such as the mixer, phase locked loop (PLL), voltage controlled oscillator (VCO), etc. The digital part of the transmitter is typically less power demanding because the signal processing at the transmitter is relatively small. As shown in Fig. 1, the main digital processing at the transmitter is required for source and channel encoding, encryption and modulation. The modulation part, for certain applications, may require significant signal processing, as in the case of peak-to-average power ratio (PAPR) reduction techniques [4], [5], or when the transmitter has to perform some optimization for adaptive bit and power allocation in multicarrier systems [6]-[8]. It is worth noting that the power consumed by the PA depends substantially on the range and data rate that the transmitter supports [1]. Therefore, power adaptation is indispensable for applications where the transmission range and data rate may vary over time [1], [9], [10].

Similar to the transmitter, the analog part at the receiver may also consume considerable power, particularly, the low-noise amplifier (LNA) [3]. Nevertheless, the digital part at the receiver typically has several additional digital operations, such as channel estimation, equalization, and synchronization. Moreover, it is highly desirable to design receivers that can operate at low signal-to-noise ratios (SNRs) using blind synchronization [11]–[13] and channel estimation schemes [14], [15]. However, to operate efficiently at low SNRs without pilot symbols, such schemes

VOLUME 1, 2020

usually require a considerable number of signal processing operations. Similarly, error control coding is another pivotal component in wireless receivers that allows the receiver to work at low SNRs, albeit at the expense of an extensive number of decoding iterations [16]. Consequently, the transmitter and receiver may cooperate to optimize the overall power of the entire link, i.e., the transmitter may reduce the power consumption at the receiver by increasing the transmission power, or it may save its own power and force the receiver to compensate for the saved power to maintain the same quality of service (QoS) for the users.

Generally speaking, the processing power at the digital part of the transceiver has not captured much attention in the literature, because it is typically considered negligible when compared to the power consumed at the radio frequency (RF) and analog parts of the system [17], [18]. However, several emerging applications are designed to support high data rate transmission over short distances using multicarrier modulations, such as orthogonal frequency division multiplexing (OFDM). Therefore, the power requirements of the DSP operations shown in Fig. 1 at the transmitter and receiver become non-negligible as compared to the RF and analog circuitries. Examples of such applications include device-to-device (D2D) communications [19], Wi-Fi [20], and the narrowband Internet of Things (NB-IoT) [21], [22]. In ad hoc low-power networks, the nodes will mostly be using peer-to-peer communications, because the deployment of a central node or a base station is not feasible [19], [23]. Hence, the power consumption of the computational logic becomes an important component of the device power budget, because the base station duties will be off-loaded to the nodes [10], [24].

## B. RELATED WORK

Several research papers have reported direct on-board power measurements and compared them to estimates obtained through power estimation tools. For example, [25] and [26] reported experimental measurements of power consumption for the core logic of a 45-nm Spartan 6 field programmable gate array (FPGA) and the core logic of a 65-nm Cyclone III FPGA, respectively, and compared them to values predicted by a power estimation tool. In [25], different types of $32 \times 32$-bit multipliers were implemented using both look-up tables (LUTs) and embedded units, as case studies. Their findings show a large difference between the measured and estimated power values, especially in circuits with longer combinatorial paths, where the tools tend to overestimate the number of produced glitches. The authors of [26] tested varying word length multipliers with multiple designs, such as LUTs and embedded blocks with or without pipelining stages, operating at 50 MHz. Their results reported deviations between power measurements and power estimations. Both works limited their measurements to one multiplier at a certain clock frequency. Meintanis and Papaefstathiou [27] presented a comparison between the estimated and measured power consumption of security algorithms on both Xilinx and Altera FPGAs, where it is shown that Altera's PowerPlay power estimation tool is more accurate than Xilinx's Xpower tool. However, their study focuses on the algorithm level, without considering the low-level arithmetic operations. On the other hand, [28] presented experimental measurements by implementing FPGA and application-specific integrated circuit (ASIC) circuits using purely LUT-based logic elements. Their results show that FPGAs are approximately 35 times larger in terms of area, between 3.4 and 4.6 times slower, and consume about 14 times more dynamic power than ASIC technology. However, they did not provide results regarding the power consumption of various arithmetic operations.

Therefore, to the best of the authors' knowledge, there is no work in the literature that quantifies and compares the computational power at the operation level. It is worth noting that part of this work was published in [29], where the power complexity is evaluated only for FPGAs. This work considers both the FPGA and ASIC implementations and compares the results obtained for both cases.

## C. MOTIVATION AND CONTRIBUTION

Despite the fact that DSP power in most communications systems is becoming a non-negligible element in the system's total power budget, to the best of the authors' knowledge, very little work in the literature is dedicated towards accurately quantifying such power. Instead, the computational complexity is typically used to indicate the power consumption of various algorithms, which is defined in terms of the total number of arithmetic operations required to implement a particular algorithm [4], [5], [14]–[16], [30], [31]. Although this approach is the de facto standard for computational complexity analysis, it actually has two critical limitations. The first is that it is difficult to compare the complexity of different operations accurately. Consequently, it will be difficult to accurately compare the complexity of different algorithms that have different types of operations. Second,

counting the number of operations without evaluating the associated power consumption of the considered operations does not provide sufficient information about the feasibility of implementing a particular algorithm in practical systems. For example, the number of multiplications for a particular channel estimation algorithm used for OFDM is on the order of $N^3$ [31], where N is the number of subcarriers. For most practical OFDM-based systems, N is typically greater than 128. Consequently, the power required by this algorithm is about 23 watts at a frequency of 5 MHz, as will be shown in Section III. Clearly, such a figure is more informative than just the total number of multiplications, i.e., $N^3$. Therefore, developing a general framework for estimating the DSP power consumption is pivotal for power optimization of wireless communications systems.

This work performs extensive power measurements using field programmable gate array (FPGA) and application-specific integrated circuit (ASIC) platforms to quantify the power consumption at the operation level. The results are obtained for four basic arithmetic operations, namely, addition, subtraction, multiplication, and division. Then, the obtained results are applied to several key algorithms proposed for wireless communications systems to quantitatively evaluate and compare their power consumption. In addition, we demonstrate the advantages of using the power consumption model for comparing various systems, and for providing an informative metric for the feasibility of implementing such algorithms in practical systems. Finally, the obtained results show that the relative powers of the various operations on the FPGA and ASIC are comparable, but not identical, which implies that accurate relative power evaluation may require some calibration based on the targeted hardware.

## D. PAPER ORGANIZATION

The rest of the paper is organized as follows. In Section II, the FPGA test-bed setup and measurement procedure are presented. Section III presents the obtained measurement results for the four fundamental arithmetic operations. Section IV discusses the ASIC design and presents simulation results. Section V compares the computational power analysis for several fundamental algorithms used in wireless receivers. Finally, conclusions and future work are presented in Section VI.

## II. FPGA TEST-BED SETUP AND MEASUREMENT PROCEDURE

An Artix-7 FPGA board, type Nexys 4 DDR from Xilinx [32], which is based on the XC7A100T, is used to perform power measurements for different arithmetic operations. A KEYSIGHT N6705B DC power analyzer is used to measure the power consumed by the FPGA core, by simultaneously measuring the current flowing into the FPGA core and the voltage at the power pin of the FPGA core. Unsigned 16-bit integers are considered while performing different numbers of operations at different FPGA clock



FIGURE 2. Test-bed setup [29].



FIGURE 3. Current measurement access points [29].

frequencies. In the collected measurements, the number n refers to the number of operations performed in parallel. Fig. 2 shows the testbed used to collect the measurements, whereas Fig. 3 shows the accessible test points, to which the power analyzer was connected in order to measure the core current.

In the considered designs, the output bits were all fed into a single AND gate, in order to route all the outputs of the ALU to a certain output pin and thereby prevent the FPGA synthesis tool from eliminating any operation with an unused output. Therefore, the measured power actually includes some overhead from the additional logic used; however, such logic is small relative to the overall circuitry, and thus, its impact on the system power consumption is negligible. The routing network is another factor that also dissipates part of the FPGA power [33], [34]. With an optimized interconnect design, the power dissipated by the routing network can be about 25% of the total dynamic power [34]. Therefore, it can be considered that the measured



FIGURE 4. The measured power for the addition operations in FPGA using various clock frequencies.



FIGURE 5. The measured power for the subtraction operation in FPGA using various clock frequencies.

power per operation includes the routing power for that operation.

## III. MEASUREMENT RESULTS

The measured power consumption of the considered arithmetic operations and the curve fitting results are presented in Figs. 4, 5 and 6, where the clock frequency is measured in MHz. The measurement results are obtained from [29]; however, because the power measurements are generally linear, we limit the fitting functions to first-order polynomials, which was not enforced in [29].



FIGURE 6. The measured power for the multiplication operation in FPGA using various clock frequencies.

## A. ADDITION

Fig. 4 shows the measured power needed for the addition operation. As can be noted from the figure, the power changes linearly versus n and f, and the fitting polynomial can be expressed as

$$P_{+} = (a_1 n + a_2)f \tag{1}$$

where the polynomial coefficients are $a_1 = 5.258 \times 10^{-6}$ and $a_2 = 1.223 \times 10^{-4}$. Generally speaking, the values of $P_+$ are small; however, they are non-negligible for large values of *n* and *f*.

## B. SUBTRACTION

The power required for the subtraction operation is depicted in Fig. 5, and the fitting polynomial is given by

$$P_{-} = (a_1 n + a_2)f \tag{2}$$

where the polynomial coefficients are $a_1 = 6.6 \times 10^{-6}$ and $a_2 = 3 \times 10^{-5}$. To compare the power for the addition and subtraction operations, we compute the relative average power for both operations using (1) and (2) for large values of *n*, which can be expressed as

$$\eta_{+-} = \frac{\lim_{n \to \infty} P_{-}}{\lim_{n \to \infty} P_{+}}. \tag{3}$$

Therefore, $\eta_{+-} = (6.6 \times 10^{-6}) / (5.258 \times 10^{-6}) = 1.25$. As can be noted from the result, subtraction requires some additional power as compared to addition.
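As a short worked step, the ratio in (3) can be read as the limit of the ratio of the two fits. The superscripts $(+)$ and $(-)$ are introduced here only to distinguish the coefficients of (1) and (2); they are not part of the original notation.

$$\eta_{+-} = \lim_{n \to \infty} \frac{\left(a_1^{(-)} n + a_2^{(-)}\right) f}{\left(a_1^{(+)} n + a_2^{(+)}\right) f} = \frac{a_1^{(-)}}{a_1^{(+)}} = \frac{6.6 \times 10^{-6}}{5.258 \times 10^{-6}} \approx 1.255$$

The clock frequency cancels and the offsets $a_2$ become negligible for large *n*, so the ratio depends only on the per-operation slopes.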

## C. MULTIPLICATION

The power measurements associated with the multiplication operation are depicted in Fig. 6, and the fitting polynomial is given by

$$P_{\times} = (a_1 n + a_2)f \tag{4}$$

where $a_1 = 2.578 \times 10^{-5}$ and $a_2 = 4 \times 10^{-5}$. Therefore, the power required for multiplication has a linear tendency as well, and for large values of *n*, the relative average power is $\eta_{\times +} = (2.578 \times 10^{-5})/(5.258 \times 10^{-6}) = 4.903$, which is computed using (3) except that $P_-$ is replaced by $P_{\times}$. It is worth noting that the relative power obtained in this work is close to the value assumed in [35], which stated that $P_{\times} = 4P_+$.

## D. DIVISION

Implementing a large number of division operations at high frequencies in an FPGA is generally challenging due to the FPGA space limitations. Therefore, the division operation power estimation is performed only using the ASIC design. As an example, if the division operation is implemented using the binary non-restoring division algorithm, then the maximum number of dividers that can be realized in the Artix-7 FPGA core is approximately 80. Moreover, the maximum clock frequency that can be used is about 5.88 MHz. Consequently, the results for the division operation using the FPGA would be inconsistent with the other operations. Unlike the FPGA, the ASIC implementation uses basic gates and has neither a limit on the size nor any overhead affecting the implementation results; thus, the division results are more reliable.

## E. ENERGY ANALYSIS

By noting that all considered operations are based on combinational logic design, the energy consumed by each operation can be computed as the product of the measured power and the maximum combinational path delay $\tau$; thus, the consumed energy can be expressed as $E_+ = \tau_+ P_+$, $E_- = \tau_- P_-$, and $E_{\times} = \tau_{\times} P_{\times}$ for addition, subtraction and multiplication, respectively. The values of the maximum combinational path delay for each operation are $\tau_+ \approx \tau_- = 2.055$ ns and $\tau_{\times} = 3.909$ ns. Consequently, for large values of *n* we obtain $E_{\times}/E_+ = (\tau_{\times}/\tau_+)\eta_{\times +} = 9.3264$.
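The energy ratio above can be checked numerically. The following is a minimal sketch that uses only the slope coefficients and path delays quoted in the text; the function and constant names are illustrative and not part of the original work.

```python
# Slope coefficients a_1 of the FPGA fits quoted in (1) and (4).
A1_ADD = 5.258e-6
A1_MUL = 2.578e-5

# Maximum combinational path delays quoted in the text, in nanoseconds.
TAU_ADD_NS = 2.055
TAU_MUL_NS = 3.909


def relative_power(a1_op: float, a1_ref: float) -> float:
    """Large-n power ratio; the offset a_2 and the frequency f drop out."""
    return a1_op / a1_ref


def relative_energy(tau_op: float, tau_ref: float, eta: float) -> float:
    """Energy ratio E_op / E_ref = (tau_op / tau_ref) * eta."""
    return (tau_op / tau_ref) * eta


eta_mul_add = relative_power(A1_MUL, A1_ADD)
energy_mul_add = relative_energy(TAU_MUL_NS, TAU_ADD_NS, eta_mul_add)
print(f"eta = {eta_mul_add:.3f}, E_mul/E_add = {energy_mul_add:.4f}")
```

Because the delays enter only as a ratio, their unit does not matter as long as both are expressed in the same one.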

## F. DISCRETE FOURIER TRANSFORM POWER CONSUMPTION

To verify that the power consumed by a certain algorithm can be estimated from the power of the individual operations, we consider the discrete Fourier transform (DFT) because it is the most prominent part of OFDM. Therefore, a 128-point DFT is implemented using the radix-2 algorithm. The FFT code is generated using the Spiral DFT/FFT IP Generator [36], which generates customized DFT soft IP cores in synthesizable RTL Verilog. This code generator is configured to generate a forward DFT with 16-bit fixed-point precision, and is implemented with a fully streaming architecture. The generated design consists of 24 multipliers and 40 adders. The throughput of this architecture is one transform every 64 cycles. The design is implemented on the FPGA through the Vivado software using clock frequencies of 50 and 100 MHz. The power measurement for the 50 MHz frequency shows that the power consumption of the DFT is about 57.3 mW, while the total power computed by adding the power consumed by the 24 multipliers and 40 adders is 49.5 mW, and hence, the difference is about 16%. The same procedure is applied to the 100 MHz clock frequency, and the obtained results are 86.2 and 99.2 mW for the measured and calculated power, respectively. Therefore, the difference in this case is about 13%. Consequently, the obtained results indicate that adding the powers of the individual operations can generally be considered a reasonably accurate indicator of the total power consumed by a certain algorithm.
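The calculated DFT power can be reproduced from the fits in (1) and (4). The sketch below assumes that the fits return watts when *f* is given in MHz, since this choice reproduces the calculated values quoted above; the helper names and the normalization of the deviation by the calculated value are also assumptions made for illustration.

```python
from typing import Tuple

# First-order FPGA fits quoted in (1) and (4): P = (a1 * n + a2) * f.
# Assumption: f is in MHz and P is in watts.
ADD_FIT = (5.258e-6, 1.223e-4)
MUL_FIT = (2.578e-5, 4.0e-5)


def fit_power_w(fit: Tuple[float, float], n: int, f_mhz: float) -> float:
    """Power of n parallel operations at clock frequency f_mhz."""
    a1, a2 = fit
    return (a1 * n + a2) * f_mhz


def dft_power_mw(n_mul: int, n_add: int, f_mhz: float) -> float:
    """Sum the multiplier and adder powers and convert to milliwatts."""
    total_w = fit_power_w(MUL_FIT, n_mul, f_mhz) + fit_power_w(ADD_FIT, n_add, f_mhz)
    return 1e3 * total_w


def relative_deviation(measured_mw: float, calculated_mw: float) -> float:
    """Deviation normalized by the calculated value (an assumed convention)."""
    return abs(measured_mw - calculated_mw) / calculated_mw


# Measured DFT powers quoted in the text, keyed by clock frequency in MHz.
MEASURED_MW = {50.0: 57.3, 100.0: 86.2}

for f_mhz, measured in MEASURED_MW.items():
    calculated = dft_power_mw(n_mul=24, n_add=40, f_mhz=f_mhz)
    deviation = relative_deviation(measured, calculated)
    print(f"{f_mhz:5.0f} MHz: calculated {calculated:.1f} mW, deviation {deviation:.0%}")
```

The per-block offset $a_2$ is counted once per operation type here, which matches treating the 24 multipliers and the 40 adders as two groups of parallel operations.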

## IV. ASIC DESIGN

The proposed model was extended to ASIC technology by examining the power consumption values reported by the Synopsys Design Compiler synthesis tool. The proposed architectures were coded in Verilog HDL and simulated for functional verification. The designs were synthesized with Design Compiler using an industry-standard, tape-out-proven standard-cell flow. The standard-cell library was designed for the GlobalFoundries 65 nm low-power process; it was fully characterized in silicon and is in an industry-standard, tape-out-ready form. Power consumption was estimated from the synthesized design using the same tool. It is worth noting that the 65 nm technology has been widely considered for the design of various wireless communications systems as reported in [37] and the references listed therein.

The code used to implement the mathematical operations on the FPGA was synthesized with very few modifications to suit the ASIC implementation. As anticipated, the results show that the ASIC design exhibits much lower power consumption compared to the FPGA, which in turn affected the power ratios of the various operations relative to addition. The minimum value of the operating frequency f used is 15 kHz and the maximum value is 5 MHz. This modification of the frequency range, as compared to the FPGA, is applied to produce accurate power estimation at low operating frequencies. Therefore, the operating frequencies are close to those used in some common practical wireless receivers such as Long Term Evolution (LTE) handsets. Moreover, the number of operations was extended to 400 operations. It is worth noting that using such low operating frequencies for the FPGA is not feasible because measuring the FPGA power consumption at such low frequencies is very challenging, as the consumed power is very close to the FPGA static power.

## A. ADDITION

Fig. 7 represents the addition operation power consumption reported by the ASIC synthesis tool versus the number of adders n at various clock frequencies f. As can be



FIGURE 7. The measured power for various numbers of additions with different clock frequencies using ASIC design.

noted from the figure, the consumed power increases linearly with n and f, which agrees with the trend obtained by the FPGA. However, the fitting polynomial has different constant coefficients as compared to those in (1), which is due to the substantial decrease in power consumption of the ASIC when compared to addition implemented using FPGA. The polynomial can be expressed as

$$\mathcal{P}_{+} = (a_1 + a_2 f)n \tag{5}$$

where the coefficients are  $a_1 = 2.003 \times 10^{-10}$ , and  $a_2 = 4.023 \times 10^{-7}$ . Therefore, addition in FPGA is more power demanding than the ASIC by a factor of

$$\zeta_{+} = \frac{\lim_{n \to \infty} P_{+}}{\lim_{n \to \infty} \mathcal{P}_{+}} = 13.13. \tag{6}$$

The result in (6) implies that the power consumed by  $\sim 13$  ASIC adders is equivalent to the power consumed by a single FPGA adder.

The subtraction operation was also considered using the ASIC design, and the obtained results show that  $\mathcal{P}_+ \approx \mathcal{P}_-$ . Therefore, the same polynomial will be used for addition and subtraction.

## B. MULTIPLICATION

Similarly, the multiplication operation is implemented using ASIC design, where the consumed power results are reported and plotted along with their linear fitting, as shown in Fig. 8. The results show a reduction in power consumption as compared to the FPGA data. The trend clearly follows the same linear behavior of power versus the number of operations



FIGURE 8. The measured power for various numbers of multiplications with different clock frequencies using ASIC design.

and the operating frequency. The multiplication power fitting polynomial related to the ASIC design is given by,

$$\mathcal{P}_{\times} = (a_1 + a_2 f)n \tag{7}$$

where  $a_1 = 1.073 \times 10^{-9}$  and  $a_2 = 1.247 \times 10^{-6}$ . In order to compare the power consumption of addition and multiplication in ASIC technology, the relative average power for both operations is computed as,

$$\eta_{\times+} = \frac{\lim_{n \to \infty} \mathcal{P}_{\times}}{\lim_{n \to \infty} \mathcal{P}_{+}} = \frac{1.073 \times 10^{-9} + 1.247 \times 10^{-6} f}{2.003 \times 10^{-10} + 4.023 \times 10^{-7} f}. \tag{8}$$

Therefore,  $\eta_{\times+}$  depends on the clock frequency where  $\lim_{f\to 0} \eta_{\times+} = 5.35$ , and  $\lim_{f\to\infty} \eta_{\times+} = 3.099$ . As can be noted from the results, the relative multiplication power ratio is comparable to the one obtained using the FPGA platform, which is 4.903.
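As a worked check, the two limits follow from (8) using only the coefficients of (5) and (7): the frequency-independent terms dominate as *f* approaches zero, and the terms proportional to *f* dominate as *f* grows.

$$\lim_{f \to 0} \eta_{\times+} = \frac{1.073 \times 10^{-9}}{2.003 \times 10^{-10}} \approx 5.36, \qquad \lim_{f \to \infty} \eta_{\times+} = \frac{1.247 \times 10^{-6}}{4.023 \times 10^{-7}} \approx 3.10$$

Up to rounding, these agree with the values 5.35 and 3.099 quoted above.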

The number of ASIC multiplications that are equivalent to one FPGA multiplication is given by,

$$\zeta_{\times} = \frac{\lim_{n \to \infty} P_{\times}}{\lim_{n \to \infty} \mathcal{P}_{\times}} = \frac{8.889655172 \times 10^5 f}{37 + 43000 f}. \tag{9}$$

As can be noted from (9),  $\zeta_{\times}$  depends on the clock frequency where  $\lim_{f\to\infty} \zeta_{\times} = 20.67$ . However,  $\zeta_{\times}$  is not sensitive to the variations of f given that f > 1 kHz.
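The saturation of $\zeta_{\times}$ can be illustrated numerically. The sketch below evaluates the large-*n* ratio of the fits in (4) and (7); the unit of *f* is left as whatever unit the fits use, and the function names are illustrative assumptions.

```python
# Slope of the FPGA multiplication fit in (4).
FPGA_MUL_A1 = 2.578e-5
# Coefficients (a_1, a_2) of the ASIC multiplication fit in (7).
ASIC_MUL = (1.073e-9, 1.247e-6)


def zeta_mul(f: float) -> float:
    """Large-n ratio of FPGA to ASIC multiplication power at frequency f."""
    a1, a2 = ASIC_MUL
    return FPGA_MUL_A1 * f / (a1 + a2 * f)


def zeta_mul_limit() -> float:
    """Value approached by zeta_mul(f) as f grows without bound."""
    return FPGA_MUL_A1 / ASIC_MUL[1]


# f is expressed in the (unstated here) frequency unit of the fits.
for f in (1.0, 10.0, 100.0):
    print(f"f = {f:6.1f}: zeta = {zeta_mul(f):.3f}")
print(f"limit: {zeta_mul_limit():.2f}")
```

Once the term $a_2 f$ dominates the frequency-independent term $a_1$ in (7), the ratio is within a fraction of a percent of its limit, which is why it appears insensitive to *f*.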

## C. DIVISION

The ASIC power results for the division operation are given in Fig. 9. The polynomial fitting with the minimum RMSE



FIGURE 9. The measured power for various numbers of divisions with different clock frequencies using ASIC design.

is given by

$$\mathcal{P}_{\div} = (a_1 n + a_2) f \tag{10}$$

where $a_1 = 4.4530 \times 10^{-6}$ and $a_2 = 9.4207 \times 10^{-8}$. As can be noted from the figure and (10), the division operation power changes linearly with *n* and *f*, and it is larger than that of all other operations. The relative average division-to-addition power is given by

$$\eta_{\div+} = \frac{\lim_{n \to \infty} \mathcal{P}_{\div}}{\lim_{n \to \infty} \mathcal{P}_{+}} = \frac{4.453 \times 10^7 f}{2003 + 4.023 \times 10^6 f}. \tag{11}$$

Therefore,  $\lim_{f\to\infty} \eta_{\div+} = 11.06$ . For small values of f,  $\eta_{\div+}$  remains generally unchanged given that f > 1 kHz.

As can be noted from the obtained results, the ASIC requires much less power than the FPGA, even though the FPGA is based on a 28 nm process while the ASIC uses a 65 nm process. Such behavior is obtained because the power consumption of ASICs can be controlled and optimized at a very fine level; thus, the power consumption of an ASIC would be much less than that of an FPGA that runs the same algorithm [33, p. 44]. Nevertheless, the ratios of different operations within the two technologies are comparable.

## D. ENERGY ANALYSIS

Similar to the FPGA scenario, all considered operations in the ASIC design are based on combinational logic. Therefore, the energy consumed by each operation is $\mathcal{E}_+ = \tau_+ \mathcal{P}_+$, $\mathcal{E}_- = \tau_- \mathcal{P}_-$, $\mathcal{E}_{\times} = \tau_{\times} \mathcal{P}_{\times}$, and $\mathcal{E}_{\div} = \tau_{\div} \mathcal{P}_{\div}$ for addition, subtraction, multiplication, and division, respectively. The values of the maximum combinational path delay for each

TABLE 1. The computational complexity of various PAPR reduction schemes; the relative complexity is computed using $N = 128$.

|      | RM                            | RRM  | RA                          | RRA  |
|------|-------------------------------|------|-----------------------------|------|
| CORR | 3N + 1                        | -    | 3N-1                        | -    |
| PAPR | 2N                            | 1.50 | 2N                          | 1.50 |
| DSR  | $2N\left(7 + \log_2 N\right)$ | 0.10 | $N\left(8+3\log_2 N\right)$ | 0.1  |
| MSE  | 4N + 1                        | 0.75 | 2N-1                        | 1.50 |

operation are $\tau_+ \approx \tau_- = 7.92$ ns, $\tau_{\times} = 10.13$ ns, and $\tau_{\div} = 106.93$ ns. As expected [38], the divider delay is substantially larger than that of the multiplier. By comparing the combinational logic delay in the FPGA and ASIC, it can be noted that the ASIC delay is higher, which is due to the fact that the FPGA is based on 28 nm technology, while the ASIC is based on 65 nm technology. For large values of *n* we obtain $\mathcal{E}_{\times}/\mathcal{E}_+ = (\tau_{\times}/\tau_+)\eta_{\times +} = 6.2711$, and $\mathcal{E}_{\div}/\mathcal{E}_+ = (\tau_{\div}/\tau_+)\eta_{\div +} = 149.324$.

## V. COMPUTATIONAL POWER ANALYSIS OF WIRELESS SYSTEMS

In this section, the obtained results are used to map the computational complexity to computational power for several signal processing algorithms that have been proposed for OFDM-based systems. The considered algorithms will be compared against each other using the power complexity metric, and the overall power consumption of these techniques will be discussed. The considered algorithms include carrier frequency offset (CFO) estimators, peak-to-average power ratio (PAPR) reduction schemes, and channel estimators. The mapping is performed by computing the total power consumption associated with all mathematical operations required by the corresponding algorithm. Since the power analysis is based on operations in the real domain, the conversion from the complex to the real domain is performed by noting that one complex multiplication (CM) requires four real multiplications (RMs) and two real additions (RAs), and one complex addition (CA) requires two RAs. The power for the considered techniques is evaluated for different operating clock frequencies in the range 0.015 to 5 MHz. For all the considered algorithms, the number of subcarriers is N = 128 and the cyclic prefix length is L = 8. It is worth noting that the minimum clock frequency corresponds to the 15 kHz subcarrier spacing of the long term evolution (LTE) standard, i.e., the reciprocal of the useful OFDM symbol period.
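The complex-to-real conversion described above can be sketched as follows. The type and function names are illustrative, and the inner-product example is a hypothetical workload that is not taken from the paper.

```python
from typing import NamedTuple


class RealOps(NamedTuple):
    """Counts of real multiplications (RMs) and real additions (RAs)."""
    multiplications: int
    additions: int


def complex_to_real(n_cm: int, n_ca: int) -> RealOps:
    """Convert complex operation counts to real ones.

    One CM costs four RMs and two RAs; one CA costs two RAs.
    """
    return RealOps(multiplications=4 * n_cm, additions=2 * n_cm + 2 * n_ca)


# Hypothetical example: a length-128 complex inner product,
# which needs 128 CMs and 127 CAs.
ops = complex_to_real(n_cm=128, n_ca=127)
print(ops)
```

The resulting real counts are the quantities that the fitted power polynomials take as the number of operations *n*.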

## A. PAPR REDUCTION ALGORITHMS

The partial transmit sequence (PTS) scheme, which is widely used for PAPR reduction, is considered in [5] with different PAPR metrics such as the distortion-to-signal power ratio (DSR) [39], mean squared error (MSE), and cross correlation (CORR), in addition to the actual PAPR metric [35]. The computational complexity of all considered algorithms is evaluated and compared by counting the total number of mathematical operations required for each algorithm. Table 1 shows the computational complexity of four different algorithms reported in [5]. In Table 1, the relative real multiplications (RRM) and relative real additions (RRA) are used to compare the complexity of the CORR algorithm with other PAPR reduction algorithms, where the CORR RMs and RAs are considered as the reference, i.e., the RRA is computed as the ratio of the number of RAs of the CORR to that of any other algorithm. It can be noted from Table 1 that the CORR is more complex than PAPR, and less complex than DSR, while the relative complexity of the MSE is unclear because it requires more RMs but fewer RAs.
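The relative complexity columns of Table 1 can be recomputed from the operation counts. The following is a minimal sketch with illustrative names; it only restates the table entries for $N = 128$.

```python
from math import log2

N = 128
LOG2_N = int(log2(N))  # 7 for N = 128

# (real multiplications, real additions) per algorithm, from Table 1.
COUNTS = {
    "CORR": (3 * N + 1, 3 * N - 1),
    "PAPR": (2 * N, 2 * N),
    "DSR": (2 * N * (7 + LOG2_N), N * (8 + 3 * LOG2_N)),
    "MSE": (4 * N + 1, 2 * N - 1),
}


def relative_to_corr(name: str) -> tuple:
    """Return (RRM, RRA): the CORR count divided by the algorithm's count."""
    ref_rm, ref_ra = COUNTS["CORR"]
    rm, ra = COUNTS[name]
    return ref_rm / rm, ref_ra / ra


for name in ("PAPR", "DSR", "MSE"):
    rrm, rra = relative_to_corr(name)
    print(f"{name}: RRM = {rrm:.2f}, RRA = {rra:.2f}")
```

A value above one means the algorithm needs fewer operations of that type than CORR, so the MSE row falls on opposite sides of one for the two operation types.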

The total power for the algorithms considered in Table 1 at different clock frequencies is given in Tables 2 and 3 for the FPGA and ASIC implementations, respectively.

In Tables 2 and 3, the computational power is computed based on the results of Table 1 for different clock frequencies. The results in Tables 2 and 3 confirm the claims of [5] regarding the general trends. However, some critical observations can be made based on the proposed power analysis:

- The DSR computational power requirements are substantial as compared to other techniques. Moreover, the total computational power for DSR becomes prohibitively large at high clock frequencies.
- As shown in Table 1, the CORR and MSE algorithms have opposing requirements in terms of the multiplication and addition operations, making it difficult to compare them in terms of computational complexity. Using the proposed approach, the comparison is straightforward, and it shows that there is a clear difference between the total computational power of each algorithm.

## B. CHANNEL ESTIMATION ALGORITHMS

Al-Naffouri et al. [14] proposed an algorithm for blind channel estimation in OFDM-based wireless systems. The authors initially proposed a high-complexity version, denoted as the blind algorithm, and then reduced its complexity using carrier reordering. The computational complexity of the two proposed algorithms is compared to that of a training-based algorithm at high SNRs, as shown in Table 4. The computational power of the three algorithms is given in Table 5 for the FPGA implementation and in Table 6 for the ASIC design. These results confirm the conclusions made in [14] regarding the considerable complexity reduction that can be achieved by the carrier reordering process. However, the results show that the computational power for the carrier reordering is still high, which is due to the massive number of required multiplications. Moreover, the computational power comparison between the training-based algorithm and the blind algorithm with carrier reordering shows a tremendous discrepancy even at typical values of N, which disagrees with the statement made in [14] regarding the comparable complexity of both algorithms. For example, the computational power ratio between the blind algorithm with carrier reordering and the training-based algorithm at 1.5 MHz is about 10.84 for the FPGA and 11.05 for the ASIC design.

TABLE 2. The computational power, in mW, of various PAPR reduction techniques for various clock frequencies in FPGA implementation, using N = 128.

| Clock (MHz) | 0.015 | 0.5   | 1      | 1.5    | 2.0    | 2.5    | 3.0    | 3.5    | 4.0    | 4.5    | 5.0    | Avg.   |
|-------------|-------|-------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| CORR        | 0.18  | 6.10  | 12.10  | 18.20  | 24.20  | 30.30  | 36.30  | 42.40  | 48.40  | 54.50  | 60.50  | 33.30  |
| PAPR        | 0.12  | 4.10  | 8.10   | 12.20  | 16.20  | 20.30  | 24.30  | 28.40  | 32.40  | 36.50  | 40.50  | 22.30  |
| DSR         | 1.70  | 56.00 | 112.10 | 168.10 | 224.20 | 280.20 | 336.20 | 392.30 | 448.30 | 504.30 | 560.40 | 280.30 |
| MSE         | 0.22  | 7.40  | 14.70  | 22.10  | 29.50  | 36.80  | 44.20  | 51.50  | 58.90  | 66.30  | 73.60  | 40.50  |

TABLE 3. The computational power, in mW, of various PAPR reduction techniques for different clock frequencies using N = 128 in ASIC design.

| Clock (MHz) | 0.015 | 0.5  | 1    | 1.5  | 2     | 2.5   | 3     | 3.5   | 4     | 4.5   | 5    | Avg. |
|-------------|-------|------|------|------|-------|-------|-------|-------|-------|-------|------|------|
| CORR        | 0.01  | 0.31 | 0.63 | 0.95 | 1.30  | 1.60  | 1.90  | 2.20  | 2.50  | 2.90  | 3.20 | 1.59 |
| PAPR        | 0.006 | 0.21 | 0.42 | 0.63 | 0.84  | 1.10  | 1.30  | 1.50  | 1.70  | 1.90  | 2.10 | 1.07 |
| DSR         | 0.094 | 3.00 | 6.00 | 8.90 | 11.90 | 14.90 | 17.90 | 20.90 | 23.90 | 26.80 | 29.8 | 14.9 |
| MSE         | 0.011 | 0.37 | 0.74 | 1.10 | 1.50  | 1.90  | 2.20  | 2.60  | 3.00  | 3.30  | 3.70 | 1.86 |

TABLE 4. The computational complexity of various blind channel estimation algorithms

|                    | RMs                   | RAs                                       | TERAs    |
|--------------------|-----------------------|-------------------------------------------|----------|
| Blind              | $4N(3L^2 + 11L + 18)$ | $2N(3L^2 + 7L + 5) + 2N(3L^2 + 11L + 18)$ | 751,360  |
| Carrier reordering | 4N(3L + 10)           | 4N(L+2) + 2N(3L+10)                       | 83,456   |
| Training-Based     | $4(4L^2 + 17L + 13)$  | $4(L^2 + 3L + 2) + 2(4L^2 + 17L + 13)$    | 7,650    |

TABLE 5. The computational power, in watts, of various blind channel estimation algorithms using FPGA implementation.

| Clock (MHz)        | 0.015                | 0.5   | 1     | 1.5    | 2.0   | 2.5   | 3.0   | 3.5   | 4.0   | 4.5   | 5.0   |
|--------------------|----------------------|-------|-------|--------|-------|-------|-------|-------|-------|-------|-------|
| Blind              | $7.0 \times 10^{-2}$ | 2.337 | 4.675 | 7.012  | 9.35  | 11.68 | 14.02 | 16.36 | 18.70 | 21.03 | 23.37 |
| Carrier reordering | $7.8 \times 10^{-3}$ | 0.26  | 0.521 | 0.782  | 1.043 | 1.304 | 1.564 | 1.825 | 2.086 | 2.347 | 2.608 |
| Training-Based     | $7.2 \times 10^{-4}$ | 0.024 | 0.048 | 0.0721 | 0.096 | 0.12  | 0.144 | 0.168 | 0.192 | 0.216 | 0.240 |

TABLE 6. The computational power, in mW, of various blind channel estimation algorithms using ASIC.

| Clock (MHz)        | 0.015 | 0.5    | 1      | 1.5    | 2.0    | 2.5    | 3.0    | 3.5    | 4.0    | 4.5    | 5.0    |
|--------------------|-------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| Blind              | 3.90  | 123.70 | 247.20 | 370.70 | 494.20 | 617.70 | 741.20 | 864.70 | 988.20 | 1110   | 1230   |
| Carrier reordering | 0.43  | 13.70  | 27.30  | 40.90  | 54.60  | 68.20  | 81.80  | 95.50  | 109.10 | 122.70 | 136.40 |
| Training-Based     | 0.039 | 1.20   | 2.50   | 3.70   | 5.00   | 6.20   | 7.50   | 8.70   | 10.00  | 11.20  | 12.50  |

Table 4 also shows the total equivalent real additions (TERAs) using $\eta_{\times+} = 4$. Comparing the TERAs for the blind estimation with carrier reordering to those of the training-based estimation shows that the ratio is about 10.9, which is generally close to the results obtained using the power measurements for both the FPGA and ASIC. Therefore, comparing the number of operations of two algorithms using the obtained mapping approach leads to accurate relative complexity comparisons. However, the power approach provides the relative complexity between different algorithms as well as an estimate of the power requirements for each algorithm. As can be noted from Table 5, the power required for the blind channel estimator at 5 MHz is about 23 watts, which is prohibitively high for practical implementation.
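The TERAs column of Table 4 can be reproduced with the short Python sketch below. It assumes that the TERAs are obtained as $\eta_{\times+}$ times the number of RMs plus the number of RAs, which matches the tabulated values for N = 128 and L = 8. The function names are illustrative and are not taken from the paper.

```python
ETA = 4          # equivalent real additions per real multiplication (eta_{x+})
N, L = 128, 8    # subcarriers and cyclic prefix length used in this section


def teras(rm: int, ra: int, eta: int = ETA) -> int:
    """Total equivalent real additions, assumed to be eta * RMs + RAs."""
    return eta * rm + ra


def blind(n: int, l: int) -> tuple[int, int]:
    """(RMs, RAs) of the blind estimator, per Table 4."""
    core = 3 * l**2 + 11 * l + 18
    return 4 * n * core, 2 * n * (3 * l**2 + 7 * l + 5) + 2 * n * core


def carrier_reordering(n: int, l: int) -> tuple[int, int]:
    """(RMs, RAs) of the blind estimator with carrier reordering, per Table 4."""
    core = 3 * l + 10
    return 4 * n * core, 4 * n * (l + 2) + 2 * n * core


def training_based(l: int) -> tuple[int, int]:
    """(RMs, RAs) of the training-based estimator, per Table 4."""
    core = 4 * l**2 + 17 * l + 13
    return 4 * core, 4 * (l**2 + 3 * l + 2) + 2 * core


t_blind = teras(*blind(N, L))
t_reorder = teras(*carrier_reordering(N, L))
t_training = teras(*training_based(L))

print(t_blind, t_reorder, t_training)    # 751360 83456 7650
print(round(t_reorder / t_training, 1))  # 10.9
```

The last line gives the ratio of about 10.9 quoted above for carrier reordering relative to the training-based estimator.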

## C. CARRIER FREQUENCY OFFSET ESTIMATION

CFO estimation is a critical process for coherent data detection in digital communications systems, particularly OFDM. To evaluate and compare the computational power of various CFO estimation algorithms, the work reported in [13] is used because it compares the computational complexity of several CFO estimation algorithms. Table 7 presents the computational complexity of the four CFO estimation algorithms considered in [13], which were originally reported in [12], [40].

The computational complexity given in Table 7 is mapped to the computational power in Table 8 for the FPGA and in Table 9 for the ASIC, and the results agree with the conclusions drawn in [13]. However, the computational power requirement of the technique reported in [13] is substantially larger than that of the one reported in [40]. Moreover, the computational powers of [11]–[13] are comparable. By comparing the TERAs in Table 7 with the FPGA and ASIC power in Tables 8 and 9, it can be noted that all results generally follow the same trend. However, the complexity of [11] is larger than that of [12] as shown in Table 7, while it is the opposite in Tables 8 and 9. Nevertheless, the difference is negligible.

TABLE 7. The computational complexity of various CFO estimators.

|      | RMs                     | RAs               | TERAs  |
|------|-------------------------|-------------------|--------|
| [40] | 4(N + L)                | (L + 3)(N + L)    | 3,672  |
| [11] | $3N(5+2\log_2 N)+3$     | $3N(4+3\log_2 N)$ | 43,392 |
| [12] | $3N(7 + 2\log_2 N) + 3$ | $3N(4+3\log_2 N)$ | 41,868 |
| [13] | $6N(3 + \log_2 N)$      | $N(1+9\log_2 N)$  | 38,912 |

As can be noted from the results presented in Tables 2, 5 and 8, the total estimated power for all algorithms is generally less than that reported in [29], which is due to the linearity constraint imposed on the fitting polynomials in this work. Nevertheless, the difference is only significant for the blind channel estimation case due to the large number of multiplication operations used in this algorithm.

To evaluate the weight of various DSP algorithms with respect to the main components of the transceiver, we compare the power consumed by the RF and analog baseband circuitry to the power consumed by the considered DSP algorithms. As reported in [18, Table II], the power consumed by the main blocks in the transceiver RF chain is about 553.7 mW for a peak-to-average ratio (PAR) of 10 dB. The results presented in Table 10 are computed as the ratio of the DSP algorithm power to the RF chain power, which is 553.7 mW. The table presents the results for both the ASIC and FPGA, and for each, two cases are considered, that is, {**FPGA**<sub>*B*</sub>, **FPGA**<sub>*W*</sub>} and {**ASIC**<sub>*B*</sub>, **ASIC**<sub>*W*</sub>}, where the indices *B* and *W* denote the best and worst case scenarios, respectively. For the best case scenario, we selected all DSP algorithms with the minimum power consumption, and vice versa for the worst case scenario. For both the FPGA and ASIC, the clock frequency is set to 1 MHz. As shown in the table, the total ratio has an enormous variance depending on the technology used for implementation and the selected DSP algorithms. In the FPGA case, the best case scenario gives 12.7%, while it is 2.1% for the ASIC. Consequently, even if the simplest DSP algorithms are used, the power consumed is non-negligible using either platform. If blind channel and CFO estimation is desired to improve the spectral efficiency, the power consumption might surge to 892.1% and 67.4% for the FPGA and ASIC design, respectively. Therefore, the FPGA power in this case will be significantly higher than the RF chain power, and the ASIC power is actually comparable to the RF chain power.

TABLE 8. The computational power, in watts, of various CFO estimators using FPGA.

| Clock (MHz) | 0.015                 | 0.5    | 1      | 1.5    | 2.0    | 2.5    | 3.0    | 3.5    | 4.0    | 4.5    | 5.0    |
|-------------|-----------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| [40]        | $3.30 \times 10^{-4}$ | 0.0110 | 0.0221 | 0.0331 | 0.0441 | 0.0551 | 0.0662 | 0.0772 | 0.0822 | 0.0992 | 0.1103 |
| [11]        | $3.60 \times 10^{-3}$ | 0.1195 | 0.2390 | 0.3585 | 0.4779 | 0.5974 | 0.7169 | 0.8364 | 0.9559 | 1.0754 | 1.1948 |
| [12]        | $3.90 \times 10^{-3}$ | 0.1294 | 0.2588 | 0.3882 | 0.5175 | 0.6469 | 0.7763 | 0.9057 | 1.0351 | 1.1645 | 1.2938 |
| [13]        | $3.60 \times 10^{-3}$ | 0.1207 | 0.2414 | 0.3621 | 0.4828 | 0.6035 | 0.7242 | 0.8449 | 0.9656 | 1.0862 | 1.2069 |

TABLE 9. The computational power, in mW, of various CFO estimators using ASIC design.

| Clock (MHz) | 0.015 | 0.5  | 1     | 1.5   | 2.0   | 2.5   | 3.0   | 3.5   | 4.0   | 4.5   | 5.0   |
|-------------|-------|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| [40]        | 0.02  | 0.64 | 1.30  | 1.90  | 2.60  | 3.20  | 3.80  | 4.50  | 5.10  | 5.80  | 6.40  |
| [11]        | 0.20  | 6.50 | 13.00 | 19.50 | 25.90 | 32.40 | 38.90 | 45.40 | 51.90 | 58.30 | 64.80 |
| [12]        | 0.21  | 7.00 | 13.90 | 20.20 | 27.90 | 34.80 | 41.80 | 48.70 | 55.70 | 62.70 | 69.60 |
| [13]        | 0.20  | 6.40 | 12.90 | 19.30 | 25.80 | 32.20 | 38.60 | 45.10 | 51.50 | 57.90 | 64.40 |

TABLE 10. Relative DSP power, %, to the analog transceiver RF chain.

|                    | FPGA<sub>B</sub> | FPGA<sub>W</sub> | ASIC<sub>B</sub> | ASIC<sub>W</sub> |
|--------------------|------------------|------------------|------------------|------------------|
| PAPR Reduction     | 1.4              | 20.2             | 0.076            | 1.0              |
| Channel Estimation | 8.7              | 844.3            | 0.5              | 44.7             |
| CFO Estimation     | 4.0              | 46.7             | 0.2              | 2.5              |
| Total Ratio        | 12.7             | 892.1            | 2.1              | 67.4             |
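The entries of Table 10 follow from a simple ratio, sketched below in Python. The function name `relative_power_percent` is illustrative and not from the paper. The two inputs are the 1 MHz FPGA values for DSR (Table 2) and for the blind channel estimator (Table 5), under the assumption that these are the worst-case entries used for the FPGA<sub>W</sub> column.

```python
RF_CHAIN_POWER_MW = 553.7  # RF chain power quoted from [18, Table II], in mW


def relative_power_percent(dsp_power_mw: float,
                           rf_power_mw: float = RF_CHAIN_POWER_MW) -> float:
    """DSP power expressed as a percentage of the RF chain power."""
    return 100.0 * dsp_power_mw / rf_power_mw


# FPGA worst case at 1 MHz (assumed inputs, read from Tables 2 and 5).
dsr_mw = 112.1            # DSR-based PAPR reduction, Table 2
blind_mw = 4.675 * 1e3    # blind channel estimator, Table 5 (watts to mW)

print(round(relative_power_percent(dsr_mw), 1))    # 20.2
print(round(relative_power_percent(blind_mw), 1))  # 844.3
```

Both values match the FPGA<sub>W</sub> entries for PAPR reduction and channel estimation in Table 10.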

### **VI. CONCLUSION AND FUTURE WORK**

In this paper, a novel approach and metric were presented to evaluate and compare the computational complexity and consumed power of various baseband DSP algorithms in wireless communications systems. The proposed metric is based on mapping the computational complexity of an algorithm to the computational power. Therefore, the total computational power can be computed and presented as a single metric, which makes the comparison between different algorithms simpler and more informative. The work is based on extensive measurements conducted to evaluate the computational power at the fundamental operation level, and hence, the total computational power can be computed accordingly. The collected measurements were obtained using an FPGA platform and an ASIC standard-cell based synthesis flow (using 65 nm GF process) with different numbers of mathematical operations and clock frequencies. The results showed that the computational power for the considered arithmetic operations scales almost linearly with the number of operations and the clock frequency, which enabled the construction of simple computational power models. The developed model implies that, on average, the division process is the most power-hungry operation, with a computational power that is about 11 times the power of the addition operation in ASIC. The multiplication was less power-demanding, with a single multiplication operation being equivalent to four additions in the FPGA design, and almost the same using the ASIC for a wide range of frequencies. The subtraction is comparable to the addition, with a single subtraction power being about 1.25 times that of the addition in the FPGA, while both addition and subtraction had the same values for the ASIC. The new metric was used to evaluate the total computational power for PAPR reduction, CFO, and channel estimation techniques. The obtained results showed that some of these algorithms have extremely high power requirements, which makes them infeasible for prototyping using FPGAs. Moreover, the results confirmed that the DSP power should not be neglected when designing power-constrained communications systems.

Our future work will focus on evaluating the power consumption of some essential DSP algorithms that were not included in this work such as iterative error correction codes, timing synchronization, etc.

#### ACKNOWLEDGMENT

The authors would like to thank Prof. M. S. Alouini for the valuable feedback, comments, and discussions.

#### REFERENCES

- [1] M. Kalil, A. Shami, A. Al-Dweik, and S. Muhaidat, "Low-complexity power-efficient schedulers for LTE uplink with delay-sensitive traffic," *IEEE Trans. Veh. Technol.*, vol. 64, no. 10, pp. 4551–4564, Oct. 2015.
- [2] Ericsson. Future Mobile Data Usage and Traffic Growth. Accessed: Apr. 4, 2020. [Online]. Available: https://www.ericsson.com/49dbbb/assets/local/mobility-report/documents/2016/ericsson-mobility-reportnovember-2016.pdf
- [3] B. Hall and W. Taylor. X- and Ku-Band Small Form Factor Radio Design. Accessed: Nov. 7, 2019. [Online]. Available: https://www.analog.com/en/technical-articles/x-and-ku-band-smallform-factor-radio-design.html
- [4] J. Hou, J. Ge, and J. Li, "Peak-to-average power ratio reduction of OFDM signals using PTS scheme with low computational complexity," *IEEE Trans. Broadcast.*, vol. 57, no. 1, pp. 143–148, Mar. 2011.
- [5] E. Al-Dalakta, A. Al-Dweik, A. Hazmi, C. Tsimenidis, and B. Sharif, "PAPR reduction scheme using maximum cross correlation," *IEEE Commun. Lett.*, vol. 16, no. 12, pp. 2032–2035, Dec. 2012.
- [6] Y. Iraqi and A. Al-Dweik, "Adaptive bit loading with reduced computational time and complexity for multicarrier wireless communications," *IEEE Trans. Aerosp. Electron. Syst.*, early access, doi: 10.1109/TAES.2019.2946505.
- [7] M. S. Ahmed, S. Boussakta, A. Al-Dweik, B. Sharif, and C. C. Tsimenidis, "Efficient design of selective mapping and partial transmit sequence using T-OFDM," *IEEE Trans. Veh. Technol.*, vol. 69, no. 3, pp. 2636–2648, Mar. 2020, doi: 10.1109/TVT.2019.2928361.
- [8] F. Kalbat, A. Al-Dweik, Y. Iraqi, H. Mukhtar, B. Sharif, and G. K. Karag, "Direct bit loading with reduced complexity and overhead for precoded OFDM systems," *IEEE Trans. Veh. Technol.*, vol. 68, no. 7, pp. 7169–7173, Jul. 2019.
- [9] M. Kalil, A. Al-Dweik, Y. Iraqi, H. Mukhtar, B. Sharif, and G. K. Karagiannidis, "Efficient low-complexity scheduler for wireless resource virtualization," *IEEE Wireless Commun. Lett.*, vol. 5, no. 1, pp. 56–59, Feb. 2016.
- [10] H. Mukhtar, A. Al-Dweik, M. Al-Mualla, and A. Shami, "Low complexity power optimization algorithm for multimedia transmission over wireless networks," *IEEE J. Sel. Topics Signal Process.*, vol. 9, no. 1, pp. 113–124, Feb. 2015.
- [11] Y. Yao and G. Giannakis, "Blind carrier frequency offset estimation in SISO, MIMO, and multiuser OFDM systems," *IEEE Trans. Commun.*, vol. 53, no. 1, pp. 173–183, Jan. 2005.
- [12] X. Zeng and A. Ghrayeb, "A blind carrier frequency offset estimation scheme for OFDM systems with constant modulus signaling," *IEEE Trans. Commun.*, vol. 56, no. 7, pp. 1032–1037, Jul. 2008.
- [13] A. Al-Dweik, A. Hazmi, S. Younis, B. Sharif, and C. Tsimenidis, "Blind iterative frequency offset estimator for orthogonal frequency division multiplexing systems," *IET Commun.*, vol. 4, no. 16, pp. 2008–2019, Nov. 2010.
- [14] T. Y. Al-Naffouri, A. A. Dahman, M. S. Sohail, W. Xu, and B. Hassibi, "Low-complexity blind equalization for OFDM systems with general constellations," *IEEE Trans. Signal Process.*, vol. 60, no. 12, pp. 6395–6407, Dec. 2012.
- [15] A. Saci, A. Al-Dweik, A. Shami, and Y. Iraqi, "One-shot blind channel estimation for OFDM systems over frequency-selective fading channels," *IEEE Trans. Commun.*, vol. 65, no. 12, pp. 5445–5458, Dec. 2017, doi: 10.1109/TCOMM.2017.2740925.
- [16] H. Mukhtar, A. Al-Dweik, and A. Shami, "Turbo product codes: Applications, challenges and future directions," *IEEE Commun. Surveys Tuts.*, vol. 18, no. 4, pp. 3052–3069, 4th Quart. 2016.
- [17] S. Cui, A. Goldsmith, and A. Bahai, "Energy-constrained modulation optimization," *IEEE Trans. Wireless Commun.*, vol. 4, no. 5, pp. 2349–2360, Sep. 2005.

- [18] Y. Li, B. Bakkaloglu, and C. Chakrabarti, "A system level energy model and energy-quality evaluation for integrated transceiver frontends," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 15, no. 1, pp. 90–103, Jan. 2007.
- [19] K. Doppler, M. Rinne, C. Wijting, C. B. Ribeiro, and K. Hugl, "Device-to-device communication as an underlay to LTE-advanced networks," *IEEE Commun. Mag.*, vol. 47, no. 12, pp. 42–49, Dec. 2009.
- [20] B. Bellalta, "IEEE 802.11ax: High-efficiency WLANS," *IEEE Wireless Commun.*, vol. 23, no. 1, pp. 38–46, Feb. 2016.
- [21] M. Al-Jarrah, M. A. Yaseen, A. Al-Dweik, O. A. Dobre, and E. Alsusa, "Decision fusion for IoT-based wireless sensor networks," *IEEE Internet Things J.*, vol. 7, no. 2, pp. 1313–1326, Feb. 2020, doi: 10.1109/JIOT.2019.2954720.
- [22] B. Martinez, F. Adelantado, A. Bartoli, and X. Vilajosana, "Exploring the performance boundaries of NB-IoT," *IEEE Internet Things J.*, vol. 6, no. 3, pp. 5702–5712, Jun. 2019.
- [23] P. Kamalinejad, C. Mahapatra, Z. Sheng, S. Mirabbasi, V. C. M. Leung, and Y. L. Guan, "Wireless energy harvesting for the Internet of Things," *IEEE Commun. Mag.*, vol. 53, no. 6, pp. 102–108, Jun. 2015.
- [24] L. Gopal, N. S. M. Mahayadin, A. K. Chowdhury, A. A. Gopalai, and A. K. Singh, "Design and synthesis of reversible arithmetic and logic unit (ALU)," in *Proc. IEEE Int. Conf. Comput. Commun. Control Technol.*, 2014, pp. 289–293.
- [25] J. Oliver, J. P. Acle, and E. Boemo, "Power estimations vs. power measurements in Spartan-6 devices," in *Proc. Southern Conf. Program. Logic*, Buenos Aires, Argentina, 2014, pp. 1–5.
- [26] J. Oliver and E. Boemo, "Power estimations vs. power measurements in Cyclone III devices," in *Proc. Southern Conf. Program. Logic*, Córdoba, Argentina, 2011, pp. 87–90.
- [27] D. Meintanis and I. Papaefstathiou, "Power consumption estimations vs measurements for FPGA-based security cores," in *Proc. IEEE Int. Conf. Reconfig. Comput. FPGAs*, 2008, pp. 433–437.
- [28] I. Kuon and J. Rose, "Measuring the gap between FPGAs and ASICs," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 26, no. 2, pp. 203–215, Feb. 2007.
- [29] M. Tariq, A. Al-Dweik, B. Mohammad, H. Saleh, and T. Stouraitis, "Computational power analysis of wireless communications systems using operation-level power measurements," in *Proc. Int. Conf. Elect. Comput. Technol. Appl. (ICECTA)*, Nov. 2017, pp. 1–6.
- [30] E. Arikan, "Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels," *IEEE Trans. Inf. Theory*, vol. 55, no. 7, pp. 3051–3073, Jul. 2009.
- [31] C.-H. Tseng, Y.-C. Cheng, and C.-D. Chung, "Subspace-based blind channel estimation for OFDM by exploiting cyclic prefix," *IEEE Commun. Lett.*, vol. 2, no. 6, pp. 691–694, Dec. 2013.
- [32] Nexys 4 DDR Artix-7 FPGA: Trainer Board. Accessed: Apr. 4, 2020. [Online]. Available: https://www.xilinx.com/products/boardsand-kits/1-60lhwl.html
- [33] D. Chinnery and K. Keutzer, Closing the Power Gap Between ASIC & Custom: Tools and Techniques for Low Power Design. New York, NY, USA: Springer, Jan. 2008.
- [34] S. Huda and J. H. Anderson, "Leveraging unused resources for energy optimization of FPGA interconnect," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 25, no. 8, pp. 2307–2320, Aug. 2017.
- [35] R. Baxley and G. Zhou, "Comparing selected mapping and partial transmit sequence for PAR reduction," *IEEE Trans. Broadcast.*, vol. 53, no. 4, pp. 797–803, Dec. 2007.
- [36] Spiral DFT/FFT IP Generator. Accessed: Apr. 4, 2020. [Online]. Available: https://www.spiral.net/hardware/dftgen.html
- [37] S. Hwang, S. Moon, D. Kam, I. Oh, and Y. Lee, "High-throughput and low-latency digital baseband architecture for energy-efficient wireless VR systems," *Electronics*, vol. 8, p. 815, Jul. 2019.
- [38] A. Syed. Tradeoffs Between Combinational and Sequential Dividers. Accessed: Apr. 4, 2020. [Online]. Available: https://www.synopsys.com/dw/dwtb.php?a=fp_dividers
- [39] E. Al-Dalakta, A. Al-Dweik, A. Hazmi, C. Tsimenidis, and B. Sharif, "Efficient BER reduction technique for nonlinear OFDM transmission using distortion prediction," *IEEE Trans. Veh. Technol.*, vol. 61, no. 5, pp. 2330–2336, Jun. 2012.
- [40] J. van de Beek, M. Sandell, and P.O. Borjesson, "ML estimation of time and frequency offset in OFDM systems," *IEEE Trans. Signal Process.*, vol. 45, no. 7, pp. 1800–1805, Jul. 1997.

**MARYAM TARIQ** (Student Member, IEEE) received the B.Sc. and M.Sc. degrees in communications engineering from Khalifa University, Abu Dhabi, UAE, in 2015 and 2017, respectively. In 2015, she joined Etisalat-British Telecom Innovation Center as an intern, where she worked on localization for indoor environments. Her research interest is in power optimization and lifetime maximization for power-limited wireless networks.



**ARAFAT AL-DWEIK** (Senior Member, IEEE) received the B.Sc. degree in telecommunication engineering from Yarmouk University, Jordan, in 1994, and the M.S. (*summa cum laude*) and Ph.D. (*magna cum laude*) degrees in electrical engineering from Cleveland State University, Cleveland, OH, USA, in 1998 and 2001, respectively. He was with Efficient Channel Coding, Inc., Cleveland, from 1999 to 2001, where he was a Research and Development Engineer working on advanced modulation, coding, and synchronization techniques.

From 2001 to 2003, he was the Head of the Department of Information Technology, Arab American University, Palestine. From 2003 to 2012, he was with the Communications Engineering Department, Khalifa University, UAE. From 2013 to 2014, he was an Associate Professor with the University of Guelph, Guelph, ON, Canada. He has been a Visiting Research Fellow with the School of Electrical, Electronic and Computer Engineering, Newcastle University, Newcastle upon Tyne, U.K., since 2006. He is also a Research Professor and a Member of the School of Graduate Studies, Western University, London, ON, Canada. He has received several research awards and he was a recipient of the Fulbright Scholarship from 1997 to 1999. He was a TPC Member in several major conferences, such as IEEE GLOBECOM, ICC, PIMRC, and WCNC. He has extensive editorial experience where he serves as an Associate Editor for the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY and *IET Communications*.



**BAKER MOHAMMAD** (Senior Member, IEEE) received the B.S. degree in ECE from the University of New Mexico, Albuquerque, the M.S. degree in ECE from Arizona State University, Tempe, and the Ph.D. degree in ECE from the University of Texas at Austin in 2008.

He worked for ten years with Intel Corporation on a wide range of microprocessor designs, from high-performance server chips consuming more than 100 W (IA-64) to low-power mobile embedded processors consuming less than 1 W (xscale). He was a Senior Staff Engineer/Manager with Qualcomm, Austin, USA, for six years, where he was engaged in designing high-performance and low-power DSP processors used for communication and multimedia applications. He is the Director of the System on Chip Center and an Associate Professor of EECS, Khalifa University. He has over 16 years of industrial experience in microprocessor design with emphasis on memory, low-power circuit, and physical design. He has authored or coauthored over 100 refereed journal and conference papers, three books, and 18 U.S. patents, has given multiple invited seminars and served as a panelist, and has presented three conference tutorials, including one tutorial on energy harvesting and power management for WSN at ISCAS in 2015. His research interests include VLSI, power-efficient computing, high-yield embedded memory, emerging technologies such as memristor, STTRAM, and in-memory computing, and hardware accelerators for cyber-physical systems. He is engaged in microwatt-range computing platforms for wearable electronics and WSN, focusing on energy harvesting, power management, and power conversion, including efficient dc/dc and ac/dc converters.

Dr. Mohammad has received several awards, including the KUSTAR Staff Excellence Award in intellectual property creation, the IEEE TVLSI Best Paper Award, the 2016 IEEE MWSCAS Myrill B. Reed Best Paper Award, the Qualcomm Qstar Award for Excellence on Performance and Leadership, the SRC Techon Best Session Papers for 2016 and 2017, the 2009 Best Paper Award for Qualcomm Qtech Conference, and the Intel Involve in the Community Award for Volunteer and Impact on the Community. He is an Associate Editor of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS and *Microelectronics Journal* (Elsevier). He participates in many technical committees at IEEE conferences and reviews for journals, including IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS and IEEE Circuits and Systems Society.



**HANI SALEH** (Senior Member, IEEE) received the Bachelor of Science degree in electrical engineering from the University of Jordan, the Master of Science degree in electrical engineering from the University of Texas at San Antonio, and the Ph.D. degree in computer engineering from the University of Texas at Austin. He worked for several leading semiconductor companies, including Intel (ATOM mobile microprocessor design), AMD (Bobcat mobile microprocessor design), Qualcomm (QDSP DSP core design for mobile SOCs), Synopsys (a key member of Synopsys turnkey design group where he taped out many ASICs and designed the I2C DW IP included in Synopsys DesignWare library), Fujitsu (SPARC compatible high-performance microprocessor design), and Motorola Australia (M210 low-power microprocessor synthesizable core design). He worked as a Senior Chip Designer (Technical Lead) with Apple Inc., where he worked on the design and implementation of Apple next-generation graphics cores for its mobile products (iPad, iPhone, etc.). He has been an Associate Professor of electronic engineering with Khalifa University since 2012. He is a Founder and an Active Researcher with the Khalifa University Research Center, where he leads multiple IoT projects for the development of wearable blood glucose monitoring SOC and a mobile surveillance SOC. He has a total of 19 years of industrial experience in ASIC chip design, microprocessor design, DSP core design, graphics core design, and embedded system design. He has 12 issued U.S. patents, eight pending patent applications, and over 100 articles published in peer reviewed conferences and journals in the areas of digital system design, computer architecture, DSP, and computer arithmetic. His research interests include IoT design, DSP algorithms design, DSP hardware design, computer architecture, computer arithmetic, SOC design, ASIC chip design, FPGA design, and automatic computer recognition.



**THANOS STOURAITIS** (Fellow, IEEE) received the Ph.D. degree from the University of Florida.

He is a Professor and the Chair of the Department of Electrical Engineering and Computer Science, Khalifa University, UAE. He is also a Professor Emeritus with the University of Patras, and has served on the faculties with Ohio State University, the University of Florida, New York University, and the University of British Columbia. He served on the National Scientific Board for Mathematics and Informatics, Greece, and was a Founding Council Member of the University of Central Greece. He has led several DSP processor design projects funded by the European Union, American organizations, and the Greek government and industry. He has authored about 200 technical papers, several book chapters, and holds one U.S. patent on DSP processor design. His current research interests include signal and image processing systems, application-specific processor technology and design, computer arithmetic, and design and architecture of optimal digital systems with emphasis on cryptographic systems.

Prof. Stouraitis received the IEEE Circuits and Systems Society Guillemin-Cauer Award. He served or is serving as an editor/guest editor for numerous technical journals. He served as the General Chair, a TPC Chair, and the Symposium Chair for many international conferences, such as ISCAS, SiPS, and ICECS. He has served IEEE in many ways, including as the Circuits and Systems Society President from 2012 to 2013. He is an IEEE Fellow for his contributions in digital signal processing architectures and computer arithmetic.