©2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/MWSCAS.2016.7870083

I. Damaj, A Unified Analysis Approach for Hardware and Software Implementations, The 59<sup>th</sup> IEEE International Midwest Symposium on Circuits and Systems, Abu Dhabi, UAE, 16–19 October, 2016, pp. 577–580.


# A Unified Analysis Approach for Hardware and Software Implementations

Issam W. Damaj Department of Electrical and Computer Engineering American University of Kuwait Salmiya, Kuwait Email: idamaj@auk.edu.kw

Abstract: Modern computing systems are hybrid in nature and employ various processing technologies that range from specific- to general-purpose processors. In co-design environments, specific-purpose processors, also known as hardware, work to support software implementations under general-purpose systems to create high-performance computers. Algorithms and computationally intensive tasks are partitioned among the different processing subsystems to achieve desirable degrees of parallel processing and performance characteristics. In this paper, a unified statistical performance analysis formulation is presented. The proposed statistical formulation combines the heterogeneous characteristics of both hardware and software implementations to provide grounds for thorough evaluations. The formulation includes the development of performance profiles, key indicators, and the composition of a master indicator based on heterogeneous measurements. The investigation includes a case-study that targets a set of simple cryptographic algorithms. The two main targeted high-performance computing devices are multicore processors for software implementations and high-end Field Programmable Gate Arrays for hardware implementations.

## I. INTRODUCTION

Modern high-performance computers (HPCs) are hybrids of multi-core processors, graphics processing units (GPUs), and high-density programmable logic devices (HDPLDs), to name a few. Within hybrid systems, algorithms can be partitioned and distributed among subsystems or fully delegated to one subsystem. Hybrid HPCs are supported by rich co-analysis and co-design tools that enable unified hardware/software implementations [1]. The answer to how an algorithm implementation performs on hybrid HPCs is built upon the analysis of one or all of the underlying subsystems. Indeed, the question remains how to make adequate performance measurements in such systems. In computer system analysis, benchmarking is the act of measuring and evaluating the performance of computations, network processes, and connected peripherals, all under reference conditions [2]. A variety of benchmarks exist, including Whetstone [3], LINPACK [4], Dhrystone [5], and Standard Performance Evaluation Corporation (SPEC) [6]. Benchmarks are usually specialized; none is reported to extensively examine hybrid systems that explicitly target hardware/software co-design.

Benchmarks can be classified into Algorithm Benchmarks [7], Software Benchmarks [8], Embedded Systems Benchmarks, and Cryptographic Benchmarks [9]. Cryptographic Benchmarks are available in the literature; they are designed to measure the performance of different cryptographic algorithms running under different systems [10]. Indeed, the use of benchmarks is essential for performance analysis, classification, and, accordingly, implementation optimization.

In this paper, we present a statistical analysis framework for performance profiling of related algorithms running under different hardware and software subsystems. The developed framework enables deep and thorough reasoning about each hardware and software subsystem, and combines heterogeneous characteristics to provide overall ratings, rankings, and classifications. The proposed framework is unique in unifying different analyses of algorithms in combined indexes. Combining analysis profiles enables drawing conclusions on how algorithms can perform on today's hybrid processors. The proposed framework is customizable for any hybridization of processing systems and can target any model of computation or area of application. This paper includes a mathematical model for the proposed framework and a case-study from cryptography, and proposes a sample integration in development environments for hardware/software co-design. The case-study targets two high-performance computing systems, namely, the *Dell Precision T7500* with its *dual quad-core Xeon processor* and 24 GB of RAM, and the Altera STRATIX-II Field Programmable Gate Array (FPGA). The software tools used for analysis are *Quartus*, ModelSim, and Intel VTune Amplifier.

The paper is organized so that Section II presents the statistical analysis framework. In Section III, the framework is contextualized using a case-study on cryptographic algorithms. Section IV presents a sample integration of the statistical framework within an integrated development environment. A thorough performance analysis and evaluation is presented in Section V. Section VI concludes the paper and addresses possible future directions.

## II. THE ANALYSIS FRAMEWORK

The analysis framework classifies the heterogeneous sources of measurements into hardware and software analysis profiles (APs). The development of each profile includes the identification of a set of key indicators, such as speed, propagation delay, throughput, and power consumption. The indicators are the most extensive part of the measurement framework and should be carefully developed within the context of the application. For example, for network processors, throughput is identified as a performance indicator and measured in bits-per-second; in graphics processors, however, the same indicator can be measured in frames-per-second. The measurements associated with the identified indicators are mainly quantities. Each measured quantity is then divided by the corresponding measurement from a reference implementation for normalization and for producing performance ratios. Accordingly, we can create Combined Measurement Indicators (CMIs) using the Geometric Mean of all the calculated ratios.

To formulate the calculation of the *CMIs*, Equation 1 composes several analysis profiles:

$$CMI = AP_1 \circ AP_2 \circ \dots \circ AP_k \tag{1}$$

where  $AP_k$  is the  $k^{th}$  analysis profile.

Every profile is measured as a statistical composition of its Key Indicators (*KIs*), as in Equation 2.

$$AP_j = KI_{j,1} \circ KI_{j,2} \circ \dots \circ KI_{j,n} \tag{2}$$

where  $KI_{j,i}$  is the  $i^{th}$  Key Indicator of the  $j^{th}$  profile.

Therefore, the *CMI* is the statistical composition of all the key indicators of all profiles, as in Equation 3.

$$CMI = KI_{k.j.1} \circ KI_{k.j.2} \circ \dots \circ KI_{k.j.n} \tag{3}$$

Each Key Indicator value is then divided by its reference measurement for normalization and for producing performance ratios, as in Equation 4.

$$ratio_i = \frac{KI_{k.j.i}}{KI_{k.j.i}^{ref}},\tag{4}$$

where  $ratio_i$  is the  $i^{th}$  ratio, and  $i \in \{1..n\}$ 

Then, the *CMI* is the Geometric Mean of all $n$ ratios, as in Equation 5.

$$CMI = \sqrt[n]{ratio_1 \times ratio_2 \times \dots \times ratio_n} \tag{5}$$

The Geometric Mean is used for the *CMI* because it measures the central tendency of data values that are obtained from ratios [11].
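Equations 4 and 5 can be illustrated with a minimal Python sketch. The function names and the three indicator values below are hypothetical and chosen only for illustration; they are not measurements from this study. The sketch assumes that every indicator is oriented so that a larger value is better, and it computes the geometric mean through logarithms, which is one common way to avoid overflow when many ratios are multiplied.

```python
import math

def performance_ratios(measured, reference):
    """Equation 4: divide each key indicator by its reference measurement."""
    return {name: measured[name] / reference[name] for name in measured}

def combined_measurement_indicator(ratios):
    """Equation 5: geometric mean of the n ratios, computed via logarithms."""
    values = list(ratios.values())
    if not values or any(v <= 0 for v in values):
        raise ValueError("ratios must be a non-empty collection of positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical key-indicator values (illustrative only, not from the paper).
measured = {"KI_1": 8.0, "KI_2": 2.0, "KI_3": 1.0}
reference = {"KI_1": 2.0, "KI_2": 1.0, "KI_3": 2.0}

ratios = performance_ratios(measured, reference)  # ratios 4.0, 2.0, and 0.5
cmi = combined_measurement_indicator(ratios)      # cube root of their product
print(f"CMI = {cmi:.4f}")
```

Indicators for which a smaller value is better would need their ratio inverted before being passed to the geometric mean.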

## III. A CASE-STUDY ON THE LIGHTNESS OF CRYPTOGRAPHIC CIPHERS

The presented statistical framework is contextualized by analyzing the performance of a set of lightweight cryptographic ciphers as a case-study. The aims of the case-study comprise the following:

- Applying the presented framework in a computationally demanding application, such as cryptography.
- Developing the key indicators for the hardware subsystem.
- Developing the key indicators for the software subsystem.
- Developing a *CMI* that aids the classification of cryptographic algorithms according to their lightness; the developed *CMI* is called the Lightness Indicator (*LI*).

The *LI* classifies the investigated algorithms according to a combination of their software and hardware characteristics. The *LI* combines several key indicators including speed, memory efficiency, hardware size, and more. The analyzed cryptographic algorithms are Skipjack [12], 3-WAY [13], XTEA [14], KATAN and KTANTAN [15], and Hight [16]. The reference cipher is the Advanced Encryption Standard (AES) [17]. The literature includes a variety of implementations and performance evaluations of the addressed set of cryptographic ciphers. However, the evaluations of the targeted set of ciphers are done separately, with no common ground for cross-evaluation.

The identified performance metrics of the *LI* are classified into hardware and software profiles. The software profile comprises several indicators, including the execution time, throughput, the average number of clock cycles per instruction, and the cache miss ratio. The **Execution Time** (*ET*) is the time between the start and the completion of a task [18]. The calculation of the *ET* allows for the determination of the Performance according to:

$$Performance = \frac{1}{ET}$$

The **Throughput** (*TH*) is the total amount of work done in a given time [18]. The *TH* is application specific and could be measured, for example, in bits-per-second (bps), frames-per-second (fps), etc. The **Clock Cycles per Instruction** (*CPI*) is the average number of clock cycles each instruction takes to execute. Since different instructions may take different amounts of time depending on what they do, *CPI* is an average over all the instructions executed in the program [18]. The **Cache Miss Ratio** (*CMR*) is the fraction of memory accesses that cause a cache miss. The cache miss ratio of an application depends on the size of the cache. A larger cache can hold more cache lines and is therefore expected to incur fewer misses [18].
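The software-profile definitions above can be sketched in Python as follows. The `SoftwareRun` record, its field names, and the counter values are hypothetical; a real profiler may expose these quantities under different names and with different semantics, so this is only an illustration of how *ET*, Performance, *TH*, *CPI*, and *CMR* relate to raw counts.

```python
from dataclasses import dataclass

@dataclass
class SoftwareRun:
    """Hypothetical raw counters collected for one run of a cipher."""
    elapsed_seconds: float      # time between start and completion (ET)
    bits_processed: int         # amount of work done, here in bits
    clock_cycles: int           # total clock cycles consumed
    instructions_retired: int   # total instructions executed
    cache_accesses: int         # memory accesses that reached the cache
    cache_misses: int           # accesses that missed the cache

def software_indicators(run):
    """Derive the software-profile indicators from the raw counters."""
    return {
        "ET": run.elapsed_seconds,
        "Performance": 1.0 / run.elapsed_seconds,
        "TH": run.bits_processed / run.elapsed_seconds,     # bits-per-second
        "CPI": run.clock_cycles / run.instructions_retired,
        "CMR": run.cache_misses / run.cache_accesses,
    }

# Illustrative numbers only; they are not measurements from the case-study.
run = SoftwareRun(
    elapsed_seconds=0.5,
    bits_processed=64_000_000,
    clock_cycles=1_500_000_000,
    instructions_retired=1_000_000_000,
    cache_accesses=200_000_000,
    cache_misses=4_000_000,
)
indicators = software_indicators(run)
for name, value in indicators.items():
    print(f"{name}: {value}")
```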

The hardware profile comprises several indicators, namely, the execution time of the hardware implementation, throughput, propagation delay, the hardware area in number of lookup tables and logic registers, and power consumption. The **Propagation Delay** (*PD*) is the time required for a signal from an input pin to propagate through combinational logic and appear at an external output pin [1]. The **Look-Up Table** (*LUT*) indicator is the number of combinational adaptive lookup tables required to implement the algorithm in hardware. The number of *LUTs* is an indicator of the size of hardware in Altera devices. In other devices, the area could be measured in terms of the total number of gates, logic elements, slices, etc. **Logic Registers** (*LRs*) are the total number of logic registers in the design. The **Power Consumption** (*PC*) is the total power consumed by the developed hardware in Watts [1].

The *LI* is formulated as the composition of several assessment profiles; two for the current study. Each assessment profile is the composition of several indicators. Key indicators are benchmarked against measured reference implementations to produce ratios for each measurement. Based on Equation 5, the overall *LI* is defined as the geometric mean of all the calculated ratios (see Equations 6 and 7).

$$LI = \sqrt[l]{ratio_1 \cdot ratio_2 \cdot ratio_3 \cdots ratio_l} \tag{6}$$

and hence

$$LI = \left(\prod_{i=1}^{l} ratio_i\right)^{\frac{1}{l}} \tag{7}$$

where $l$ is the number of ratios; with the four software and six hardware indicators used in this study, $l = 10$.
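One way to read Equation 7, offered here as a reading aid rather than as part of the original derivation, is in logarithmic form, where the *LI* becomes the arithmetic mean of the log-ratios:

$$\ln LI = \frac{1}{l} \sum_{i=1}^{l} \ln ratio_i$$

Under this reading, a reference implementation measured against itself has $ratio_i = 1$ for every $i$, so $\ln LI = 0$ and $LI = 1$. Assuming each ratio is oriented so that a larger value means a lighter cipher, an *LI* above 1 indicates an algorithm that is, on balance, lighter than the reference, and an *LI* below 1 indicates the opposite.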

The LI enables the classification of cryptographic algorithms according to their lightness. A higher LI is achieved through a higher throughput, a more efficient memory performance, more compact size, less complexity, less power consumption, and less resource utilization. The master LI formula using the developed indicators is shown in Equations 8, 9, and 10. The indicators that are common to the software and hardware profiles are labeled with the profile name.

$$LI = \sqrt[10]{SWP \cdot HWP} \tag{8}$$

$$SWP = \frac{ET_{sw,ref}}{ET_{sw}} \cdot \frac{TH_{sw}}{TH_{sw,ref}} \cdot \frac{CPI_{ref}}{CPI} \cdot \frac{CMR_{ref}}{CMR} \tag{9}$$

$$HWP = \frac{ET_{hw,ref}}{ET_{hw}} \cdot \frac{TH_{hw}}{TH_{hw,ref}} \cdot \frac{PD_{ref}}{PD} \cdot \frac{LUT_{ref}}{LUT} \cdot \frac{LR_{ref}}{LR} \cdot \frac{PC_{ref}}{PC} \tag{10}$$

The derivations of Equations 9 and 10 are based on the fact that indicators are either directly or inversely proportional to the developed *CMI*.
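The orientation rule behind Equations 8, 9, and 10 can be sketched in Python as follows. The dictionary layout, the function names, and all numeric values are assumptions made for illustration; none of them are measurements from the case-study. Indicators marked "higher" enter as measured over reference, and indicators marked "lower" enter as reference over measured.

```python
# Orientation of each indicator, following Equations 9 and 10.
SOFTWARE_PROFILE = {"ET_sw": "lower", "TH_sw": "higher", "CPI": "lower", "CMR": "lower"}
HARDWARE_PROFILE = {"ET_hw": "lower", "TH_hw": "higher", "PD": "lower",
                    "LUT": "lower", "LR": "lower", "PC": "lower"}

def profile_product(profile, measured, reference):
    """Product of the oriented ratios of one profile (SWP or HWP)."""
    product = 1.0
    for name, better in profile.items():
        if better == "higher":
            product *= measured[name] / reference[name]
        else:
            product *= reference[name] / measured[name]
    return product

def lightness_indicator(measured, reference):
    """Equation 8: the 10th root of SWP times HWP (four plus six ratios)."""
    swp = profile_product(SOFTWARE_PROFILE, measured, reference)
    hwp = profile_product(HARDWARE_PROFILE, measured, reference)
    n_ratios = len(SOFTWARE_PROFILE) + len(HARDWARE_PROFILE)
    return (swp * hwp) ** (1.0 / n_ratios)

# Hypothetical reference measurements (illustrative only).
reference = {"ET_sw": 2.0, "TH_sw": 100.0, "CPI": 1.2, "CMR": 0.04,
             "ET_hw": 1.0, "TH_hw": 400.0, "PD": 10.0,
             "LUT": 2000.0, "LR": 1000.0, "PC": 0.8}

# A made-up candidate that is twice as good as the reference on every indicator.
candidate = {name: value * 2 if name in ("TH_sw", "TH_hw") else value / 2
             for name, value in reference.items()}

li_reference = lightness_indicator(reference, reference)  # every ratio is 1
li_candidate = lightness_indicator(candidate, reference)  # every ratio is 2
print(li_reference, li_candidate)
```

Because every ratio of the made-up candidate equals 2, its indicator comes out at approximately 2, while the reference measured against itself yields 1.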

## IV. PROGRAMMING INTERFACE

The developed statistical framework is embedded in a sample co-design IDE. The purpose of the proposed IDE is to automate and test the connectivity to the various analysis, synthesis, and evaluation tools employed in such a hybrid framework. The IDE is implemented in Java under NetBeans. The implementation and performance evaluation tools comprise Altera Quartus for hardware implementation and analysis, and Intel VTune Amplifier under Visual Studio for software analysis. The IDE connects to Altera Quartus using Tcl commands to synthesize designs, generate timing analyses and pin assignments for FPGA boards, and generate bit files to program the targeted FPGAs. The IDE connects to Intel VTune Amplifier, using the command line and batch files, to perform the software analysis and calculate the total execution time, CPI, etc. The generated hardware and software analysis files are exported to MS Excel to produce the complete analysis profile and charts.
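The tool connectivity described above might be sketched as follows. This is not the IDE's actual code, which is written in Java; Python is used here only for brevity. The script name, result directory, and executable name are hypothetical. The sketch assumes that the `quartus_sh` shell (with its `-t` option for running a Tcl script) and the `amplxe-cl` command-line front end of VTune Amplifier are installed and on the search path; the exact options available depend on the installed tool versions and should be checked against their documentation.

```python
def quartus_tcl_command(tcl_script):
    """Build the command that runs a Tcl script in the Quartus shell."""
    return ["quartus_sh", "-t", tcl_script]

def vtune_collect_command(result_dir, executable, *args):
    """Build the command that profiles an executable with a hotspots analysis."""
    return ["amplxe-cl", "-collect", "hotspots",
            "-result-dir", result_dir, "--", executable, *args]

# Hypothetical file names, used only to show the shape of the commands.
hw_cmd = quartus_tcl_command("synthesize_cipher.tcl")
sw_cmd = vtune_collect_command("vtune_results", "cipher_benchmark.exe")

# The commands are only printed here; a driver could pass each list to
# subprocess.run(cmd, check=True) on a machine where the tools are installed.
print(" ".join(hw_cmd))
print(" ".join(sw_cmd))
```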

## V. PERFORMANCE ANALYSIS AND EVALUATION

The presented statistical framework provides thorough performance analysis options for algorithms running under hybrid *HPCs*. The framework structure comprises an analysis profile for every processing subsystem, key indicators for each, and a formulation that produces composite indicators. Each analysis profile serves as the performance record for one processing system; this enables deep reasoning about the performance characteristics of that processing system in particular. Looking at all the analysis profiles provides an opportunity for an evaluation based on a wider range of characteristics on more than one processing system. The composite indicators, such as the lightness indicator, provide a performance analysis summary for a particular desired property. Moreover, composite indicators aid the classification and sorting of algorithms according to a combination of heterogeneous measurements.

The performance of cryptographic algorithms is a primary factor in their application integration criteria. The trade-off among level of security, cost, and performance is a main issue in designing and/or analyzing lightweight ciphers. Figures 1 and 2 depict the classification of the analyzed set of algorithms according to their lightness. An algorithm that attains a larger indicator value is lighter, smaller in size, or faster than an algorithm with a lower indicator value.

The targeted set of cryptographic algorithms, including Skipjack [12], 3-WAY [13], XTEA [14], KATAN and KTANTAN [15], and Hight [16], are all claimed to be simple, tiny, small, or lightweight. The composed *LI* is built upon the presented statistical framework to provide a classification based on actual and uniform implementations and measurements that rest on common grounds.
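The classification step can be sketched in Python using only the three *LI* scores that this paper reports in its conclusion; scores for the other ciphers are not listed in the text and are therefore left out. The comparison against 1.0 assumes that the AES reference, measured against itself, would score exactly 1.

```python
# LI scores reported in the paper's conclusion (other ciphers are not listed).
reported_li = {"3-Way": 3.38, "Hight": 2.49, "KATAN-64": 0.79}

# A larger LI means a lighter algorithm, so sort in descending order.
ranking = sorted(reported_li.items(), key=lambda item: item[1], reverse=True)

for rank, (cipher, li) in enumerate(ranking, start=1):
    # Assumption: the reference cipher scores 1.0 against itself.
    relation = "above" if li > 1.0 else "below"
    print(f"{rank}. {cipher}: LI = {li:.2f} ({relation} the reference level of 1.0)")
```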

## VI. CONCLUSION

In this paper, a statistical framework is developed to provide analysis options across different processing technologies. The framework classifies processing subsystems into profiles, where each can be contextualized according to a specific application. The statistical framework is adopted to investigate the lightness of a set of cryptographic algorithms that are claimed to be small in size, tiny, and efficient. The developed lightness indicator ranks the *3-Way* algorithm as the lightest among all with an *LI* of 3.38. *Hight* achieves the second best lightness with a score of 2.49. The lowest score of 0.79 was attained by *KATAN-64*. The case-study validates the statistical framework and leads to a successful classification. Future work includes the testing of reliability of the produced results through comparisons with results obtained using different methods. Future work also includes the expansion of the case-study to include additional analysis profiles and composite indicators with different performance characteristics.

Fig. 2. The Lightness Indicator; a radar chart

## REFERENCES

- [1] F. Vahid, *Embedded System Design: A Unified Hardware/Software Introduction*. New York: John Wiley & Sons, 2002.
- [2] S. Bouckaert, S. C. Phillips, J. Wilander, S. Rehman, W. Dabbous, and T. Turletti, "Benchmarking computers and computer networks," 2011, whitepaper. [Online]. Available: http://www-sop.inria.fr/members/Thierry.Turletti/WP11.pdf
- [3] H. Curnow and B. Wichmann, "A synthetic benchmark," *Computer Journal*, vol. 19, no. 1, pp. 43–49, 1976.
- [4] J. Dongarra and P. Luszczek, *Encyclopedia of Parallel Computing*. Springer US, 2011, ch. LINPACK Benchmark, pp. 1033–1036.
- [5] R. P. Weicker, "Dhrystone: a synthetic systems programming benchmark," *Communications of the ACM*, vol. 27, no. 10, pp. 1013–1030, 1984.
- [6] J. L. Henning, "SPEC CPU2000: Measuring CPU performance in the new millennium," *Computer*, vol. 33, no. 7, pp. 28–35, 2000.
- [7] O. Mersmann, M. Preuss, and H. Trautmann, "Benchmarking evolutionary algorithms: Towards exploratory landscape analysis," in *PPSN* (1), 2010, pp. 73–82.
- [8] M. S. Müller, "An OpenMP compiler benchmark," *Scientific Programming*, vol. 11, no. 2, pp. 125–131, 2003.
- [9] EEMBC, "Website," 2014. [Online]. Available: http://www.eembc.org/
- [10] A. J. Menezes, P. C. Van Oorschot, and S. A. Vanstone, *Handbook of Applied Cryptography*. CRC Press, 2010.
- [11] J. L. Hennessy and D. A. Patterson, *Computer Architecture: A Quantitative Approach*. Elsevier, 2011.
- [12] E. Biham, A. Biryukov, and A. Shamir, "Cryptanalysis of Skipjack reduced to 31 rounds using impossible differentials," in *Advances in Cryptology-EUROCRYPT'99*. Springer, 1999, pp. 12–23.
- [13] J. Kelsey, B. Schneier, and D. Wagner, "Related-key cryptanalysis of 3-WAY, Biham-DES, CAST, DES-X, NewDES, RC2, and TEA," *Information and Communications Security*, pp. 233–246, 1997.
- [14] V. R. Andem, "A cryptanalysis of the tiny encryption algorithm," Ph.D. dissertation, The University of Alabama, Tuscaloosa, 2003.
- [15] C. De Canniere, O. Dunkelman, and M. Knežević, "KATAN and KTANTAN: a family of small and efficient hardware-oriented block ciphers," in *Cryptographic Hardware and Embedded Systems-CHES 2009*. Springer, 2009, pp. 272–288.
- [16] D. Hong, J. Sung, S. Hong, J. Lim, S. Lee, B.-S. Koo, C. Lee, D. Chang, J. Lee, K. Jeong *et al.*, "HIGHT: A new block cipher suitable for low-resource device," in *Cryptographic Hardware and Embedded Systems-CHES 2006*. Springer, 2006, pp. 46–59.
- [17] J. Daemen and V. Rijmen, *The Design of Rijndael: AES-The Advanced Encryption Standard*. Springer, 2002.
- [18] D. A. Patterson and J. L. Hennessy, *Computer Organization and Design: The Hardware/Software Interface*, 5th ed. Morgan Kaufmann, 2013.