Abstract
This work deals with the optimization of Deep Convolutional Neural Networks (ConvNets). It elaborates on the concept of Adaptive Energy-Accuracy Scaling through multi-precision arithmetic, a solution that allows ConvNets to be adapted at run-time so as to meet different energy budgets and accuracy constraints. The strategy is particularly suited for embedded applications run at the “edge” on resource-constrained platforms. After introducing the basics that distinguish the proposed adaptive strategy, the paper recalls the software-to-hardware vertical implementation of precision-scalable arithmetic for ConvNets, then it focuses on the energy-driven per-layer precision assignment problem, describing a meta-heuristic that searches for the most suitable representation of both weights and activations of the neural network. The same heuristic is then used to explore the optimal trade-offs, providing the Pareto points in the energy-accuracy space. Experiments conducted on three ConvNets deployed in real-life applications, i.e. Image Classification, Keyword Spotting, and Facial Expression Recognition, show that adaptive ConvNets reach better energy-accuracy trade-offs than conventional static fixed-point quantization methods.
1 Introduction
Deep Neural Networks (DNNs) are computational models that emulate the activity of the human brain during pattern recognition. They consist of deep chains of neural layers that apply non-linear transformations on the input data [1]. The projection onto the new feature-space enables a more efficient classification, achieving accuracies that are close to, and in some cases even above, those scored by humans. Convolutional Neural Networks [2] (ConvNets hereafter) are the first example of DNNs applied to problems of human-level complexity. They have brought about breakthroughs in computer vision [3] and voice recognition [4], improving the state-of-the-art in many application domains. From a practical viewpoint, the forward pass through a ConvNet is nothing more than matrix multiplications between pre-trained parameters (the synaptic weights of the hidden neurons) and the input data.
The most common use-case for ConvNets is image classification, where a multi-channel image (e.g. RGB) is processed producing as output the probability that the subject depicted in the picture belongs to a specific class of objects or concepts (e.g. car, dog, airplane, etc.). One can see this end-to-end inference process as a kind of data compression: high-volume raw data (the pixels of the image) are compressed into a highly informative tag (the resulting class). In this regard, their adoption in the Internet-of-Things (IoT) is disruptive: distributed smart objects with embedded ConvNets may implement data analytics at the edge, near the source of data [5], with advantages in terms of predictability of the service response time, energy efficiency, privacy and, in general, scalability of the IoT infrastructure.
The design of embedded ConvNets encompasses a training stage during which the synaptic weights of the hidden neurons are learned using a back-propagation algorithm (e.g. Stochastic Gradient Descent [6]). The learning is supervised and accuracy-driven, namely, it adjusts the weights such that an accuracy loss function evaluated over a set of labeled samples is minimized. Once trained, the ConvNet can be flashed on the smart object and deployed at the edge, where it runs inference on previously unseen samples. Note that ConvNets presented in the literature have different depths (number of layers) and sizes (number of neurons per layer); the topology may also change due to optional layers used to reduce the cardinality of the intermediate activations, e.g. local pooling layers, or their sparsity, e.g. Rectified Linear Units (ReLU). Regardless of the internal structure, ConvNets share a common characteristic: complexity. Even relatively simple models, e.g. AlexNet [2] or the more compact MobileNets [7], have millions of synaptic weights to store and tens of thousands of matrix convolutions to run [5]. This prevents their use on low-power embedded platforms, which offer low storage capacity, low compute power, and a limited energy budget. How to design ConvNets that fit the stringent resource constraints while preserving classification accuracy is indeed the new challenge.
Recent works introduced several optimization strategies, at both the software and hardware level [8]. They mainly exploit the intrinsic redundancy of ConvNets in order to reduce (i) the number of weights/neurons (the so-called pruning methods), (ii) the arithmetic precision (quantization methods), or (iii) both [9]. Precision scaling is of practical interest due to its simplicity and the solid theories developed in the past for DSP applications. It concurrently reduces the memory footprint (the lower the bit-width, the lower the memory footprint) and the execution latency (the lower the bit-width, the faster the execution). The use of fixed-point arithmetic with 16- and 8-bit [10] instead of 32-bit floating-point, or even below, e.g. 6- and 4-bit [11], has shown remarkable savings with no, or very marginal, accuracy drop. Aggressive binarization [12] is an alternative approach, provided that a large accuracy loss is acceptable. Obviously, the implementation of quantized ConvNets calls for integer units that can process data with reduced representations; recent hardware designs, both from industry and academia, follow this trend [13,14,15].
Most of the existing optimizations, both pruning and quantization, were originally conceived as static methods. Consider quantization: for a given ConvNet, the numeric precision of the weights is defined at design-time and then kept constant at run-time. Therefore, the design effort is that of finding the proper bit-width such that accuracy losses are minimized [16]. Although effective, this approach is very conservative, as inference always operates at the statically chosen precision and hence under maximum resource usage. Adaptive strategies that speculate on the quality of results to reach higher energy efficiency are a more interesting option for portable devices deployed on non-critical missions [17]. There exist applications or use-cases for which the classification accuracy can be relaxed without noticeably affecting the user perception, or, alternatively, conditions under which other extra-functional properties of the system, e.g. energy budget or latency, get higher priority. For such cases, one may use the arithmetic precision as a control knob to manage the resources. This concept of energy-accuracy scaling is a well-established technique for VLSI designs [18], while it represents a less explored option for ConvNets (and DNNs in general).
The idea of energy-accuracy scalable ConvNets through dynamic precision scaling was first introduced in [19] and then elaborated in [20] with the introduction of an energy-driven optimization framework. The method applies to software-programmable arithmetic accelerators where precision scaling is achieved through variable-latency Multiply & Accumulate (MAC) instructions. This implementation applies to any general-purpose MCU (e.g. [21]) or application-specific processor with a multi-precision instruction-set (e.g. Google TPU [22]); it can also be extended to dedicated architectures (ASIC or FPGA accelerators [23]). This chapter further investigates this strategy by introducing a Pareto analysis of the energy-accuracy space. An optimization engine is used to identify the arithmetic precision that minimizes energy and accuracy loss concurrently. The obtained precision settings can be loaded at run-time with minimal overhead, thus allowing ConvNets to reach the operating conditions that satisfy the requirements imposed at the system level. As test-benches we used three real-life applications built upon state-of-the-art ConvNets, i.e. Image Classification [24], Keyword Spotting [25], and Facial Expression Recognition [26]. Experimental results suggest the proposed strategy is a practical solution for the development of flexible, yet efficient IoT applications.
The remaining sections are organized as follows. Section 2 gives an overview of related works in the field. Section 3 describes the implementation details for the single weight-set multi-precision arithmetic used in scalable ConvNets. Section 4 recalls the optimization engine and the energy-accuracy models adopted. Finally, Sect. 5 shows the Pareto analysis over the three benchmarks and the performance of the optimization heuristic.
2 Related Works
With the emergence of the edge-computing paradigm, the reduction of ConvNet complexity has become the new challenge for the IoT segment. The problem is being addressed from different perspectives: with the design of custom hardware that improves the execution of data-intensive loops, achieving energy efficiencies of a few picojoules per operation [11, 27]; with new learning strategies that generate less complex networks [28]; with iso-accuracy compression techniques aimed at squeezing the model complexity. A thorough review is reported in [29]. Note that while many existing techniques are conceived as static methods, the dynamic management of ConvNets is a less explored field. This work deals with this latter aspect.
2.1 Adaptive ConvNets
Following the recent literature, the concept of adaptive ConvNets may have multiple interpretations and hence different implementations. On the one hand, there are solutions that adapt to the complexity of the input data. On the other hand, solutions that adapt to external conditions or triggers, regardless of data complexity.
The former class is mainly represented by techniques that implement the general principle of coarse-to-fine computation [30]. These methods make use of branches in the internal network topology, generating conditional deep neural nets [31]. In its simplest implementation, a conditional ConvNet is made up of a chain of two classifiers, a coarse classifier (for “easy” inputs) and a fine classifier (for “hard” inputs) [32]; the coarse classifier is always on, while the fine classifier is activated only for “hard” inputs (which are statistically less frequent). As a result, ConvNets can adapt to the complexity of data at run-time. An extension with deeper chains of quantized micro-classifiers is proposed in [33], while in [34] the authors propose the use of Dynamic Voltage Accuracy Frequency Scaling (DVAFS) for the recognition of objects of different complexity.
Concerning the second class, which is the main target of this work, adaptivity is achieved by tuning the computational effort of the ConvNet depending on the desired accuracy. The control knob is the arithmetic precision of the convolutional layers. The work described in [19] is along this direction, as it introduces an HW-SW co-design to implement multi-precision arithmetic at run-time. Depending on the parallelism of the HW integer units (e.g. 16- or 8-bit), weights can be loaded and processed using different bit-widths, thus achieving different degrees of accuracy under different energy budgets. This is the enabler for energy-accuracy scaling in adaptive ConvNets. Note that, unlike static quantization methods where different accuracy levels could be achieved using multiple pre-trained weight-sets stored as separate entities, here the precision scaling is achieved using a single set of weights and incomplete arithmetic operations. The same strategy is adopted in this work.
Hybrid solutions may jointly exploit the complexity of the input problem with the accuracy imposed at the application level. For instance, the authors of [35] introduce the concept of multi-level classification where the classification task can be performed at different levels of semantic abstraction: the higher the abstraction, the easier the classification problem. Then, depending on the abstraction level and the desired accuracy, the ConvNet is tuned to achieve the maximum energy efficiency.
2.2 Fixed-Point Quantization
Since the multi-precision strategy adopted in this work encompasses fixed-point quantization, this subsection gives a brief taxonomy of the existing literature on the subject.
Complexity reduction through fixed-point quantization exploits the characteristics of the weight distributions across different convolutional layers in order to find the most efficient data representation [36]. Two main stages are involved: the definition of the bit-width, i.e. the data parallelism, and the radix-point scaling, i.e. the position of the radix point. A common practice is to define the bit-width depending on hardware availability (e.g. 16-, 8-bit for most of the architectures), then find the radix-point position that minimizes the quantization error. The existing techniques, mainly from the DSP theory, differ in the radix-point scaling scheme. A complete review is out of the scope of this work and the interested reader can refer to [8]. It is worth emphasizing that a one-size-fits-all solution does not exist as efficiency is affected by the kind of neural networks under analysis and the characteristics of the adopted hardware.
A more relevant discriminant factor is the spatial granularity at which the fixed-point format is applied, per-net or per-layer. In the former case all the layers share the same representation; in the latter case, each layer has its own representation. Since the weight distributions may substantially differ from layer to layer, a finer, i.e. per-layer, approach achieves lower accuracy loss [36].
Whatever the granularity is, existing works from the machine-learning community, e.g. [36, 37], focused on accuracy-driven optimal precision scaling. Only a few papers take hardware resources into account, which is paramount when dealing with embedded systems. The authors of [16] briefly describe a greedy approach where low precision is assigned starting from the first layer of the net (in topological order) without considering the complexity of each layer. In [10], the authors describe the design of embedded ConvNets for FPGAs and propose a per-layer precision scaling that is aware of the number of memory accesses. Only very few works, e.g. [20, 29], bring energy consumption into the optimization loop as a direct variable.
3 Energy-Accuracy Scalable Convolution
The proposed adaptive ConvNet strategy leverages precision scalable arithmetic. This section introduces a possible implementation of matrix convolution using software-programmable multi-precision Multiply & Accumulate (MAC) instructions. It first describes the algorithmic details, then it presents a custom processing element that accelerates the variable-latency MAC with minimal design overhead.
3.1 SW: Multiprecision Convolution
For a given layer in a ConvNet, the convolution between the \(M \times M\) input map matrix I and the \(M \times M\) weight matrix of a kernel W is the dot-product of the two unrolled vectors I and W of length (\(M \times M\)). The dot-product between I and W is the sum of the (\(M \times M\)) products \(I_i\times W_i\), as shown in Fig. 1.
Assuming an N-bit fixed-point representation (N = 16 in this work), \(I_i\) and \(W_i\) can be seen as two concatenated halfwords of \(K=N/2\) bits (\(K=8\)): the most significant parts \(I_i^H\) and \(W_i^H\) and the least significant parts \(I_i^L\) and \(W_i^L\). As pictorially described in Fig. 1, each single product \(I_i\times W_i\) is implemented by means of a four-cycle procedure where the most significant and least significant halfwords are iteratively multiplied, shifted and accumulated. Note that \(I_i^H\) and \(W_i^H\) are signed integers, while \(I_i^L\) and \(W_i^L\) are unsigned. Different precision options can be reached by stopping the execution at earlier cycles: half (\(K\times K\)) in 1 cycle, mixed (\(K\times N\)) in 2 cycles and full (\(N\times N\)) in 4 cycles; an additional mixed-precision option (\(N\times K\)) is also obtained by swapping the second and the third cycle (2 cycles).
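In formulas, the four-cycle procedure follows directly from splitting each operand into its two halfwords:
\[
I_i\times W_i = \left(I_i^{H}2^{K}+I_i^{L}\right)\left(W_i^{H}2^{K}+W_i^{L}\right) = I_i^{H}W_i^{H}\,2^{2K} + I_i^{H}W_i^{L}\,2^{K} + I_i^{L}W_i^{H}\,2^{K} + I_i^{L}W_i^{L}
\]
Each of the four partial products maps to one cycle on the \(K\times K\) unit; the reduced-precision options correspond to treating the discarded halfwords as zero, which removes the corresponding terms from the expansion.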
The same four options can be extended to the dot-product procedure as described in Algorithm 1. At half-precision, both operands \(I_i\) and \(W_i\) are reduced to K bits. The first loop (lines 1–2) operates on the most significant parts \(I_i^{\text {H}}\), \(W_i^{\text {H}}\). The result is then returned (line 3). At mixed-precision, only one operand, the input \(I_i\) (or the weight \(W_i\), not shown in the pseudo-code), is reduced to K bits. First, the partial result r is shifted by K bits (line 4), then the second loop (lines 5–6) iterates on \(I_i^{\text {H}}\) and \(W_i^{\text {L}}\) (\(I_i^{\text {L}}\) and \(W_i^{\text {H}}\)) and the result is returned (line 7). At full-precision, both \(W_i\) and \(I_i\) are taken as N-bit operands. In this case the last two loops (lines 8–12) come into play; they iterate on the least significant parts \(W_i^{\text {L}}\) and \(I_i\) (both H and L), thus completing the remaining part of the product. To summarize, with \(N=16\), the available precision options are: half (\(K\times K\), i.e. \(8\times 8\)), mixed (\(N\times K\), i.e. \(16\times 8\) or \(K\times N\), i.e. \(8\times 16\)), full (\(N\times N\), i.e. \(16\times 16\)). Given the regular structure of the algorithm, all of them can be implemented on the same \(K\times K\) MAC unit.
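As a concrete illustration, the Python sketch below emulates the behaviour of Algorithm 1 on signed integers; the helper names are ours, the cycle scheduling is simplified, and the real implementation runs on the \(K\times K\) MAC unit described next.

```python
K = 8  # halfword width; full words are N = 2*K = 16 bits

def split(x):
    """Split a signed value into its signed high and unsigned low halfwords."""
    return x >> K, x & ((1 << K) - 1)   # arithmetic shift keeps the sign

def dot_product(I, W, precision="full"):
    """Emulated multi-precision dot product.

    'half'  : only the high halfwords of both operands   (KxK, 1 cycle/product)
    'mixed' : input truncated to K bits, full weight      (KxN, 2 cycles/product)
    'full'  : all four partial products                   (NxN, 4 cycles/product)
    Reduced-precision results sit at coarser radix points; the output shifter
    of the PE realigns them (dynamic fixed point).
    """
    acc = 0
    for i, w in zip(I, W):
        ih, il = split(i)
        wh, wl = split(w)
        p = ih * wh                                     # cycle 1: high x high
        if precision in ("mixed", "full"):
            p = (p << K) + ih * wl                      # cycle 2: high input x low weight
        if precision == "full":
            p = (p << K) + (il * wh << K) + il * wl     # cycles 3-4: low input halfword
        acc += p
    return acc

# Example: dot_product([4660, -42], [4095, 77], precision="half")
```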

This straightforward algorithm offers a simple way to adjust the precision of the results and the resource usage. Firstly, it allows the computational effort, and hence the energy consumption, to scale with the arithmetic precision; secondly, it reduces the memory bandwidth, as fewer bits need to be moved from/to the memory banks at lower precisions (see Footnote 1).
3.2 HW: Variable-Latency Processing Element
Figure 2 gives the RTL view of the proposed processing element (PE) for \(N=16\). The PE is built around a \(9\times 9\) multiplier, where the 9\(^{th}\) bit is used for the sign extension of the operands. As described in the previous subsection, the most significant parts (\(I_i^{\text {H}}\), \(W_i^{\text {H}}\)) are signed, while the least significant parts (\(I_i^{\text {L}}\), \(W_i^{\text {L}}\)) are unsigned. Therefore, the MSB of (\(I_i^{\text {L}}\), \(W_i^{\text {L}}\)) is part of the magnitude, while that of (\(I_i^{\text {H}}\), \(W_i^{\text {H}}\)) is the sign. To account for this, the following mechanism is implemented: when (\(I_i^{\text {H}}\), \(W_i^{\text {H}}\)) are processed, the sign is extended to the 9\(^{th}\) bit by concatenating the MSB (i.e. the sign) of I and W; when (\(I_i^{\text {L}}\), \(W_i^{\text {L}}\)) are processed, a 0 is concatenated. The selection is done through the control signals signed-I and signed-W driven by the local control unit (omitted in the picture for the sake of space). The same control unit is in charge of feeding the MAC with the right sequence of data (H or L) fetched from a local memory.
The accumulator has 16 guard bits and embedded saturation logic to handle underflow and overflow. The role of the programmable shifter is twofold: first, to shift the partial results when needed (see Algorithm 1); second, to implement the dynamic fixed-point arithmetic by moving the radix point of the final accumulation result depending on the desired fractional length [39]. Range-check logic triggers saturation if the result does not fit the word-length.
In order to minimize the dynamic power consumption, a zero-skipping strategy [34] is implemented by means of latch-based operand isolation and clock-gating. If one of the operands is zero, the latches prevent the propagation of the inputs, minimizing the switching activity, while the clock-gating cell disables the clock signal, reducing the equivalent load capacitance on the clock tree.
3.3 Hardware Characterization
The proposed SW-HW precision scaling strategy can be implemented using both FPGA and ASIC technologies. In this work we designed and characterized the \(8\times 8\) MAC unit using a commercial 28 nm UTBB FDSOI technology and the Synopsys Galaxy Platform, version L-2016.03. The frequency constraint is set to 1 GHz at 0.90 V in a typical process corner (consistent with recent works that used the same technology [40]). Power consumption is extracted using Synopsys PrimeTime L-2016.06 with SAIF back-annotation. Collected results show a standard-cell area of 1443 \({{\upmu }}\mathrm{m}^{2}\) and a total average power consumption of 0.95 mW. Compared to a traditional \(8 {\times } 8\) MAC unit, the proposed architecture shows a 3.7% area penalty.
Table 1 shows the latency (\(N_\text {cycles}\)) and the energy consumption per MAC operation (\(E_\text {MAC}\)) for the four precisions available. As one can see, each row in the table corresponds to a different implementation point in the precision-energy space. If one of the two operands is zero, energy \(E_\text {zero}\) reduces substantially due to the zero-skipping logic: \(E_\text {zero}= 0.103E_\text {MAC}\).
4 Energy-Driven Precision Assignment
4.1 Fixed-Point Quantization
The shift from floating-point to fixed-point is a well-known problem in the DSP domain. In this sub-section, we review the basic theory and the main aspects relevant to this work.
A floating-point value V can be represented with a binary word Q of N bits using the following mapping function (the standard symmetric two's-complement fixed-point scheme):
\[
Q = \text {round}\left(V \cdot 2^{FL}\right)
\]
FL indicates the fraction length, i.e. the position of the radix-point in Q. Given a set of real values, the choice of N and FL determines the information loss due to quantization. Since the bit-width N is usually given as a design constraint (e.g. 16-bit in this work), the problem reduces to searching for the optimal FL (the integer length IL is then given by \(N-FL\)). The choice of FL affects the maximum representable value \(|V_\text {max}|\) and the quantization step \(Q_\text {step}\), according to the following relationships:
\[
|V_\text {max}| = 2^{\,N-1-FL} - 2^{-FL}, \qquad Q_\text {step} = 2^{-FL}
\]
A trade-off does exist: the lower the FL, the larger \(|V_\text {max}|\) but the coarser \(Q_\text {step}\); the larger the FL, the finer \(Q_\text {step}\) but the smaller the representable range. The decision of which constraint to guard more (\(|V_\text {max}|\) or \(Q_\text {step}\)) mainly depends on the distribution of the original floating-point weights and their importance in the neural model under quantization.
A dynamic fixed-point scheme is implemented where the fraction length is defined layer-by-layer. The \(FL_\text {opt}\) that minimizes the L2 distance between the original 32-bit floating-point values and the quantized values is searched among the \(N-1\) possible values. The search is done over a calibration set built by randomly picking 100 samples from the training set. Note that our problem formulation applies a symmetric linear quantization using binary radix-point scaling.
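A minimal sketch of this per-layer search follows (PyTorch-style; the helper names are ours and purely illustrative):

```python
import torch

def quantize(x, n_bits, fl):
    """Symmetric linear quantization with a binary radix point at position fl."""
    scale = 2.0 ** fl
    qmin, qmax = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
    return torch.clamp(torch.round(x * scale), qmin, qmax) / scale

def best_fraction_length(x, n_bits=16):
    """Scan the N-1 candidate radix-point positions, return the one with minimum L2 error."""
    candidates = range(n_bits - 1)
    errors = {fl: torch.norm(x - quantize(x, n_bits, fl)).item() for fl in candidates}
    return min(errors, key=errors.get)
```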
It is also worth underlining that quantization is not followed by retraining, which is a very time-consuming procedure even for small ConvNets.
4.2 Multiprecision Fixed-Point ConvNets
Problem Formulation. For a ConvNet of L layers, the classification accuracy can be scaled to different values by optimally selecting the arithmetic precision of each layer. The choice of such optimal precision should be made for the input map (I) and the weight (W) matrices of each layer i, and for the output map matrix (O) of the last layer (see Footnote 2).
Assuming the availability of the four precision options described in Sect. 3, i.e. full (\(16\times 16\)), mixed (\(16\times 8\) or \(8\times 16\)), half (\(8\times 8\)), the precision of I and W for each layer, and that of O for the last layer, can be set to either 8-bit or 16-bit. We encode the unknowns of the problem as a vector X of (2 \(\times \) L + 1) Boolean variables \(x_i\), where the variable \(x_{2\times L + 1}\) refers to O. The encoding map is: \(x=0 \rightarrow 8\)-bit, \(x=1\rightarrow 16\)-bit. The optimal assignment is the one that minimizes the total energy consumption E(X) while ensuring an accuracy loss \(\lambda (X)\) lower than a user-defined constraint \(\lambda _\text {max}\).
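In compact form, the assignment problem therefore reads:
\[
\min _{X \in \{0,1\}^{2L+1}} E(X) \quad \text {s.t.} \quad \lambda (X) \le \lambda _\text {max}
\]
which is the formulation explored by the meta-heuristic described next.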
Energy-Driven Precision Assignment. The optimal precision assignment to each layer is carried out using a custom meta-heuristic based on Simulated Annealing (SA). Algorithm 2 shows the pseudo-code of the SA. It gets as inputs the parameters listed in Table 2.
In all the experiments, the starting solution \(X_0\) assigns full precision (16-bit) to all the L layers (both I and W) and to O. The estimation of the accuracy drop is done on a subset of images randomly picked from the training set, referred to as the calibration set. Its size is defined by the \(cal\_set\) parameter.

At each iteration, the next state is generated as a random perturbation of the current state (line 6). For those states that satisfy the accuracy constraint (line 7), the energy cost function E is evaluated (line 8) through the function energy. If \(\varDelta \)E (line 9) is negative, i.e. the energy reduces (line 10), the new state is accepted (lines 11–12). If not, the new state is accepted with a Boltzmann probability (lines 10–12); the acceptance ratio gets smaller as T decreases. States that show minimum energy are iteratively saved as best solutions (lines 13–15). Once the total number of iterations is reached (line 5), the temperature T is cooled down (line 17). The process iterates until the minimum temperature \(T_\text {f}\) is reached (line 4).
The bottleneck of the algorithm is the call to the function accuracy_drop. For this reason, the algorithm keeps track of already processed states; this information is fed to the accuracy_drop function, which can then bypass the accuracy estimation for states already visited (line 16). A compact sketch of the whole loop is reported below.
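The following Python sketch restates the annealing loop under the assumptions above; energy, accuracy_drop, and perturb are placeholders for the routines described in the text, and the parameter names mirror Table 2.

```python
import math
import random

def simulated_annealing(x0, energy, accuracy_drop, perturb,
                        T0, Tf, cooling, iters, lambda_max):
    """Energy-driven precision assignment via Simulated Annealing (sketch)."""
    x_cur = list(x0)
    e_cur = energy(x_cur)
    x_best, e_best = x_cur, e_cur
    cache = {}                                   # memoized accuracy drops
    T = T0
    while T > Tf:                                # outer loop: cooling schedule
        for _ in range(iters):                   # inner loop: moves at fixed T
            x_new = perturb(x_cur)               # random flip of some x_i (8 <-> 16 bit)
            key = tuple(x_new)
            if key not in cache:                 # bypass re-evaluation of known states
                cache[key] = accuracy_drop(x_new)
            if cache[key] > lambda_max:          # discard states violating the constraint
                continue
            e_new = energy(x_new)
            dE = e_new - e_cur
            if dE < 0 or random.random() < math.exp(-dE / T):
                x_cur, e_cur = x_new, e_new      # accept: always if better, else Boltzmann
                if e_cur < e_best:
                    x_best, e_best = x_cur, e_cur
        T *= cooling                             # cool down
    return x_best, e_best
```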
Energy. The system-level architecture depicted in Fig. 3 serves as a general template to describe Application-Specific Processors for ConvNets computing, e.g. [11]. It consists of a planar array of processing elements (PE), in our case the MAC units described in Sect. 3, a set of SRAM buffers for storing temporal data (Input Buffer, Weight Buffer, and Output Buffer), an off-chip memory (DRAM) and its DMA, a control unit (RISC) that schedules the operations.
The total energy consumption E is the sum of two main contributions: \(E = E^\text {comp} + E^\text {mem}\). \(E^\text {comp}\) is the energy consumed by the PE array, \(E^\text {mem}\) is the energy consumed due to data movement through the memory hierarchy.
The first term is defined as:
\[
E^\text {comp} = \sum _{i=1}^{L} \left( N_i^\text {MAC} \cdot N_\text {cycles}(x_i) \cdot E^\text {MAC} + N_i^\text {zero} \cdot E_i^\text {zero} \right)
\]
where L is the number of layers of the ConvNet, \(E^\text {MAC}\) is the energy consumption of the half-precision MAC (row 8 \(\times \) 8 in Table 1), \(N_\text {cycles}(x_i)\) is the latency of a single MAC operation of the i-th layer, given as a multiple of the latency of the half-precision MAC and function of the precision \(x_i\), \(N_i^\text {MAC}\) is the number of non-zero MAC operations of the i-th layer, \(E_i^\text {zero}\) is the energy consumed under zero-skipping (mostly due to leakage), and \(N_i^\text {zero}\) is the number of zero MACs.
The second term is defined as:
\[
E^\text {mem} = \sum _{i=1}^{L+1} \left( \alpha _i + \beta _i + \gamma _i \right) \cdot E^\text {MAC}
\]
\(E^\text {MAC}\) is the same as in Eq. 3, while \(\alpha _i\), \(\beta _i\) and \(\gamma _i\) are three parameters that describe the energy consumed by the i-th layer due to reading/writing the input map (\(\alpha _i\)), the weights (\(\beta _i\)), and the output map (\(\gamma _i\)). More specifically, they represent the ratio between the energy consumption of the memory and the energy consumption of the PE array; here again, the energy unit is the half-precision MAC (row 8 \(\times \) 8 in Table 1) [11]. Obviously, \(\alpha \) and \(\beta \) do not contribute for the final output layer: \(\alpha _{L+1}=0\) and \(\beta _{L+1}=0\).
All three parameters are functions of the layer precision \(x_i\): both fetch and write-back operations depend on (i) the precision of the MAC algorithm, and (ii) the number of zero-multiplications (the switching activity to/from memory may change substantially). Moreover, \(\alpha _i\), \(\beta _i\), \(\gamma _i\) change depending on the ConvNet model: number and size of weights/channels per layer, stride and padding. Finally, they also differ depending on the size of the hardware components (PE array and global buffers). Since the target of this work is neither the energy model per se, nor the evaluation of different architectural solutions, \(\alpha _i\), \(\beta _i\), \(\gamma _i\) are extracted for the architecture proposed in [11] and then scaled to our precision reduction strategy. The same \(E^\text {mem}\) model applies to different architectures by properly tuning the three parameters.
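Putting the two terms together, a minimal sketch of the energy estimator used inside the optimization loop could look as follows; the cycle counts and the zero-skipping factor come from Table 1, whereas the additive form of the memory term and all the per-layer figures are illustrative assumptions of this sketch.

```python
# Energy model E = E_comp + E_mem, expressed in units of the half-precision (8x8) MAC.
CYCLES = {(8, 8): 1, (8, 16): 2, (16, 8): 2, (16, 16): 4}  # latency multiples (Table 1)
E_ZERO = 0.103   # energy of a zero-skipped MAC, relative to one 8x8 MAC (Table 1)

def layer_energy(prec, n_mac, n_zero, alpha, beta, gamma):
    e_comp = n_mac * CYCLES[prec] + n_zero * E_ZERO   # PE-array term
    e_mem = alpha + beta + gamma                      # assumed memory-access term
    return e_comp + e_mem

def network_energy(layers):
    """layers: iterable of per-layer parameter dicts (see layer_energy)."""
    return sum(layer_energy(**layer) for layer in layers)

# Illustrative usage for a toy two-layer network
example = [
    {"prec": (8, 16), "n_mac": 1.2e6, "n_zero": 3.0e5,
     "alpha": 2.0e5, "beta": 1.5e5, "gamma": 1.0e5},
    {"prec": (16, 16), "n_mac": 4.0e5, "n_zero": 8.0e4,
     "alpha": 9.0e4, "beta": 6.0e4, "gamma": 4.0e4},
]
print(network_energy(example))  # total energy in 8x8-MAC units
```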
Accuracy Drop. The accuracy drop is computed as the ratio between the number of misclassified images and the total number of images in the calibration set (\(cal\_set\)); hence its estimation implies the execution of \(\mathcal {S}\) feed-forward inferences using the quantized fixed-point model (with \(\mathcal {S}\) the cardinality of \(cal\_set\)).
Unfortunately, common GPUs do not offer efficient support for low-precision integer arithmetic. To address this issue we implemented the fake quantization proposed in [37]. It is a SW strategy that emulates the loss of information due to fixed-point arithmetic while still using the floating-point data-type. Each layer is wrapped with a software module that converts its input data and weights (32-bit floating-point) into a fake integer, namely a 32-bit floating-point number from which an amount equal to the error introduced by the fixed-point representation has been subtracted. The advantage is that all the fixed-point operations are physically run by the high-performance FP units.
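To make the mechanism concrete, a minimal PyTorch-style sketch of a fake-quantization wrapper is reported below; the function and class names are ours, and the fixed-point parameters are illustrative defaults rather than the values used in the experiments.

```python
import torch
import torch.nn.functional as F

def fake_quantize(x, n_bits, fl):
    """Round x to the grid of an <n_bits, fl> fixed-point format, keep it as float."""
    scale = 2.0 ** fl
    qmin, qmax = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
    return torch.clamp(torch.round(x * scale), qmin, qmax) / scale

class FakeQuantConv2d(torch.nn.Conv2d):
    """Conv layer whose inputs and weights are fake-quantized before the FP convolution."""
    def __init__(self, *args, n_bits=16, fl_w=8, fl_i=8, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_bits, self.fl_w, self.fl_i = n_bits, fl_w, fl_i

    def forward(self, x):
        w_q = fake_quantize(self.weight, self.n_bits, self.fl_w)
        x_q = fake_quantize(x, self.n_bits, self.fl_i)
        # the actual arithmetic still runs on the floating-point units
        return F.conv2d(x_q, w_q, self.bias, self.stride, self.padding,
                        self.dilation, self.groups)
```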
5 Results
5.1 Experimental Set-up
The objective of this work is to provide a Pareto analysis of adaptive ConvNets implemented with the proposed energy-accuracy scaling strategy. As benchmarks we adopted three different applications which are reaching widespread use in several domains: Image Classification (IC), Keyword Spotting (KWS), and Facial Expression Recognition (FER). Additional details are provided in the next subsection. The exploration of the energy-accuracy space is conducted using the SA engine introduced in Sect. 4. More specifically, the algorithm is run under different accuracy loss constraints, from 1% to 15% in steps of 1%, collecting the energy consumption reached by the optimal precision settings.
Table 3 summarizes the SA parameters used in the experiments. For all the networks we selected the same hyper-parameters, except for the number of iterations iter at a given temperature T. As described in the next sub-section, the three ConvNets have different numbers of layers, hence different complexities; as the cardinality of the search space increases, more iterations are needed to explore the cost function.
5.2 Benchmarks
The three benchmarks under analysis serve very different purposes; their functionality and main characteristics, as well as their training sets, are therefore described separately.
Image Classification (IC): the typical image recognition task on the popular CIFAR-10 dataset. The dataset collects 60000 \(32 \times 32\) RGB images [24] evenly split into 10 classes, with 50000 and 10000 samples for the train-set and test-set respectively. The adopted ConvNet is taken from the Caffe framework [41] and consists of three convolutional layers interleaved with max-pooling and one fully-connected layer.
Keyword Spotting (KWS): a standard problem in the field of speech recognition. We considered a simplified version of the problem (see Footnote 3). The reference dataset is the Speech Commands Dataset [25]; it counts 65k one-second-long audio samples collected during the repetition of 30 different words by thousands of different people. The goal is to recognize 10 specific keywords, i.e. “Yes”, “No”, “Up”, “Down”, “Left”, “Right”, “On”, “Off”, “Stop”, “Go”, out of the 30 available words; samples that do not fall in these 10 categories are labeled as “unknown”. There is also an additional “silence” class made up of background noise samples (pink noise, white noise, and human-made sounds). The training set and test set collect 56196 and 7518 samples, respectively. The adopted ConvNet is the cnn-one-fstride4 described in [42]; it has two convolutional layers, one max-pooling layer and four fully-connected layers. The ConvNet is fed with the spectrogram of the recorded signal, obtained through the pre-processing pipeline introduced in [42] (extraction of \(time \times frequency = 32\times 40\) inputs without any data augmentation).
Facial Expression Recognition (FER): inferring the emotional state of people from their facial expression. Quite popular in the field of vision reasoning, this task is very challenging as many face images may convey multiple emotions. The reference dataset is the Fer2013 dataset from the Kaggle competition [26]. It collects 32297 \(48 \times 48\) gray-scale facial images split into 7 categories, i.e. “Angry”, “Disgust”, “Fear”, “Happy”, “Sad”, “Surprise”, “Neutral”. The training set counts 28708 examples, while the remaining 3589 are in the test set. The adopted ConvNet (see Footnote 4) consists of nine convolutional layers evenly spaced by three max-pooling layers, and one fully-connected layer.
Each benchmark is powered by a different model whose topology is described in Table 4. Within the same table we also collect additional information: the top-1 classification accuracy achieved with the original 32-bit floating-point model trained without any optimization (Top-1 Acc.); the overall number of MAC instructions for one inference run using 32-bit floating-point representations (#MAC); the number of possible precision configurations, namely the number of possible operating points in the parameter space (#Op. Points).
Concerning the Top-1 accuracy reported in Table 4, the results are consistent with the state-of-the-art. They were obtained with a dedicated training and testing framework integrated into PyTorch, version 0.4.1, with the following settings: 150 training epochs using the Adam algorithm [43]; learning rate 1e−3; linear decay 0.1 every 50 epochs; batch size of 128 samples randomly picked from the training set; non-overlapping testing set and training set.
5.3 Results
Table 5 shows the top-1 prediction accuracy achieved with a coarse per-net precision scaling scheme in which all the layers share the same precision.
The table collects the results for the original 32-bit floating-point model and the four fixed-point precision options made available by the multi-precision arithmetic described in Sect. 3. Note that we do not run any retraining after quantization. This allows storing a single set of weights for any desired precision. Previous works suggest a re-training stage to recover the loss due to quantization, which would imply that each precision is coupled with a different fine-tuned model. What we propose instead is the use of a unique set of weights trained at full precision (i.e. 16-bit for both weights and activations); then, at run-time, data are fetched and processed with the selected precision. This is the key advantage of the proposed multi-precision scheme and the main enabler for adaptive ConvNets.
As reported in the table, the full-precision fixed-point ConvNets (column \(16{\times }16\) Fix) keep almost the same accuracy as the original floating-point model (the maximum relative drop is 0.02% for KWS). The results are in line with previous works and motivate the choice of 16 \(\times \) 16 as the baseline for comparison. Concerning the mixed-precision options, \(8{\times }16\) assigns 8-bit to the input maps (I) and 16-bit to the weights (W); \(16{\times }8\) does the opposite. The \(8{\times }16\) option is by far more accurate than \(16{\times }8\): minimum drop of 0.93% for IC; maximum drop of 2.07% for FER. The half-precision option (column \(8{\times }8\)) shows larger losses: minimum drop of 4.90% for KWS; maximum drop of 10.97% for IC. These numbers suggest the per-net granularity is too coarse for the effective deployment of adaptive ConvNets. Among the four available precision options, a very small set per se, only three are of practical use, i.e. \(16{\times }16\), \(8{\times }16\), \(16{\times }8\). Indeed, when precision is reduced to 8 \(\times \) 8, all three benchmarks show a dramatic quality degradation. For instance, when shifted from 8 \(\times \) 16 to 8 \(\times \) 8, the accuracy drop of IC grows by more than 10\(\times \) (from 0.93% to 10.97%). This calls for a finer precision assignment policy, which is the technique proposed in this work.
A detailed analysis of the results is provided by means of a Pareto analysis, Fig. 4. The plots show the possible operating points in the energy-accuracy space achieved with per-net precision scaling (blue \(\times \)) and the proposed per-layer precision scaling (red \(\bullet \)). Each point comes with a different precision setting. The accuracy drop and the energy savings are normalized with respect to full precision (rightmost \(\times \) marker with 0% accuracy drop). The dotted lines connect the points at the Pareto frontier. As mentioned above, with the per-net granularity only three of the four points lie on the Pareto frontier. Moreover, the shift from one operating point to another is very coarse, with substantial accuracy drops. The advantage of the per-layer scheme is twofold. First, the Pareto curve is denser and hence gives more options for finer control; this aspect is evident in larger ConvNets (e.g. FER). Second, the Pareto curve dominates the per-net solutions, thus enabling larger (or comparable) average energy savings.
Table 6 reports some statistics over the subset of Pareto points, both per-net and per-layer. The column #Op. Points gives the number of Pareto points; column Av. Drop refers to the accuracy drop averaged over the Pareto points; column Av. Savings does the same for the energy savings. For all three benchmarks, the energy-accuracy scaling operated with an optimal per-layer multi-precision assignment ensures optimality and usability in several usage scenarios. Table 6 also shows the average execution time taken by the SA engine to draw a Pareto point, column Av. Exec. Time. Results are collected on a workstation powered by an Intel i7-8700K CPU and an NVIDIA GTX-1080 GPU with CUDA 9.0. As expected, the time gets larger with network complexity. For the largest benchmark (FER) the tool takes 66 min and 18 s.
A viable option to improve performance is to reduce the granularity at which the SA explores the parameter space. This can be achieved by constraining the number of iterations for each explored temperature T (parameter iter in Table 2). A quantitative comparison is given in Fig. 5, whose plot shows the Pareto curves obtained with iter = 1000 (the original value), 500 and 250 for the FER benchmark. The execution time reduces linearly, i.e. (66 min, 18 s) with iter = 1000, (33 min, 35 s) with iter = 500, (16 min, 36 s) with iter = 250, while the quality of results reveals more interesting trends. Whereas it is generally true that a larger iter leads to better absolute numbers, the gain practically fades when considering the relative distance between the obtained curves. With iter = 1000 the average savings across the Pareto points (39.3%) is just 5% larger than that obtained using iter = 500 (34.6%) and iter = 250 (34.4%); both iter = 1000 and iter = 500 collect the same number of Pareto points, 7 overall; only with iter = 250 does the number of Pareto points reduce from 7 to 5. This analysis suggests that for larger ConvNets there is margin for tuning the SA towards a reasonable execution time without much quality degradation.
6 Conclusions
The evolution of ConvNets has been driven by accuracy improvement. Higher accuracy has translated into large-scale network topologies, which turn inference into too expensive a task for low-power, energy-constrained embedded systems. ConvNet compression is therefore an urgent need for the growth of neural computing at the edge. While most of the existing techniques mainly focus on static optimizations, dynamic resource management represents a viable option to further improve energy efficiency. This chapter introduced a practical implementation of adaptive ConvNets. The proposed strategy allows ConvNets to relax their computational effort, and hence their energy consumption, leveraging the accuracy margin typical of non-critical applications. The technique is built upon a low-overhead implementation of dynamic multi-precision arithmetic. The resulting ConvNets are free to move in the energy-accuracy space, achieving better trade-offs. A Pareto analysis conducted on three representative applications (Image Classification, Keyword Spotting, Facial Expression Recognition) quantified the achievable energy savings and suggested potential improvements for the Simulated Annealing (SA) optimization engine. Future works will bring this adaptive strategy to larger ConvNets deployed on real HW implementations.
Notes
- 1. We assume the availability of memories that support both word (N-bit) and halfword (K-bit) accesses [38].
- 2. The precision of O does not impact computation as it only affects the number of memory accesses.
- 3.
- 4. Inspired by https://github.com/JostineHo/mememoji.
References
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.-R., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012)
Xu, X., Ding, Y., Hu, S.X., Niemier, M., Cong, J., et al.: Scaling for edge inference of deep neural networks. Nat. Electron. 1(4), 216 (2018)
Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Lechevallier, Y., Saporta, G. (eds.) Proceedings of COMPSTAT’2010, pp. 177–186. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-7908-2604-3_16
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
Sze, V., Chen, Y.-H., Yang, T.-J., Emer, J.: Efficient processing of deep neural networks: a tutorial and survey. arXiv preprint arXiv:1703.09039 (2017)
Grimaldi, M., Tenace, V., Calimera, A.: Layer-wise compressive training for convolutional neural networks. Future Internet 11(1) (2018). http://www.mdpi.com/1999-5903/11/1/7
Szegedy, C., Liu, C., Jia, Y., Sermanet, P., Reed, S., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Chen, Y.-H., Krishna, T., Emer, J.S., Sze, V.: Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid State Circ. 52(1), 127–138 (2017)
Courbariaux, M., Bengio, Y., David, J.-P.: BinaryConnect: training deep neural networks with binary weights during propagations. In: Advances in Neural Information Processing Systems, pp. 3123–3131 (2015)
Flamand, E., Rossi, D., Conti, F., Loi, I., Pullini, A., et al.: Gap-8: a RISC-V SoC for AI at the edge of the IoT. In: 2018 IEEE 29th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pp. 1–4. IEEE (2018)
Moons, B., Verhelst, M.: A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets. In: IEEE Symposium on VLSI Circuits (VLSI-Circuits), pp. 1–2. IEEE (2016)
Albericio, J., Delmás, A., Judd, P., Sharify, S., O’Leary, G., et al.: Bit-pragmatic deep neural network computing. In: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 382–394. ACM (2017)
Moons, B., De Brabandere, B., Van Gool, L., Verhelst, M.: Energy-efficient ConvNets through approximate computing. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–8. IEEE (2016)
Shafique, M., Hafiz, R., Javed, M.U., Abbas, S., Sekanina, L.: Adaptive and energy-efficient architectures for machine learning: challenges, opportunities, and research roadmap. In: 2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 627–632. IEEE (2017)
Alioto, M., De, V., Marongiu, A.: Energy-quality scalable integrated circuits and systems: continuing energy scaling in the twilight of moore’s law. IEEE J. Emerg. Sel. Top. Circuits Syst. 8(4), 653–678 (2018)
Peluso, V., Calimera, A.: Weak-MAC: arithmetic relaxation for dynamic energy-accuracy scaling in ConvNets. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. IEEE (2018)
Peluso, V., Calimera, A.: Energy-driven precision scaling for fixed-point ConvNets. In: 2018 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), pp. 1–6. IEEE (2018)
Lai, L., Suda, N.: Enabling deep learning at the IoT edge. In: Proceedings of the International Conference on Computer-Aided Design, p. 135. ACM (2018)
Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA 2017, pp. 1–12. ACM, New York (2017). http://doi.acm.org/10.1145/3079856.3080246
Moons, B., Verhelst, M.: An energy-efficient precision-scalable ConvNet processor in 40-nm CMOS. IEEE J. Solid State Circuits 52(4), 903–914 (2017)
Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, Citeseer (2009)
Warden, P.: Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209 (2018)
Challenges in representation learning: facial expression recognition challenge. http://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge
Andri, R., Cavigelli, L., Rossi, D., Benini, L.: YodaNN: an architecture for ultra-low power binary-weight CNN acceleration. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 37, 48–60 (2017)
Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., et al.: Recent advances in convolutional neural networks. Pattern Recogn. (2017). http://www.sciencedirect.com/science/article/pii/S0031320317304120
Yang, T.J., Chen, Y.H., Sze, V.: Designing energy-efficient convolutional neural networks using energy-aware pruning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6071–6079, July 2017
Fleuret, F., Geman, D.: Coarse-to-fine face detection. Int. J. Comput. Vis. 41(1), 85–107 (2001)
Panda, P., Sengupta, A., Roy, K.: Conditional deep learning for energy-efficient and enhanced pattern recognition. In: Proceedings of the 2016 Conference on Design, Automation & Test in Europe, DATE 2016, pp. 475–480. EDA Consortium, San Jose (2016). http://dl.acm.org/citation.cfm?id=2971808.2971918
Yan, Z., Zhang, H., Piramuthu, R., Jagadeesh, V., DeCoste, D., et al.: HD-CNN: hierarchical deep convolutional neural networks for large scale visual recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2740–2748 (2015)
Neshatpour, K., Behnia, F., Homayoun, H., Sasan, A.: ICNN: an iterative implementation of convolutional neural networks to enable energy and computational complexity aware dynamic approximation. In: Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 551–556. IEEE (2018)
Moons, B., Uytterhoeven, R., Dehaene, W., Verhelst, M.: 14.5 Envision: a 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI. In: IEEE International Solid-State Circuits Conference (ISSCC), pp. 246–247. IEEE (2017)
Peluso, V., Calimera, A.: Scalable-effort ConvNets for multilevel classification. In: 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8. IEEE (2018)
Lin, D., Talathi, S., Annapureddy, S.: Fixed point quantization of deep convolutional networks. In: International Conference on Machine Learning, pp. 2849–2858 (2016)
Shan, L., Zhang, M., Deng, L., Gong, G.: A dynamic multi-precision fixed-point data quantization strategy for convolutional neural network. In: Xu, W., Xiao, L., Li, J., Zhang, C., Zhu, Z. (eds.) NCCET 2016. CCIS, vol. 666, pp. 102–111. Springer, Singapore (2016). https://doi.org/10.1007/978-981-10-3159-5_10
Jahnke, S.R., Hamakawa, H.: Micro-controller direct memory access (DMA) operation with adjustable word size transfers and address alignment/incrementing. US Patent 6,816,921, 9 November 2004
Courbariaux, M., Bengio, Y., David, J.-P.: Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024 (2014)
Desoli, G., Chawla, N., Boesch, T., Singh, S.-P., Guidetti, E.: 14.1 A 2.9 TOPS/W deep convolutional neural network SoC in FD-SOI 28 nm for intelligent embedded systems. In: 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 238–239. IEEE (2017)
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., et al.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675–678. ACM (2014)
Sainath, T.N., Vinyals, O., Senior, A., Sak, H.: Convolutional, long short-term memory, fully connected deep neural networks. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4580–4584. IEEE (2015)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)