A systematic study on benchmarking AI inference accelerators

  • Regular Paper
  • Published in CCF Transactions on High Performance Computing

Abstract

AI inference accelerators have drawn extensive attention, but none of the previous work performs a holistic and systematic benchmarking of them. First, an end-to-end AI inference pipeline consists of six stages spanning both the host and the accelerator, whereas previous work mainly evaluates hardware execution performance, which is only one stage on the accelerator. Second, there is a lack of systematic evaluation of different optimizations on AI inference accelerators. Using six representative AI workloads and a typical AI inference accelerator, Diannao, which is based on the Cambricon ISA, we implement five frequently used AI inference optimizations as user-configurable hyper-parameters. We explore the optimization space by sweeping the hyper-parameters and quantifying each optimization's effect on the chosen metrics. We also provide cross-platform comparisons between Diannao and traditional platforms (Intel CPUs and Nvidia GPUs). Our evaluation yields several new observations and insights, which shed light on a comprehensive understanding of AI inference accelerators' performance and inform the co-design of upper-level optimizations and the underlying hardware architecture.


Notes

  1. The common pre-processing includes image decoding, image resizing, image padding, image cropping, channel arrangement, and normalization. Different DNN workloads adopt different pre-processing techniques according to their requirements.
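
As an illustration, the following Python sketch covers these common steps; the crop size, normalization constants, and channel ordering are generic ImageNet-style defaults assumed for the example, not values prescribed by the benchmarked workloads.

```python
import numpy as np
from PIL import Image

# Illustrative pre-processing pipeline (ImageNet-style defaults assumed).
def preprocess(path, size=224,
               mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    img = Image.open(path).convert("RGB")                    # image decoding
    img = img.resize((256, 256))                             # image resizing
    left = (256 - size) // 2
    img = img.crop((left, left, left + size, left + size))   # image cropping
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = (x - np.array(mean)) / np.array(std)                 # normalization
    x = x.transpose(2, 0, 1)                                 # channel arrangement: HWC -> CHW
    return x[np.newaxis, ...]                                # add batch dimension
```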

References

  • AnandTech: https://www.anandtech.com/show/12815/cambricon-makers-of-huaweis-kirin-npu-ip-build-a-big-ai-chip-and-pcie-card, (2018)

  • Cambricon: Cambricon cnrt. http://www.cambricon.com/index.php?m=content&c=index&a=lists&catid=71

  • Cambricon MLU100, http://www.cambricon.com/index.php?c=page&id=20

  • Chen, W. et al.: Compressing neural networks with the hashing trick. In: Proceedings of the International Conference on Machine Learning, pp. 2285–2294 (2015)

  • Chen, T., et al.: Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM ASPLOS 49(4), 269–284 (2014)

  • Courbariaux M et al. (2015) BinaryConnect: Training deep neural networks with binary weights during propagations. In: NeurIPS, pp. 3123–3131

  • DeepBench, https://github.com/baidu-research/DeepBench

  • Denil, M., et al.: Predicting parameters in deep learning. Adv Neural Inform Process Syst 26, 2148–2156 (2013)

  • Dean J et al. (2012) Large scale distributed deep networks. In: Advances in Neural Information Processing Systems 25. Curran Associates, Inc., pp. 1223–1231

  • Deng J, et al. (2009) Imagenet: A large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition, pp 248–255

  • Everingham, M. et al. (2012) The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results

  • Google: Edge-tpu. https://cloud.google.com/edge-tpu

  • Google: What Makes TPU Fine Tuned to Deep Learning. https://cloud.google.com/blog/products/ai-machine-learning/what-makes-tpus-fine-tuned-for-deep-learning

  • Gray J (1993) Database and transaction processing performance handbook

  • Hao T et al. (2018) Edge AIBench: towards comprehensive end-to-end edge computing benchmarking. International Symposium on Benchmarking, Measuring and Optimization, Springer, Cham, pp. 23-30

  • Han S, Mao H, Dally WJ (2016) Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In: ICLR

  • Hennessy, J.L., Patterson, D.A.: A new golden age for computer architecture. Commun ACM 62(2), 48–60 (2019)

  • He K et al. (2015) Deep residual learning for image recognition. CoRR, vol. abs/1512.03385

  • Huawei: Huawei Ascend 310 Accelerator. http://ascend.huawei.com (2020)

  • Huang G et al. (2016) Densely connected convolutional networks. CoRR, vol. abs/1608.06993

  • Howard AG et al. (2017) Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, vol. abs/1704.04861

  • Iandola FN et al. (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1 MB model size. CoRR, vol. abs/1602.07360

  • Jain, S., et al.: Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks. Proc Mach Learn Syst 2, 112–128 (2020)

  • Jiang Z et al. (2021) HPC AI500 v2.0: The methodology, tools, and metrics for benchmarking HPC AI systems. In: IEEE CLUSTER

  • Jouppi, N.P. et al.: In-datacenter performance analysis of a tensor processing unit. In: ACM/IEEE ISCA. IEEE, pp. 1–12 (2017)

  • Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105

  • Lee D, Kim B (2018) Retraining-based iterative weight quantization for deep neural networks. CoRR, vol. abs/1805.11233

  • Li J et al.: Characterizing the I/O pipeline in the deployment of CNNs on commercial accelerators. In: IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking. IEEE, pp. 137–144 (2020)

  • Liu, S., et al.: Cambricon: an instruction set architecture for neural networks. ACM/IEEE ISCA 44(3), 393–405 (2016)

  • Liu W et al. (2016) SSD: single shot multibox detector. [Online]. http://arxiv.org/abs/1512.02325

  • Luo C et al. (2018) AIoT Bench: towards comprehensive benchmarking mobile and embedded device intelligence. In: International Symposium on Benchmarking, Measuring and Optimization. Springer, Cham, pp. 31–35

  • Ma X et al. (2019) PCONV: the missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices. CoRR, vol. abs/1909.05073

  • Mishra R et al. (2020) A Survey on Deep Neural Network Compression: Challenges, Overview, and Solutions. CoRR, vol. abs/2010.03954

  • Mittal D et al. (2018) Recovering from random pruning: On the plasticity of deep convolutional neural networks. CoRR, vol. abs/1801.10447

  • Niu W et al. (2020) Patdnn: Achieving real-time dnn execution on mobile devices with pattern-based weight pruning. In: ACM ASPLOS, pp. 907–922

  • Reddi VJ et al. (2020) Mlperf inference benchmark. In: ACM/IEEE ISCA, pp. 446–459

  • Sze, V., et al.: How to evaluate deep neural network processors: TOPS/W (alone) considered harmful. IEEE Solid-State Circ Mag 12(3), 28–41 (2020)

  • Tang F et al. (2021) AIBench Training: Balanced Industry-Standard AI Training Benchmarking. In: IEEE ISPASS. IEEE Computer Society

  • Tao, J.-H., et al.: BenchIP: Benchmarking intelligence processors. J Comput Sci Technol 33(1), 1–23 (2018)

  • Turner J et al. (2018) Characterising across-stack optimisations for deep convolutional neural networks. In: IISWC, pp 101–110

  • Wang Y et al. A systematic methodology for analysis of deep learning hardware and software platforms. In: Proceedings of Machine Learning and Systems

  • Williams, S., et al.: Roofline: an insightful visual performance model for multicore architectures. Commun ACM 52(4), 65–76 (2009)

  • Zhao, R., et al.: Improving neural network quantization without retraining using outlier channel splitting. Ser Proc Mach Learn Res 97, 7543–7552 (2019). (PMLR)

  • Zhou A, Yao A, Guo Y, Xu L, Chen Y (2017) Incremental network quantization: Towards lossless CNNs with low-precision weights. CoRR, vol. abs/1702.03044. [Online]. http://arxiv.org/abs/1702.03044

Author information

Corresponding author

Correspondence to Zihan Jiang.

Appendix A

A.1 Implementation Details on Diannao

Considering the diversity of network architectures, there is no one-size-fits-all algorithm for quantization and pruning. Research Jain et al. (2020) shows that some networks need a tailored algorithm or a retraining-based method to compensate for the drop in model quality. Studying more general pruning and quantization algorithms Mishra et al. (2020) is still an open problem and is beyond the scope of this paper. Here we briefly introduce our implementations of pruning and quantization.

A.1.1 Quantization

Diannao is equipped with a large number of INT8-based ALUs. We implement INT8 quantization, which means that the model parameters are stored as 8-bit fixed-point integers instead of the original floating-point numbers (Diannao uses FP16 as its floating-point format). These model parameters usually consist of three parts: weights, activations, and biases. Since biases account for only a small proportion of the overall parameters, we quantize only the weights and activations. The computation with quantized parameters can be summarized by the following formula:

$$\begin{aligned} real\_number = stored\_integers * scaling\_factor \end{aligned}$$
(A1)

where \(real\_number\) refers to the parameters before quantization, \(stored\_integers\) refers to the quantized parameters stored as 8-bit integers, and \(scaling\_factor\) prevents overflow or underflow when computing the lower-precision results.
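
As a concrete illustration, the following NumPy sketch implements this scheme; the symmetric per-tensor scaling and the function names are simplifying assumptions of ours and do not reproduce Diannao's runtime implementation.

```python
import numpy as np

def quantize_int8(x):
    """Map floating-point values to INT8 so that x ~ stored_integers * scaling_factor (Eq. A1)."""
    # Choose the scaling factor so the largest magnitude maps to 127,
    # preventing overflow of the 8-bit range.
    scaling_factor = max(float(np.abs(x).max()), 1e-8) / 127.0
    stored_integers = np.clip(np.round(x / scaling_factor), -128, 127).astype(np.int8)
    return stored_integers, scaling_factor

def dequantize(stored_integers, scaling_factor):
    """Recover the approximate real numbers from the stored integers."""
    return stored_integers.astype(np.float32) * scaling_factor

# Example: quantize a random weight tensor and check the reconstruction error.
weights = np.random.randn(64, 128).astype(np.float32)
q, s = quantize_int8(weights)
print("max abs error:", np.abs(dequantize(q, s) - weights).max())
```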

A.1.2 Weight Pruning

Inside Diannao, there are also a large number of sparse computing units. We prune only the weights of convolutional and fully-connected layers, because the weights of these two types of layers account for most of the parameters of the entire model. Sparsity is a decimal between 0 and 1, referring to the percentage of zero-valued weights in the model. We use sparsity to reflect the effect of the weight pruning optimization. Motivated by Deep Compression Han et al. (2016), in each convolutional and fully-connected layer we sort the weights and then zero out those with the lowest magnitudes according to the sparsity. To show the effect of weight pruning on model quality and inference throughput, we gradually increase the sparsity from 0.01 to 0.9 in steps of 0.01 while keeping the other optimizations fixed.
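
A minimal sketch of this magnitude-based pruning in NumPy follows; the helper name and the per-tensor threshold are our own illustrative choices rather than the exact procedure used by the Diannao toolchain.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitudes."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]        # k-th smallest magnitude
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

# Sweep sparsity from 0.01 to 0.9 in steps of 0.01, as in the evaluation.
w = np.random.randn(256, 256).astype(np.float32)
for sparsity in np.arange(0.01, 0.91, 0.01):
    pruned = prune_by_magnitude(w, float(sparsity))
    # ... load `pruned` into the model, then measure quality and throughput ...
```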

A.2 An Example of Efficient Network Deployment

Table 6 presents the best configuration candidates in terms of end-to-end throughput. We obtain these configurations by looking up the database (discussed in Sect. 6.2). To illustrate the trade-off between throughput and model quality, we present four configurations for each workload. A pre-defined target quality serves as the minimum requirement for model quality.

For DenseNet121, the target quality is 0.73. The model achieves the highest end-to-end throughput with the configuration (Sparse, INT8, 1, 4, 1, 8); however, its accuracy does not meet the requirement, so this configuration is discarded. The configuration (Dense, FP16, 1, 4, 1, 8), which reaches the second-highest end-to-end throughput, is then chosen because it satisfies the accuracy requirement. We follow the same method to select the best configuration for the remaining workloads.
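
This selection reduces to a simple lookup over the recorded configurations. The sketch below is purely illustrative: the record layout and all throughput and accuracy numbers are placeholders, not measured results.

```python
def select_best_config(records, target_quality):
    """Pick the highest-throughput configuration whose quality meets the target."""
    for config, throughput, quality in sorted(records, key=lambda r: r[1], reverse=True):
        if quality >= target_quality:
            return config, throughput, quality
    return None  # no configuration satisfies the quality requirement

# DenseNet121-style example with placeholder numbers: the fastest (Sparse, INT8)
# candidate misses the 0.73 target, so the dense FP16 configuration is chosen.
records = [
    (("Sparse", "INT8", 1, 4, 1, 8), 100.0, 0.70),   # placeholder values
    (("Dense",  "FP16", 1, 4, 1, 8),  90.0, 0.74),   # placeholder values
]
print(select_best_config(records, target_quality=0.73))
```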

Table 6 Best optimization configurations in terms of end-to-end throughput for each DNN on Diannao

Cite this article

Jiang, Z., Li, J., Liu, F. et al. A systematic study on benchmarking AI inference accelerators. CCF Trans. HPC 4, 87–103 (2022). https://doi.org/10.1007/s42514-022-00105-z
