ABSTRACT
DNN-based object detection operates on large data volumes, fetching both images and DNN weights, which leads to high power and bandwidth demands. Techniques that mitigate those demands, such as weight clustering, are usually studied on examples far smaller than target applications, which makes it difficult to determine the best tradeoff to implement. This paper performs an at-scale assessment, using a real-life application, of weight clustering for a DNN-based object detection system, You Only Look Once (YOLO), considering real driving videos. Our case study shows that an Output Stationary accelerator (e.g., a systolic array) restricting weights to between 32 (5-bit) and 256 (8-bit) distinct values preserves the accuracy of the original 32-bit weights of YOLO while decreasing bandwidth requirements to around 30%-40% of the original bandwidth and overall energy consumption to around 45% of the original consumption. Overall, our case study provides key insights on which to base design decisions for an accelerator for camera-based object detection.
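The weight clustering the abstract describes restricts all weights of a layer to a small shared codebook, so each weight is stored as a short index (e.g., 5 bits for 32 clusters) plus a tiny table of centroid values. A minimal k-means sketch of this idea is shown below; `cluster_weights` is a hypothetical helper written for illustration, not the authors' implementation.

```python
import numpy as np

def cluster_weights(weights, n_clusters=32, n_iters=20):
    """Quantize a weight tensor to n_clusters shared values via 1-D k-means.
    Returns per-weight cluster indices and the centroid codebook."""
    flat = weights.ravel()
    # Linear initialization over the weight range (as used in Deep Compression).
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(n_iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its assigned weights.
        for k in range(n_clusters):
            members = flat[idx == k]
            if members.size:
                centroids[k] = members.mean()
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    # 5-bit indices (for 32 clusters) fit in uint8 storage plus a small codebook,
    # replacing 32-bit floats per weight.
    return idx.reshape(weights.shape).astype(np.uint8), centroids

w = np.random.default_rng(1).normal(size=(64, 64)).astype(np.float32)
indices, codebook = cluster_weights(w, n_clusters=32)
reconstructed = codebook[indices]  # at most 32 distinct weight values
```

Bandwidth savings follow directly from the index width: fetching 5-bit indices instead of 32-bit floats cuts weight traffic by roughly 6x before any further compression.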