Abstract
With recent advances in computer vision, an increasing amount of image and video content is consumed by machines rather than humans. However, existing image and video compression schemes are designed primarily for human vision and are not optimized for machine vision. In this paper, we propose a saliency-guided learned image compression scheme for machines, taking object detection as an example task. To identify regions that are salient for machine vision, a saliency map is computed for each detected object using an existing black-box explanation method for neural networks, and the maps for multiple objects are carefully merged into one. On top of a neural-network-based image codec, we design a bitrate allocation scheme that prunes the latent representation of the image according to the saliency map. During end-to-end training of the codec, both pixel fidelity and machine vision fidelity are used for performance evaluation, where the degradation in detection accuracy is measured without ground-truth annotations. Experimental results demonstrate that the proposed scheme achieves up to a 14.1% bitrate reduction at the same detection accuracy compared with the baseline learned image codec.
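The two core ideas of the abstract, merging per-object saliency maps and pruning the latent representation by saliency, can be sketched as follows. This is a minimal illustration, not the paper's actual method: the max-based merge, the `keep_ratio` parameter, and the hard zeroing of low-saliency latents are all simplifying assumptions, and the saliency map is assumed to be already resized to the latent resolution.

```python
import numpy as np

def merge_saliency_maps(maps):
    """Merge per-object saliency maps with an element-wise maximum,
    so a region is kept if it is salient for any detected object.
    (A simple stand-in for the merging strategy described in the paper.)"""
    return np.maximum.reduce(maps)

def prune_latents(latents, saliency, keep_ratio=0.5):
    """Zero out latent elements at the least salient spatial positions.

    latents:    (C, H, W) latent representation from the learned encoder
    saliency:   (H, W) merged saliency map at the latent resolution
    keep_ratio: fraction of spatial positions whose latents survive
    """
    # Threshold chosen so that roughly keep_ratio of positions are retained.
    thresh = np.quantile(saliency, 1.0 - keep_ratio)
    mask = (saliency >= thresh).astype(latents.dtype)   # (H, W) binary mask
    return latents * mask[None, :, :]                    # broadcast over channels
```

In an actual codec the pruned latents would then be entropy-coded, so zeroed positions cost very few bits, which is where the bitrate saving comes from.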
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Xiong, H., Xu, Y. (2023). Saliency-Guided Learned Image Compression for Object Detection. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1791. Springer, Singapore. https://doi.org/10.1007/978-981-99-1639-9_27
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1638-2
Online ISBN: 978-981-99-1639-9
eBook Packages: Computer Science (R0)