Scalable image coding with enhancement features for human and machine

Wu, Ying; An, Ping; Yang, Chao; Huang, XinPeng

doi:10.1007/s00530-024-01279-y

Scalable image coding with enhancement features for human and machine

Regular Paper
Published: 10 March 2024

Volume 30, article number 77, (2024)
Cite this article

Multimedia Systems Aims and scope Submit manuscript

Ying Wu¹,
Ping An¹,
Chao Yang¹ &
…
XinPeng Huang¹

We’re sorry, something doesn't seem to be working properly.

Please try refreshing the page. If that doesn't work, please contact support so we can address the problem.

Abstract

The past decade has seen significant advancements in computer vision technologies, resulting in an increasing consumption of images and videos by both human and machine. Although machines are usually the primary consumers, there are many applications where human involvement is indispensable. In this paper, we propose a novel image coding technique that targets machines while ensuring compatibility with human consumption. The proposed codec generates two distinct bitstreams: the reconstruction feature bitstreams and the enhancement feature bitstreams. The former are designed to facilitate image reconstruction for human consumption and vision tasks for machine consumption, while the latter are optimized for high-quality vision tasks. To achieve this goal, we introduce the Mask Multilayer Fusion Encoder (MMFE), which integrates multi-scale visual prior masks into partial channel features of the encoder. Additionally, due to the significant distortion of features at low bitrates, we propose a Local Feature Fusion Module (LFFM) that aggregates semantic information from the reconstruction features to obtain enhancement features, so as to improve the performance of vision tasks. Our experimental results demonstrate that our scalable codec provides significant bitrate savings of 26–77\(\%\) on machine vision tasks compared to state-of-the-art image codecs, while maintaining comparable performance in terms of image reconstruction. Our proposed codec represents a significant advancement in the field of image coding, with the potential to improve both human and machine consumption of visual media.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

End-to-end optimized image compression with the frequency-oriented transform

Article 07 February 2024

Just Recognizable Distortion for Machine Vision Oriented Image and Video Coding

Article 13 August 2021

SDNet: A Versatile Squeeze-and-Decomposition Network for Real-Time Image Fusion

Article 30 July 2021

Data availability

The data underlying this article is available from the corresponding author on reasonable request.

References

Wallace, G.K.: The jpeg still picture compression standard. IEEE Trans. Consum. Electron. 38(1), 18–34 (1992)
Article Google Scholar
Lee, D.T.: Jpeg 2000: retrospective and new developments. Proc. IEEE 93(1), 32–41 (2005). https://doi.org/10.1109/JPROC.2004.839613
Article Google Scholar
Li, L., Wei, S.L.: Webp: A new image compression format based on vp8 encoding. Microcontrollers Embedded Syst 3(1), 40–43 (2012)
Google Scholar
Wiegand, T., Sullivan, G.J., Bjontegaard, G., Luthra, A.: Overview of the h.264/avc video coding standard. IEEE Trans. Circuits Syst. Video Technol. 13(7), 560–576 (2003)
Article Google Scholar
Sullivan, G.J., Ohm, J.R., Han, W.J., Wiegand, T.: Overview of the high efficiency video coding (hevc) standard. IEEE Trans. Circuits Syst. Video Technol. 22(12), 1649–1668 (2012). https://doi.org/10.1109/TCSVT.2012.2221191
Article Google Scholar
Bross, B., Wang, Y.K., Ye, Y., et al.: Overview of the versatile video coding (vvc) standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 31(10), 3736–3764 (2021). https://doi.org/10.1109/TCSVT.2021.3101953
Article Google Scholar
Toderici, G., O’Malley, S.M., Hwang, S.J., et al .: Variable rate image compression with recurrent neural networks. ArXiv preprint at (2015) arXiv: org/abs/1511.06085
Toderici, G., Vincent, D., Johnston, N., et al.: Full resolution image compression with recurrent neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 5435-5443 (2017)
Ballé, J., Laparra, V., Simoncelli, E.: End-to-end optimized image compression. In: 5th International Conference on Learning Representations, pp. 1-27 (2017)
Ballé, J., Minnen, D., Singh, S., Hwang, S.J., Johnston, N.: Variational image compression with a scale hyperprior. In: 6th International Conference on Learning Representations, pp. 1-10 (2018)
Minnen, D., Ballé, J., Toderici, G.: Joint autoregressive and hierarchical priors for learned image compression. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 10771-10780 (2018)
Cheng, Z.X., Sun, H.M., Takeuchi, M., Katto, J.: Learned image compression with discretized gaussian mixture likelihoods and attention modules. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 7939-7948 (2020)
Agustsson, E., Tschannen, M., Mentzer, F., Timofte, R., VanGool, L.: Generative adversarial networks for extreme learned image compression. ArXiv preprint at (2018) arXiv: org/abs/1804.02958
Chen, F.D., Xu, Y.M., Wang, L.: Two-stage octave residual network for end-to-end image compression. In: 36th AAAI Conference on Artificial Intelligence, pp. 3922-3929 (2022)
Kim, J.H., Heo, B., Lee, J.S.: Joint global and local hierarchical priors for learned mage compression. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 5982-5991. (2022) https://doi.org/10.1109/CVPR52688.2022.00590
Zou, R.J., Song, C.F., Zhang, Z.X.: The devil is in the details: Window-based attention for image compression. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 17471-17480. (2022) https://doi.org/10.1109/CVPR52688.2022.01697
Zhu, X.S., Song, J.K., Gao, L.L., Zheng, F., Shen, H.T.: Unified multivariate gaussian mixture for efficient neural image compression. IEEE Conf. Comput. Vis. Patt. Recogn. (2022). https://doi.org/10.1109/CVPR52688.2022.01709
Article Google Scholar
He, D.L., Yang, Z.M., Peng, W.K., Ma, R., Qin, H.W., Wang, Y.: Elic: Efficient learned image compression with unevenly grouped space-channel contextual. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 5708-5717. (2022) https://doi.org/10.1109/CVPR52688.2022.00563
Redmon, J., Divvala, S., Girshick, R., Nah, S., Farhadi, A.: You only look once: Unified, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 779-788. (2016) https://doi.org/10.1109/CVPR.2016.91
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y.: Ssd: Single shot multibox detector. In: 14th European Conference on Computer Vision. pp. 21-37. (2016) https://doi.org/10.1007/978-3-319-46448-0_2
Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. ArXiv preprint at (2018) arXiv org/abs/1804.02767
Ren, S.Q., He, K.M., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2016)
Article Google Scholar
He, K.M., Gkioxari, G., Dollar, P., Girshick, R.: Mask r-cnn. In: 16th IEEE International Conference on Computer Vision. pp. 2980-2988. (2017) https://doi.org/10.1109/ICCV.2017.322
Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. ArXiv preprint at (2017) arXiv: org/abs/1706.05587
Le, N., Zhang, H.L., Cricri, F., Ghaznavi-Youvalari, R., Rahtu, E.: Image coding for machines: An end-to-end learned approach. IEEE Int. Conf. Acoust. Speech Signal Process (2021). https://doi.org/10.1109/ICASSP39728.2021.9414465
Article Google Scholar
Gao, C.S., Liu, D., Li, L., Wu, F.: Towards task-generic image compression: A study of semantics-oriented metrics. IEEE Trans. Multimed. 25, 721–735 (2023). https://doi.org/10.1109/TMM.2021.3130754
Article Google Scholar
Kristian, F., Fabian, B., André, K.: Boosting neural image compression for machines using latent space masking. IEEE Trans. Circuits Syst. Video Technol. (2022). https://doi.org/10.1109/TCSVT.2022.3195322
Article Google Scholar
Feng, R.Y., Jin, X., Guo, Z.Y., Feng, R.S., Gao, Y.X., He, T.Y.: Image coding for machines with omnipotent feature learning. In: 17th European Conference on Computer Vision. pp. 510-528 (2022)
Mei, Y.X., Li, F., Li, L., Li, Z.: Learn a compression for objection detection - vae with a bridge. In: IEEE International Conference on Visual Communications and Image Processing. (2021) https://doi.org/10.1109/VCIP53242.2021.9675387
Wang, S.R., Wang, Z., Wang, S.Q., Ye, Y.: End-to-end compression towards machine vision: Network architecture design and optimization. IEEE Open J. Circuits Syst. 2, 675–685 (2021). https://doi.org/10.1109/OJCAS.2021.3126061
Article Google Scholar
Luo, S.H., Yang, Y.Z., Yin, Y.L., Shen, C.C., Zhao, .Y, Song, M.L. : Deepsic: Deep semantic image compression. In: 25th International Conference on Neural Information Processing, pp. 96-106 (2018)
Codevilla, F., Simard, J.G., Goroshin, R., Pal, C.: Learned image compression for machine perception. (2021)ArXiv preprint at arXiv: org/abs/2111.02249
Liu, L.F., Chen, T., Liu, H.J., Pu, S.L., Wang, L., Shen, Q.: 2c-net: integrate image compression and classification via deep neural network. Multimed. Syst. (2022). https://doi.org/10.1007/s00530-022-01026-1
Article Google Scholar
Ma, S.W., Zhang, X., Wang, S.Q., Zhang, X.F., Jia, C.M., Wang, S.S.: Joint feature and texture coding: Toward smart video representation via front-end intelligence. IEEE Trans. Circuits Syst. Video Technol. 29(10), 3095–3105 (2018). https://doi.org/10.1109/TCSVT.2018.2873102
Article Google Scholar
Wang, S.R., Wang, S.Q., Yang, W.H., et al.: Towards analysis-friendly face representation with scalable feature and texture compression. IEEE Trans. Multimed. 24, 3169–3181 (2021). https://doi.org/10.1109/TMM.2021.3094300
Article Google Scholar
Yan, N., Gao, C., Liu, D., Li, H., Li, L., Wu, F.: Sssic: Semantics-to-signal scalable image coding with learned structural representations. IEEE Trans. Image Process. 30, 8939–8954 (2021). https://doi.org/10.1109/TIP.2021.3121131
Article Google Scholar
Choi, H., Bajic, I.V.: Scalable image coding for humans and machines. IEEE Trans. Image Process. 31, 2739–2754 (2022). https://doi.org/10.1109/TIP.2022.3160602
Article Google Scholar
Wang, Z.X., Li, F., Xu, J., Cosman, P.C.: Human-machine interaction-oriented image coding for resource-constrained visual monitoring in iot. IEEE Internet Things J. 9(17), 16181–16195 (2022)
Article Google Scholar
Wang, Z., Simoncelli, E..P., Bovik, A..C.: Multiscale structural similarity for image quality assessment. In: 37th Asilomar Conference on Signals. Syst. Comput. 2, 1398–1402 (2003)
Google Scholar
Duan, L.Y., Chandrasekhar, V., Chen, J., Lin, J., Wang, Z., et al.: Overview of the mpeg-cdvs standard. IEEE Trans. Image Process. 25(1), 179–194 (2015)
Article MathSciNet Google Scholar
Duan, L.Y., Lou, Y., Bai, Y., Huang, T., Gao, W., et al.: Compact descriptors for video analysis: the emerging mpeg standard. IEEE Multimed. 26(2), 44–54 (2018)
Article Google Scholar
Agustsson, E., Timofte, R.: Ntire 2017 challenge on single image super-resolution: Dataset and study. In: 30th IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 1122-1131. (2017) https://doi.org/10.1109/CVPRW.2017.150
Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: 30th IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 1132-1140. (2017) https://doi.org/10.1109/CVPRW.2017.151
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. Int. J. Comput. Vision 88(2), 303–338 (2010)
Article Google Scholar
(2010) The kodak photocd dataset. http://r0k.us/graphics/kodak/
(2021) Vvc reference software (vtm 12.3). https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware VTM/-/tags/VTM-12.3, Accessed 10 December 2021
G. B.: Calculation of average psnr differences between rd-curves. Accessed April 2021, VCEG-M33,(2001)https://www.itu.int/wftp3/av-arch/video-site/0104Aus/VCEG-M33.doc,

Download references

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China [Grants 62071287, 62020106011, 61901252, 62171002], Science and Technology Commission of Shanghai Municipality under Grant 20DZ2290100 and Grant 22ZR1424300.

Author information

Authors and Affiliations

Key Laboratory of Specialty Fiber Optics and Optical Access Networks, School of Communication and Information Engineering, Shanghai University, Shanghai, 200444, China
Ying Wu, Ping An, Chao Yang & XinPeng Huang

Authors

Ying Wu
View author publications
You can also search for this author in PubMed Google Scholar
Ping An
View author publications
You can also search for this author in PubMed Google Scholar
Chao Yang
View author publications
You can also search for this author in PubMed Google Scholar
XinPeng Huang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Ying Wu wrote the main manuscript text. All authors reviewed the manuscript.

Corresponding author

Correspondence to Ping An.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Communicated by Q. Shen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Wu, Y., An, P., Yang, C. et al. Scalable image coding with enhancement features for human and machine. Multimedia Systems 30, 77 (2024). https://doi.org/10.1007/s00530-024-01279-y

Download citation

Received: 29 June 2023
Accepted: 31 January 2024
Published: 10 March 2024
DOI: https://doi.org/10.1007/s00530-024-01279-y

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Scalable image coding with enhancement features for human and machine

Abstract

Access this article

Similar content being viewed by others

End-to-end optimized image compression with the frequency-oriented transform

Just Recognizable Distortion for Machine Vision Oriented Image and Video Coding

SDNet: A Versatile Squeeze-and-Decomposition Network for Real-Time Image Fusion

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Scalable image coding with enhancement features for human and machine

Abstract

Access this article

Similar content being viewed by others

End-to-end optimized image compression with the frequency-oriented transform

Just Recognizable Distortion for Machine Vision Oriented Image and Video Coding

SDNet: A Versatile Squeeze-and-Decomposition Network for Real-Time Image Fusion

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation