DOI: 10.1145/3581783.3612118

Visual Redundancy Removal of Composite Images via Multimodal Learning

Published: 27 October 2023

Abstract

Composite images are generated by combining two or more different photographs, and their content is typically heterogeneous. However, existing unimodal visual redundancy prediction methods struggle to accurately model the complex characteristics of this image type. In this paper, we investigate visual redundancy modeling of composite images from an end-to-end multimodal perspective, covering four cross-media modalities (i.e., text, brightness, color, and segmentation). Specifically, we design a two-stage cross-modal alignment module based on a self-attention mechanism and contrastive learning, and develop a fusion module based on a cross-modal augmentation paradigm. Furthermore, we establish the first cross-media visual redundancy dataset for composite images, which contains 413 groups of cross-modal data and 13,629 realistic compression distortions generated using the latest Versatile Video Coding (VVC) standard. Experimental results on nine benchmark datasets demonstrate the effectiveness of our method, which outperforms seven representative methods.
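The abstract states that the second stage of the alignment module uses contrastive learning to align the four modality embeddings. As a rough illustration only (this is not the authors' code, and all function names, the batch construction, and the temperature value are assumptions), the sketch below computes an InfoNCE-style contrastive loss in pure Python: matching cross-modal embedding pairs are pulled together while mismatched in-batch pairs act as negatives.

```python
# Illustrative sketch of an InfoNCE-style contrastive alignment loss,
# as commonly used when aligning embeddings from two modalities.
# Not the paper's implementation; names and temperature are assumptions.
import math


def normalize(v):
    """L2-normalize a non-zero vector so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Average cross-entropy of matching anchor i to positive i,
    using all other positives in the batch as negatives."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / len(a)
```

Under this formulation, a batch whose anchors and positives are correctly paired yields a near-zero loss, while shuffled (misaligned) pairs yield a large one, which is what drives the embeddings of corresponding cross-modal samples together during training.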

Supplemental Material

MP4 File
This video briefly introduces our work "Visual Redundancy Removal of Composite Images via Multimodal Learning". The video presents our work in seven parts: introduction, related work, database, methods, experiments, results, and conclusion. We hope it helps viewers gain a better understanding of our work.


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. human visual system
  2. multimodal information
  3. visual redundancy

Qualifiers

  • Research-article

Conference

MM '23
Sponsor: MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

