SIR-Former: Stereo Image Restoration Using Transformer

Authors:

Feng Zhao

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

Pages 6377 - 6385

https://doi.org/10.1145/3503161.3548177

Published: 10 October 2022

Abstract

Stereo image pairs record a scene from two different views and thus provide cross-view information for image restoration. However, exploiting this cross-view information for stereo image restoration poses two challenges: cross-view alignment and information fusion. Most existing methods adopt convolutional neural networks to align the views and fuse the information locally, which makes it difficult to capture the global correspondence across stereo images for view alignment and to integrate long-range information across views. In this paper, we propose to address stereo image restoration with a transformer, leveraging its strong capability for modeling long-range context dependencies. Specifically, we construct a stereo image restoration transformer (SIR-Former) to effectively exploit the cross-view correlations. First, to explore the global correspondence for view alignment, we devise a stereo alignment transformer (SAT) module across stereo images, enabling robust alignment under the epipolar constraint. Then, we design a stereo fusion transformer (SFT) module that aggregates the cross-view information in a small horizontal neighborhood, aiming to enhance important features for the subsequent restoration. Extensive experiments show that SIR-Former improves both quantitative and qualitative results on various image restoration tasks (e.g., super-resolution, deblurring, deraining, and low-light enhancement), which demonstrates the effectiveness of the proposed framework.
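To make the idea of attention under the epipolar constraint concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes rectified stereo features, so that corresponding points lie on the same row, and computes attention from each left-view pixel to right-view pixels on that row only. The function name `row_cross_attention` and the `window` parameter are hypothetical; the optional window only illustrates what restricting aggregation to a small horizontal neighborhood could look like, and the abstract does not specify how the SAT or SFT modules are actually built.

```python
import numpy as np

def row_cross_attention(left, right, window=None):
    """Attend from each left-view pixel to right-view pixels on the same row.

    left, right: float arrays of shape (H, W, C), assumed rectified so that
    corresponding points share a row (the epipolar constraint).
    window: optional int (hypothetical parameter); if given, the query at
    column i only sees keys at columns j with |i - j| <= window.
    Returns (warped, attn) with shapes (H, W, C) and (H, W, W).
    """
    H, W, C = left.shape
    # Similarity between every left column and every right column, per row.
    scores = np.einsum("hic,hjc->hij", left, right) / np.sqrt(C)
    if window is not None:
        cols = np.arange(W)
        outside = np.abs(cols[:, None] - cols[None, :]) > window
        # Masked positions receive zero weight after the softmax.
        scores = np.where(outside[None, :, :], -np.inf, scores)
    # Numerically stable softmax over the right-view columns.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    attn = weights / weights.sum(axis=-1, keepdims=True)
    # Aggregate right-view features onto the left view's pixel grid.
    warped = np.einsum("hij,hjc->hic", attn, right)
    return warped, attn

# Usage on random features (illustrative data only).
rng = np.random.default_rng(0)
left = rng.standard_normal((4, 8, 16))
right = rng.standard_normal((4, 8, 16))
warped, attn = row_cross_attention(left, right, window=2)
print(warped.shape, attn.shape)  # (4, 8, 16) (4, 8, 8)
```

Restricting attention to a single row keeps the attention map at size W x W per row instead of (H·W) x (H·W) for the whole image, which is one plausible reason to exploit the epipolar constraint in this setting.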

Supplementary Material

MP4 File (MM22-fp1754.mp4)

Presentation video for the paper: SIR-Former: Stereo Image Restoration Using Transformer.


Index Terms

Computing methodologies

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia

October 2022

7537 pages

ISBN:9781450392037

DOI:10.1145/3503161

General Chairs:
João Magalhães (NOVA University of Lisbon, Portugal)
Alberto del Bimbo (University of Florence, Italy)
Shin'ichi Satoh (National Institute of Informatics, Japan)
Nicu Sebe (University of Trento, Italy)

Program Chairs:
Xavier Alameda-Pineda (Inria, Grenoble, France)
Qin Jin (Renmin University of China, China)
Vincent Oria (New Jersey Institute of Technology, USA)
Laura Toni (University College London, UK)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Anhui Provincial Natural Science Foundation

Conference

MM '22

Sponsor:

SIGMM

MM '22: The 30th ACM International Conference on Multimedia

October 10 - 14, 2022

Lisboa, Portugal

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
