DOI: 10.1145/3503161.3548177
research-article

SIR-Former: Stereo Image Restoration Using Transformer

Published: 10 October 2022

Abstract

Stereo image pairs record a scene from two different views and thus provide cross-view information for image restoration. Exploiting this information, however, poses two challenges: cross-view alignment and information fusion. Most existing methods use convolutional neural networks to align the views and fuse the information locally, which makes it difficult both to capture the global correspondence across stereo images for view alignment and to integrate long-range information across views. In this paper, we propose to address stereo image restoration with a transformer, leveraging its powerful capability to model long-range context dependencies. Specifically, we construct a stereo image restoration transformer (SIR-Former) to effectively exploit cross-view correlations. First, to explore the global correspondence for view alignment, we devise a stereo alignment transformer (SAT) module across stereo images, enabling robust alignment under the epipolar constraint. Then, we design a stereo fusion transformer (SFT) module that aggregates cross-view information in a small horizontal neighborhood, aiming to enhance important features for the subsequent restoration. Extensive experiments show that SIR-Former remarkably improves quantitative and qualitative results on various image restoration tasks (e.g., super-resolution, deblurring, deraining, and low-light enhancement), which demonstrates the effectiveness of the proposed framework.
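The abstract describes the two modules only at a high level, so the following is a minimal NumPy sketch of the underlying idea rather than the authors' implementation. It assumes a rectified stereo pair, so that the epipolar constraint restricts correspondences to the same image row. The function name `row_cross_attention`, the `radius` parameter, the absence of learned projections, and the residual sum at the end are all illustrative assumptions that do not come from the paper.

```python
import numpy as np


def row_cross_attention(query, source, radius=None):
    """Attend from each query pixel to source pixels on the same image row.

    query, source: float arrays of shape (H, W, C).
    radius: if given, only source columns within `radius` of the query column
            are visible; if None, the whole row is visible.
    Returns (output, attn) with shapes (H, W, C) and (H, W, W).
    """
    h, w, c = query.shape
    # Similarity between every query column i and source column j, per row.
    scores = np.einsum("hic,hjc->hij", query, source) / np.sqrt(c)
    if radius is not None:
        cols = np.arange(w)
        visible = np.abs(cols[:, None] - cols[None, :]) <= radius
        scores = np.where(visible[None, :, :], scores, -np.inf)
    # Softmax over source columns, shifted by the row maximum for stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    # Weighted sum of source features for each query pixel.
    output = np.einsum("hij,hjc->hic", attn, source)
    return output, attn


# Toy rectified pair: the right view is the left view shifted by one column.
rng = np.random.default_rng(0)
left = rng.standard_normal((2, 8, 16))
right = np.roll(left, shift=1, axis=1)

# Alignment-style step: each left pixel searches the whole right-view row.
aligned, align_attn = row_cross_attention(left, right)
# Fusion-style step: each left pixel looks only at a small horizontal window.
local, fuse_attn = row_cross_attention(left, aligned, radius=2)
fused = left + local  # residual combination (an illustrative choice)
```

Attending along a full row gives every pixel a global view of its candidate matches, while the `radius` window mimics aggregation in a small horizontal neighborhood. A real model would add learned query, key, and value projections and stack several such layers.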

Supplementary Material

MP4 File (MM22-fp1754.mp4)
Presentation video for the paper: SIR-Former: Stereo Image Restoration Using Transformer.



    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. image super-resolution
    2. stereo images restoration
    3. transformer

    Qualifiers

    • Research-article

    Funding Sources

    • Anhui Provincial Natural Science Foundation

    Conference

    MM '22

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (last 12 months): 153
    • Downloads (last 6 weeks): 10
    Reflects downloads up to 03 Mar 2025

    Citations

    Cited By

    • (2025) EdgeStereoSR: A multi-task network with transformers for stereo image super-resolution considering edge prior. Signal Processing 227 (109719). DOI: 10.1016/j.sigpro.2024.109719. Online publication date: Feb-2025
    • (2025) Learning accurate and enriched features for stereo image super-resolution. Pattern Recognition 159 (111170). DOI: 10.1016/j.patcog.2024.111170. Online publication date: Mar-2025
    • (2025) Conti-Fuse: A novel continuous decomposition-based fusion framework for infrared and visible images. Information Fusion 117 (102839). DOI: 10.1016/j.inffus.2024.102839. Online publication date: May-2025
    • (2025) SGDFormer: One-stage transformer-based architecture for cross-spectral stereo image guided denoising. Information Fusion 113 (102603). DOI: 10.1016/j.inffus.2024.102603. Online publication date: Jan-2025
    • (2024) PPTFormer. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (893-901). DOI: 10.24963/ijcai.2024/99. Online publication date: 3-Aug-2024
    • (2024) Learning Optimal Combination Patterns for Lightweight Stereo Image Super-Resolution. Proceedings of the 32nd ACM International Conference on Multimedia (5566-5574). DOI: 10.1145/3664647.3680816. Online publication date: 28-Oct-2024
    • (2024) Progressive Stereo Image Dehazing Network via Cross-View Region Interaction. IEEE Transactions on Multimedia 26 (7490-7502). DOI: 10.1109/TMM.2024.3368918. Online publication date: 22-Feb-2024
    • (2024) NLSIT: A Non-Local Stereo Interaction Transformer for Stereo Image Super-Resolution. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (3965-3969). DOI: 10.1109/ICASSP48485.2024.10447088. Online publication date: 14-Apr-2024
    • (2024) EpiRiskNet: incorporating graph structure and static data as prior knowledge for improved time-series forecasting. Applied Intelligence 54:17-18 (7864-7877). DOI: 10.1007/s10489-024-05514-x. Online publication date: 14-Jun-2024
    • (2023) Mutual-Guided Dynamic Network for Image Fusion. Proceedings of the 31st ACM International Conference on Multimedia (1779-1788). DOI: 10.1145/3581783.3612261. Online publication date: 26-Oct-2023
