TempFormer: Temporally Consistent Transformer for Video Denoising

Song, Mingyang; Zhang, Yang; Aydın, Tunç O.

doi:10.1007/978-3-031-19800-7_28

Mingyang Song^12,13,
Yang Zhang¹³ &
Tunç O. Aydın¹³

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13679))

Included in the following conference series:

European Conference on Computer Vision

2871 Accesses
6 Citations

Abstract

Video denoising is a low-level vision task that aims to restore high quality videos from noisy content. Vision Transformer (ViT) is a new machine learning architecture that has shown promising performance on both high-level and low-level image tasks. In this paper, we propose a modified ViT architecture for video processing tasks, introducing a new training strategy and loss function to enhance temporal consistency without compromising spatial quality. Specifically, we propose an efficient hybrid Transformer-based model, TempFormer, which composes Spatio-Temporal Transformer Blocks (STTB) and 3D convolutional layers. The proposed STTB learns the temporal information between neighboring frames implicitly by utilizing the proposed Joint Spatio-Temporal Mixer module for attention calculation and feature aggregation in each ViT block. Moreover, existing methods suffer from temporal inconsistency artifacts that are problematic in practical cases and distracting to the viewers. We propose a sliding block strategy with recurrent architecture, and use a new loss term, Overlap Loss, to alleviate the flickering between adjacent frames. Our method produces state-of-the-art spatio-temporal denoising quality with significantly improved temporal coherency, and requires less computational resources to achieve comparable denoising quality with competing methods (Fig. 1).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Ali, A., et al.: XCIT: cross-covariance image transformers. Adv. Neural Inf. Process. Syst. 34, 1–10 (2021)
Google Scholar
Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding. arXiv preprint arXiv:2102.05095 (2021)
Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 60–65. IEEE (2005)
Google Scholar
Cai, Z., Zhang, Y., Manzi, M., Oztireli, C., Gross, M., Aydin, T.O.: Robust image denoising using kernel predicting networks (2021)
Google Scholar
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chapter Google Scholar
Chan, K.C., Zhou, S., Xu, X., Loy, C.C.: BasicVSR++: improving video super-resolution with enhanced propagation and alignment. arXiv preprint arXiv:2104.13371 (2021)
Chen, H., et al.: Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12299–12310 (2021)
Google Scholar
Dai, X., Chen, Y., Xiao, B., Chen, D., Liu, M., Yuan, L., Zhang, L.: Dynamic head: unifying object detection heads with attentions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7373–7382 (2021)
Google Scholar
Dai, Z., Liu, H., Le, Q., Tan, M.: CoatNet: marrying convolution and attention for all data sizes. Adv. Neural Inf. Process. Syst. 34, 1–12 (2021)
Google Scholar
Dosovitskiy, A., et al.: An image is worth 16 \(\times \) 16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Lai, W.-S., Huang, J.-B., Wang, O., Shechtman, E., Yumer, E., Yang, M.-H.: Learning blind video temporal consistency. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 179–195. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_11
Chapter Google Scholar
Lei, C., Xing, Y., Chen, Q.: Blind video temporal consistency via deep video prior. Adv. Neural Inf. Process. Syst. 33, 1083–1093 (2020)
Google Scholar
Li, K., et al.: Uniformer: unified transformer for efficient spatiotemporal representation learning. arXiv preprint arXiv:2201.04676 (2022)
Liang, J., et al.: VRT: a video restoration transformer. arXiv preprint arXiv:2201.12288 (2022)
Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: SwinIR: image restoration using Swin transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1833–1844 (2021)
Google Scholar
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)
Maggioni, M., Boracchi, G., Foi, A., Egiazarian, K.: Video denoising, deblocking, and enhancement through separable 4-d nonlocal spatiotemporal transforms. IEEE Trans. Image Process. 21(9), 3952–3966 (2012)
Article MathSciNet MATH Google Scholar
Maggioni, M., Huang, Y., Li, C., Xiao, S., Fu, Z., Song, F.: Efficient multi-stage video denoising with recurrent spatio-temporal fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3466–3475 (2021)
Google Scholar
Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 724–732 (2016)
Google Scholar
Tassano, M., Delon, J., Veit, T.: DVDNet: a fast network for deep video denoising. In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 1805–1809. IEEE (2019)
Google Scholar
Tassano, M., Delon, J., Veit, T.: FastDVDNet: towards real-time deep video denoising without flow estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1354–1363 (2020)
Google Scholar
Vaksman, G., Elad, M., Milanfar, P.: Patch craft: video denoising by deep modeling and patch matching. arXiv preprint arXiv:2103.13767 (2021)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Google Scholar
Wang, C., Zhou, S.K., Cheng, Z.: First image then video: A two-stage network for spatiotemporal video denoising. arXiv preprint arXiv:2001.00346 (2020)
Xu, M., et al.: End-to-end semi-supervised object detection with soft teacher. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3060–3069 (2021)
Google Scholar
Yang, J., et al.: Focal self-attention for local-global interactions in vision transformers (2021). arXiv preprint arXiv:2107.00641
Yu, W., et al.: Metaformer is actually what you need for vision. arXiv preprint arXiv:2111.11418 (2021)
Yuan, L., et al.: Florence: a new foundation model for computer vision. arXiv preprint arXiv:2111.11432 (2021)
Yue, H., Cao, C., Liao, L., Chu, R., Yang, J.: Supervised raw video denoising with a benchmark dataset on dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2301–2310 (2020)
Google Scholar
Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Restormer: efficient transformer for high-resolution image restoration. arXiv preprint arXiv:2111.09881 (2021)
Zhai, X., Kolesnikov, A., Houlsby, N., Beyer, L.: Scaling vision transformers (2021)
Google Scholar
Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142–3155 (2017)
Article MathSciNet MATH Google Scholar
Zhang, K., Zuo, W., Gu, S., Zhang, L.: Learning deep CNN denoiser prior for image restoration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3929–3938 (2017)
Google Scholar
Zhang, K., Zuo, W., Zhang, L.: FFDNet: toward a fast and flexible solution for CNN-based image denoising. IEEE Trans. Image Process. 27(9), 4608–4622 (2018)
Article MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

ETH Zurich, Zurich, Switzerland
Mingyang Song
DisneyResearch|Studios, Zurich, Switzerland
Mingyang Song, Yang Zhang & Tunç O. Aydın

Authors

Mingyang Song
View author publications
You can also search for this author in PubMed Google Scholar
Yang Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Tunç O. Aydın
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yang Zhang .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 19340 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Song, M., Zhang, Y., Aydın, T.O. (2022). TempFormer: Temporally Consistent Transformer for Video Denoising. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13679. Springer, Cham. https://doi.org/10.1007/978-3-031-19800-7_28

Download citation

DOI: https://doi.org/10.1007/978-3-031-19800-7_28
Published: 09 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19799-4
Online ISBN: 978-3-031-19800-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

TempFormer: Temporally Consistent Transformer for Video Denoising