Abstract
Transformers have achieved state-of-the-art performance on the inverse problem of video Snapshot Compressive Imaging (SCI), whose ill-posedness is rooted in the mixed degradation of spatial masking and temporal aliasing. However, previous Transformers lack insight into this degradation and thus have limited performance and efficiency. In this work, we tailor an efficient reconstruction architecture that performs no temporal aggregation in early layers and uses the Hierarchical Separable Video Transformer (HiSViT) as its building block. HiSViT is built from multiple groups of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and Gated Self-Modulated Feed-Forward Network (GSM-FFN) with dense connections, each group operating on a separate channel portion at a different scale, enabling multi-scale interactions and long-range modeling. By separating spatial operations from temporal ones, CSS-MSA introduces an inductive bias of paying more attention within frames than between frames while reducing computational overhead. GSM-FFN further enhances locality via a gating mechanism and factorized spatial-temporal convolutions. Extensive experiments demonstrate that our method outperforms previous methods by more than 0.5 dB with comparable or fewer parameters and lower complexity. The source code and pretrained models are released at https://github.com/pwangcs/HiSViT.
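To make the separable design concrete, below is a minimal PyTorch sketch of factorized spatial-then-temporal self-attention in the spirit of CSS-MSA: attention is first computed among spatial tokens within each frame, then across frames at each spatial location. The module name, tensor layout, and single-scale simplification are illustrative assumptions rather than the released HiSViT implementation (which additionally uses cross-scale channel groups and GSM-FFN); see the linked repository for the actual code.

# Hedged sketch, not the official HiSViT code: factorized spatial-temporal
# self-attention in the spirit of CSS-MSA. Layout (B, T, N, C) is an assumption.
import torch
import torch.nn as nn

class SeparableSpatioTemporalAttention(nn.Module):
    """MHSA over spatial tokens within each frame, then over frames per token."""
    def __init__(self, dim: int, num_heads: int = 4, use_temporal: bool = True):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Setting use_temporal=False mimics the paper's design choice of
        # skipping temporal aggregation in early layers.
        self.temporal_attn = (nn.MultiheadAttention(dim, num_heads, batch_first=True)
                              if use_temporal else None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -- T frames, N spatial tokens, C channels.
        b, t, n, c = x.shape
        # Spatial attention: each token attends only within its own frame,
        # reflecting the bias toward intra-frame over inter-frame attention.
        xs = x.reshape(b * t, n, c)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, n, c)
        if self.temporal_attn is None:
            return x
        # Temporal attention: each spatial location attends across frames,
        # modeling temporal aliasing without mixing spatial positions.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        xt, _ = self.temporal_attn(xt, xt, xt)
        return x + xt.reshape(b, n, t, c).permute(0, 2, 1, 3)

if __name__ == "__main__":
    block = SeparableSpatioTemporalAttention(dim=64, num_heads=4)
    tokens = torch.randn(2, 8, 256, 64)  # batch=2, 8 frames, 16x16 spatial tokens
    print(block(tokens).shape)           # torch.Size([2, 8, 256, 64])

One reason such factorization saves computation: joint space-time attention over T frames of N tokens costs O((TN)^2) per layer, whereas the separable form costs O(T*N^2 + N*T^2), which is consistent with the overhead reduction claimed in the abstract.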
Acknowledgements
This work was supported by the National Natural Science Foundation of China (grant number 62271414), Zhejiang Provincial Distinguished Young Scientist Foundation (grant number LR23F010001), Zhejiang “Pioneer” and “Leading Goose” R&D Program (grant numbers 2024SDXHDX0006 and 2024C03182), the Key Project of Westlake Institute for Optoelectronics (grant number 2023GD007), the 2023 International Sci-tech Cooperation Projects under the purview of the “Innovation Yongjiang 2035” Key R&D Program (grant number 2024Z126), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and the Fundamental Research Funds for the Central Universities.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, P., Zhang, Y., Wang, L., Yuan, X. (2025). Hierarchical Separable Video Transformer for Snapshot Compressive Imaging. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15139. Springer, Cham. https://doi.org/10.1007/978-3-031-73004-7_7
DOI: https://doi.org/10.1007/978-3-031-73004-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73003-0
Online ISBN: 978-3-031-73004-7
eBook Packages: Computer Science, Computer Science (R0)