
Hierarchical Separable Video Transformer for Snapshot Compressive Imaging

  • Conference paper
  • Appears in: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Transformers have achieved state-of-the-art performance on the inverse problem of video Snapshot Compressive Imaging (SCI), whose ill-posedness is rooted in the mixed degradation of spatial masking and temporal aliasing. However, previous Transformers lack insight into this degradation and thus offer limited performance and efficiency. In this work, we tailor an efficient reconstruction architecture that performs no temporal aggregation in early layers and uses the Hierarchical Separable Video Transformer (HiSViT) as its building block. HiSViT is built from multiple groups of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and Gated Self-Modulated Feed-Forward Network (GSM-FFN) with dense connections, each of which operates within a separate channel portion at a different scale, enabling multi-scale interactions and long-range modeling. By separating spatial operations from temporal ones, CSS-MSA introduces an inductive bias of paying more attention within frames than between frames while reducing computational overhead. GSM-FFN further enhances locality via a gated mechanism and factorized spatial-temporal convolutions. Extensive experiments demonstrate that our method outperforms previous methods by >0.5 dB with comparable or fewer parameters and lower complexity. The source code and pretrained models are released at https://github.com/pwangcs/HiSViT.
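The two ideas the abstract names, the SCI mixed degradation and spatial-temporal separable attention, can be sketched in NumPy. This is an illustrative toy under my own assumptions (function names, shapes, and the plain factorized attention are mine); HiSViT's actual CSS-MSA additionally uses cross-scale channel groups and learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product self-attention over the token axis.
    d = q.shape[-1]
    return softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d)) @ v

def separable_st_attention(x):
    """Factorized spatial-then-temporal self-attention (sketch).

    x: (T, N, C) -- T frames, N spatial tokens, C channels.
    Spatial attention mixes the N tokens within each frame; temporal
    attention then mixes the same spatial site across the T frames,
    so most of the mixing budget stays within frames.
    """
    x = attention(x, x, x)            # spatial: per-frame (T, N, C)
    xt = x.transpose(1, 0, 2)         # (N, T, C): per-site sequences
    xt = attention(xt, xt, xt)        # temporal: across frames
    return xt.transpose(1, 0, 2)      # back to (T, N, C)

# SCI forward model: one snapshot aggregates T masked frames,
# y = sum_t M_t * x_t -- the spatial-masking + temporal-aliasing
# degradation that reconstruction must invert.
T, H, W = 8, 4, 4
video = np.random.rand(T, H, W)
masks = (np.random.rand(T, H, W) > 0.5).astype(float)
y = (masks * video).sum(axis=0)       # (H, W) coded snapshot

out = separable_st_attention(np.random.rand(T, H * W, 16))
print(out.shape)  # (8, 16, 16)
```

Note how the factorization costs O(T·N² + N·T²) score entries instead of O((T·N)²) for joint spatio-temporal attention, which is the efficiency argument behind separating the two axes.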




Acknowledgements

This work was supported by the National Natural Science Foundation of China (grant number 62271414), Zhejiang Provincial Distinguished Young Scientist Foundation (grant number LR23F010001), Zhejiang “Pioneer” and “Leading Goose” R&D Program (grant number 2024SDXHDX0006, 2024C03182), the Key Project of Westlake Institute for Optoelectronics (grant number 2023GD007), the 2023 International Sci-tech Cooperation Projects under the purview of the “Innovation Yongjiang 2035” Key R&D Program (grant number 2024Z126), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and the Fundamental Research Funds for the Central Universities.

Author information

Corresponding author

Correspondence to Xin Yuan.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1220 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, P., Zhang, Y., Wang, L., Yuan, X. (2025). Hierarchical Separable Video Transformer for Snapshot Compressive Imaging. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15139. Springer, Cham. https://doi.org/10.1007/978-3-031-73004-7_7

  • DOI: https://doi.org/10.1007/978-3-031-73004-7_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73003-0

  • Online ISBN: 978-3-031-73004-7

  • eBook Packages: Computer Science, Computer Science (R0)
