Abstract
Transformers have achieved state-of-the-art performance on the inverse problem of video Snapshot Compressive Imaging (SCI), whose ill-posedness is rooted in the mixed degradation of spatial masking and temporal aliasing. However, previous Transformers lack insight into this degradation and thus have limited performance and efficiency. In this work, we tailor an efficient reconstruction architecture that performs no temporal aggregation in early layers and uses the Hierarchical Separable Video Transformer (HiSViT) as its building block. HiSViT is built from multiple groups of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and Gated Self-Modulated Feed-Forward Network (GSM-FFN) with dense connections, each group operating on a separate channel portion at a different scale, enabling multi-scale interactions and long-range modeling. By separating spatial operations from temporal ones, CSS-MSA introduces an inductive bias of paying more attention within frames than between frames while reducing computational overhead. GSM-FFN further enhances locality via a gating mechanism and factorized spatial-temporal convolutions. Extensive experiments demonstrate that our method outperforms previous methods by more than 0.5 dB with comparable or fewer parameters and lower complexity. The source code and pretrained models are released at https://github.com/pwangcs/HiSViT.
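To make the separable design concrete, below is a minimal PyTorch sketch of factorized spatial-then-temporal self-attention in the spirit of CSS-MSA: attention is first computed among spatial tokens within each frame, then across frames at each spatial location. The module name, tensor layout, and single-scale simplification are illustrative assumptions rather than the released HiSViT implementation (which additionally uses cross-scale channel groups and GSM-FFN); see the linked repository for the actual code.

# Hedged sketch, not the official HiSViT code: factorized spatial-temporal
# self-attention in the spirit of CSS-MSA. Layout (B, T, N, C) is an assumption.
import torch
import torch.nn as nn

class SeparableSpatioTemporalAttention(nn.Module):
    """MHSA over spatial tokens within each frame, then over frames per token."""
    def __init__(self, dim: int, num_heads: int = 4, use_temporal: bool = True):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Setting use_temporal=False mimics the paper's design choice of
        # skipping temporal aggregation in early layers.
        self.temporal_attn = (nn.MultiheadAttention(dim, num_heads, batch_first=True)
                              if use_temporal else None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -- T frames, N spatial tokens, C channels.
        b, t, n, c = x.shape
        # Spatial attention: each token attends only within its own frame,
        # reflecting the bias toward intra-frame over inter-frame attention.
        xs = x.reshape(b * t, n, c)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, n, c)
        if self.temporal_attn is None:
            return x
        # Temporal attention: each spatial location attends across frames,
        # modeling temporal aliasing without mixing spatial positions.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        xt, _ = self.temporal_attn(xt, xt, xt)
        return x + xt.reshape(b, n, t, c).permute(0, 2, 1, 3)

if __name__ == "__main__":
    block = SeparableSpatioTemporalAttention(dim=64, num_heads=4)
    tokens = torch.randn(2, 8, 256, 64)  # batch=2, 8 frames, 16x16 spatial tokens
    print(block(tokens).shape)           # torch.Size([2, 8, 256, 64])

One reason such factorization saves computation: joint space-time attention over T frames of N tokens costs O((TN)^2) per layer, whereas the separable form costs O(T*N^2 + N*T^2), which is consistent with the overhead reduction claimed in the abstract.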
Acknowledgements
This work was supported by the National Natural Science Foundation of China (grant number 62271414), Zhejiang Provincial Distinguished Young Scientist Foundation (grant number LR23F010001), Zhejiang “Pioneer” and “Leading Goose” R&D Program (grant numbers 2024SDXHDX0006 and 2024C03182), the Key Project of Westlake Institute for Optoelectronics (grant number 2023GD007), the 2023 International Sci-tech Cooperation Projects under the purview of the “Innovation Yongjiang 2035” Key R&D Program (grant number 2024Z126), Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), and the Fundamental Research Funds for the Central Universities.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, P., Zhang, Y., Wang, L., Yuan, X. (2025). Hierarchical Separable Video Transformer for Snapshot Compressive Imaging. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15139. Springer, Cham. https://doi.org/10.1007/978-3-031-73004-7_7
DOI: https://doi.org/10.1007/978-3-031-73004-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73003-0
Online ISBN: 978-3-031-73004-7
eBook Packages: Computer Science, Computer Science (R0)