DOI: 10.1145/3394171.3413527
research-article

Video Synthesis via Transform-Based Tensor Neural Network

Published: 12 October 2020

ABSTRACT

Video frame synthesis is an important task in computer vision and has drawn great interest in a wide range of applications. However, existing neural network methods do not explicitly impose tensor low-rankness on videos to capture the spatiotemporal correlations in a high-dimensional space, while existing iterative algorithms require hand-crafted parameters and have relatively long running times. In this paper, we propose a novel multi-phase deep neural network, Transform-Based Tensor-Net, that exploits the low-rank structure of video data in a learned transform domain by unfolding an Iterative Shrinkage-Thresholding Algorithm (ISTA) for tensor signal recovery. Our design is based on two observations: (i) both linear and nonlinear transforms can be implemented by a neural network layer, and (ii) the soft-thresholding operator corresponds to an activation function. Further, such an unfolding design achieves nearly real-time inference at the cost of longer training time and enjoys an interpretable nature as a byproduct. Experimental results on the KTH and UCF-101 datasets show that, compared with the state-of-the-art methods DVF and Super SloMo, the proposed scheme improves the Peak Signal-to-Noise Ratio (PSNR) of video interpolation and prediction by 4.13 dB and 4.26 dB, respectively.
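To make observation (ii) concrete, here is a minimal NumPy sketch (not the paper's tensor formulation) of the soft-thresholding operator and a plain ISTA loop for the vector LASSO problem; the unfolding idea replaces each such iteration with a network phase whose transform and threshold are learned.

```python
import numpy as np

def soft_threshold(x, lam):
    """Soft-thresholding S_lam(x) = sign(x) * max(|x| - lam, 0).

    Elementwise and nonlinear -- this is the piece that behaves like
    an activation function in an unfolded network.
    """
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista(A, b, lam, num_iters=100):
    """Plain ISTA for min_x 0.5 * ||Ax - b||^2 + lam * ||x||_1.

    Each iteration is a linear step (gradient of the data term)
    followed by soft-thresholding; unfolding maps each iteration
    to one phase of a trainable network.
    """
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)           # gradient of 0.5 * ||Ax - b||^2
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

In Transform-Based Tensor-Net the analogous linear step acts in a learned transform domain on tensors rather than vectors, and the threshold values are learned per phase instead of hand-crafted.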

Supplemental Material

3394171.3413527.mp4 (mp4, 10.5 MB)

References

  1. Alex Vlachos. 2018. Introducing SteamVR Motion Smoothing beta. https://steamcommunity.com/games/250820/announcements/detail/1696061565016280495.
  2. Simon Baker, Daniel Scharstein, JP Lewis, Stefan Roth, Michael J Black, and Richard Szeliski. 2011. A database and evaluation methodology for optical flow. International Journal of Computer Vision 92, 1 (2011), 1--31.
  3. Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. 2019. Depth-aware video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3703--3712.
  4. Amir Beck and Marc Teboulle. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2, 1 (2009), 183--202.
  5. Judith Bütepage, Michael J Black, Danica Kragic, and Hedvig Kjellström. 2017. Deep representation learning for human motion prediction and classification. In IEEE Conference on Computer Vision and Pattern Recognition. 6158--6166.
  6. Kanglin Chen and Dirk A Lorenz. 2011. Image sequence interpolation using optimal control. Journal of Mathematical Imaging and Vision 41, 3 (2011).
  7. Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee. 2020. Channel attention is all you need for video frame interpolation. In AAAI Conference on Artificial Intelligence.
  8. David L. Donoho. 2006. Compressed sensing. IEEE Transactions on Information Theory 52, 4 (2006), 1289--1306.
  9. Lijie Fan, Wenbing Huang, Chuang Gan, Junzhou Huang, and Boqing Gong. Controllable image-to-video translation: A case study on facial expression generation. In Proceedings of the AAAI Conference on Artificial Intelligence.
  10. Chelsea Finn, Ian Goodfellow, and Sergey Levine. 2016. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems. 64--72.
  11. Chuang Gan, Deng Huang, Hang Zhao, Joshua B Tenenbaum, and Antonio Torralba. 2020. Music gesture for visual sound separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  12. Xiaochen Han, Bo Wu, Zheng Shou, Xiao-Yang Liu, Yimeng Zhang, and Linghe Kong. 2020. Tensor FISTA-Net for real-time snapshot compressive imaging. In AAAI. 10933--10940.
  13. John R Hershey, Jonathan Le Roux, and Felix Weninger. 2014. Deep unfolding: Model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574 (2014).
  14. Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 5 (1989).
  15. Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. 2018. Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems. 517--526.
  16. Chao Jia and Brian L Evans. 2013. 3D rotational video stabilization using manifold optimization. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2493--2497.
  17. Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik G. Learned-Miller, and Jan Kautz. 2018. Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In IEEE Conference on Computer Vision and Pattern Recognition.
  18. Xiaojie Jin, Huaxin Xiao, Xiaohui Shen, Jimei Yang, Zhe Lin, Yunpeng Chen, Zequn Jie, Jiashi Feng, and Shuicheng Yan. 2017. Predicting scene parsing and motion dynamics in the future. In Advances in Neural Information Processing Systems. 6915--6924.
  19. Eric Kernfeld, Misha Kilmer, and Shuchin Aeron. 2015. Tensor-tensor products with invertible linear transforms. Linear Algebra Appl. 485 (2015), 545--570.
  20. Tamara G. Kolda and B. W. Bader. 2009. Tensor decompositions and applications. SIAM Rev. 51, 3 (2009), 455--500.
  21. Akshay Krishnamurthy and Aarti Singh. 2013. Low-rank matrix and tensor completion via adaptive sampling. In Advances in Neural Information Processing Systems.
  22. Shohei Kubota, Ryoichiro Yoshida, and Yoshimitsu Kuroki. 2018. L0 norm restricted LIC with ADMM. In International Workshop on Advanced Image Technology (IWAIT). IEEE, 1--4.
  23. Chao Li, Qibin Zhao, Junhua Li, Andrzej Cichocki, and Lili Guo. 2015. Multitensor completion with common structures. In AAAI Conference on Artificial Intelligence.
  24. Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. 2018. Flow-grounded spatial-temporal video prediction from still images. In Proceedings of the European Conference on Computer Vision.
  25. Chia-Kai Liang and Fuhao Shi. 2017. Fused video stabilization on the Pixel 2 and Pixel 2 XL. https://ai.googleblog.com/2017/11/fused-video-stabilization-onpixel-2.html.
  26. Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. 2013. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1 (2013), 208--220.
  27. Xiao-Yang Liu and Xiaodong Wang. 2017. Fourth-order tensors with multidimensional discrete transforms. arXiv preprint arXiv:1705.01576 (2017).
  28. Ziwei Liu, Raymond Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. 2017. Video frame synthesis using Deep Voxel Flow. In Proceedings of the International Conference on Computer Vision. 4463--4471.
  29. Jiawei Ma, Xiao-Yang Liu, Zheng Shou, and Xin Yuan. 2019. Deep tensor ADMM-Net for snapshot compressive imaging. In Proceedings of the IEEE International Conference on Computer Vision.
  30. Yasuyuki Matsushita, Eyal Ofek, Weina Ge, Xiaoou Tang, and Heung-Yeung Shum. 2006. Full-frame video stabilization with motion inpainting. IEEE Transactions on Pattern Analysis and Machine Intelligence 7 (2006), 1150--1163.
  31. Simon Niklaus, Long Mai, and Feng Liu. 2017. Video frame interpolation via adaptive convolution. In IEEE Conference on Computer Vision and Pattern Recognition. 670--679.
  32. Minho Park, Hak Gu Kim, Sangmin Lee, and Yong Man Ro. 2020. Robust video frame interpolation with exceptional motion map. IEEE Transactions on Circuits and Systems for Video Technology (2020).
  33. Tomer Peleg, Pablo Szekely, Doron Sabo, and Omry Sendik. 2019. IM-Net for high resolution video frame interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  34. Christian Schuldt, Ivan Laptev, and Barbara Caputo. 2004. Recognizing human actions: a local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition, Vol. 3.
  35. Guangyao Shen, Wenbing Huang, Chuang Gan, Mingkui Tan, Junzhou Huang, Wenwu Zhu, and Boqing Gong. 2019. Facial image-to-video translation by a hidden affine transformation. In Proceedings of the 27th ACM International Conference on Multimedia. 2505--2513.
  36. Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human action classes from videos in the wild. CoRR (2012). arXiv:1212.0402 http://arxiv.org/abs/1212.0402
  37. Jian Sun, Huibin Li, Zongben Xu, et al. 2016. Deep ADMM-Net for compressive sensing MRI. In Advances in Neural Information Processing Systems.
  38. Robert Tibshirani. 1996. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B (Methodological) 58 (1996).
  39. Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. 2017. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033 (2017).
  40. Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems.
  41. Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600--612.
  42. Manuel Werlberger, Thomas Pock, Markus Unger, and Horst Bischof. 2011. Optical flow guided TV-L1 video interpolation and restoration. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition. Springer, 273--286.
  43. John Wright, Arvind Ganesh, Shankar Rao, Yigang Peng, and Yi Ma. 2009. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In Advances in Neural Information Processing Systems.
  44. Huaying Wu, Xiao-Yang Liu, Luoyi Fu, and Xinbing Wang. 2018. Energy-efficient and robust tensor-encoder for wireless camera networks in Internet of Things. IEEE Transactions on Network Science and Engineering 6, 4 (2018), 646--656.
  45. Yangyang Xu and Wotao Yin. 2013. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences 6, 3 (2013).
  46. Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. 2016. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems.
  47. Chenggang Yan, Zhisheng Li, Yongbing Zhang, Yutao Liu, Xiangyang Ji, and Yongdong Zhang. 2020. Depth image denoising using nuclear norm and learning graph model. ACM Transactions on Multimedia Computing, Communications, and Applications (2020).
  48. Chenggang Yan, Yunbin Tu, Xingzheng Wang, Yongbing Zhang, Xinhong Hao, Yongdong Zhang, and Qionghai Dai. 2019. STAT: Spatial-temporal attention mechanism for video captioning. IEEE Transactions on Multimedia 22, 1 (2019).
  49. Ming Yuan and Cun-Hui Zhang. 2016. On tensor completion via nuclear norm minimization. Foundations of Computational Mathematics 16, 4 (2016).
  50. Jian Zhang and Bernard Ghanem. 2018. ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1828--1837.
  51. Zemin Zhang and Shuchin Aeron. 2017. Exact tensor completion using t-SVD. IEEE Transactions on Signal Processing 65, 6 (2017), 1511--1526.
  52. Qibin Zhao, Liqing Zhang, and Andrzej Cichocki. 2015. Bayesian CP factorization of incomplete tensors with automatic rank determination. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 9 (2015), 1751--1763.
  53. Yipin Zhou and Tamara L Berg. 2016. Learning temporal transformations from time-lapse videos. In European Conference on Computer Vision. 262--277.

Published in

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020, 4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171
Copyright © 2020 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%
