L2BEC2: Local Lightweight Bidirectional Encoding and Channel Attention Cascade for Video Frame Interpolation

Abstract
Video frame interpolation (VFI) is important for many video applications, yet it remains challenging even in the era of deep learning. Some existing VFI models directly adopt lightweight network architectures, so the synthesized in-between frames are blurry and contain artifacts caused by imprecise motion representation. Other VFI models rely on heavy architectures with large numbers of parameters, which prevents their deployment on small terminal devices. To address these issues, we propose a local lightweight VFI network (L2BEC2) built on a bidirectional encoding structure with a channel attention cascade. Specifically, we improve visual quality by introducing a forward-and-backward encoding structure with a channel attention cascade that better characterizes motion information. Furthermore, we apply a local lightweight strategy to the state-of-the-art Adaptive Collaboration of Flows (AdaCoF) model to reduce its parameter count. Compared with the original AdaCoF model, the proposed L2BEC2 achieves a performance gain with only one-third of the parameters and performs favorably against state-of-the-art methods on public datasets. Our source code is available at https://github.com/Pumpkin123709/LBEC.git.
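The abstract does not specify the exact design of the channel attention cascade, but channel attention modules of this kind typically follow a squeeze-and-excitation pattern: global average pooling per channel, a small bottleneck MLP, and a sigmoid gate that rescales each feature channel. The following NumPy sketch illustrates that generic pattern only; the function name, the reduction ratio `r`, and the weight shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def channel_attention(x, w1, b1, w2, b2):
    """Squeeze-and-excitation style channel attention (illustrative sketch).

    x: feature map of shape (C, H, W).
    w1, b1: reduction FC layer mapping C -> C // r.
    w2, b2: expansion FC layer mapping C // r -> C.
    Returns the input rescaled by learned per-channel weights.
    """
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    s = x.mean(axis=(1, 2))
    # Excitation: bottleneck MLP with ReLU, then sigmoid gating -> (C,) in (0, 1)
    h = np.maximum(0.0, w1 @ s + b1)
    a = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))
    # Scale: reweight each channel of the feature map
    return x * a[:, None, None]

# Toy usage: 8 channels, reduction ratio r = 4, random weights
rng = np.random.default_rng(0)
C, r = 8, 4
x = rng.standard_normal((C, 16, 16))
w1, b1 = rng.standard_normal((C // r, C)), np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)), np.zeros(C)
y = channel_attention(x, w1, b1, w2, b2)
print(y.shape)  # (8, 16, 16)
```

Because the gate values lie in (0, 1), the module can only attenuate channels, never amplify them; in a cascade, successive gates progressively emphasize the channels most informative for motion.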