Abstract
Video colorization aims to add color to grayscale or monochrome videos. Although existing methods have achieved substantial results in image colorization, video colorization poses greater challenges because it additionally requires temporal consistency. Moreover, systematic reviews of video colorization methods are rare. In this paper, we review existing state-of-the-art video colorization methods. Because maintaining spatial-temporal consistency is pivotal to video colorization, we further examine how existing methods have evolved in this respect, offering a novel perspective on the field. Video colorization methods fall into four main categories: optical-flow based, scribble-based, exemplar-based, and fully automatic methods. Each category has its limitations: optical-flow based methods rely heavily on accurate optical-flow estimation, scribble-based methods require extensive user interaction and modification, exemplar-based methods struggle to obtain suitable reference images, and fully automatic methods often fail to meet specific colorization requirements. We also discuss existing challenges and highlight several future research opportunities worth exploring.
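To make the notion of temporal consistency concrete: a common way to quantify it, used in various forms across the surveyed methods, is the flow-warped color error between consecutive frames, where the previous colorized frame is warped toward the current one with estimated optical flow and the residual color difference is averaged over non-occluded pixels. The minimal Python sketch below illustrates the idea with OpenCV; the function name `warp_error`, the occlusion heuristic, and the Farneback flow suggestion are illustrative assumptions, not the exact metric of any cited work.

```python
import numpy as np
import cv2

def warp_error(prev_rgb, curr_rgb, flow, occ_thresh=20.0):
    """Flow-warped color error between consecutive frames (illustrative sketch).

    prev_rgb, curr_rgb: float32 images of shape (H, W, 3) in [0, 1].
    flow: optical flow from frame t-1 to frame t, shape (H, W, 2), e.g.,
          from cv2.calcOpticalFlowFarneback on the grayscale inputs.
    """
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # Backward-warp frame t-1 into frame t's coordinates by negating the
    # forward flow at each target pixel (a common approximation).
    map_x = (grid_x - flow[..., 0]).astype(np.float32)
    map_y = (grid_y - flow[..., 1]).astype(np.float32)
    warped_prev = cv2.remap(prev_rgb, map_x, map_y, cv2.INTER_LINEAR)

    # Crude occlusion handling: ignore pixels with very large motion.
    valid = (np.linalg.norm(flow, axis=-1) < occ_thresh)[..., None]
    diff = np.abs(curr_rgb - warped_prev) * valid
    return diff.sum() / max(3 * valid.sum(), 1)  # mean over valid pixels
```

A lower value indicates temporally smoother colorization; in practice, learned estimators such as PWC-Net or FlowNet 2.0 typically replace the classical flow estimator.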
Ethics declarations
Conflict of Interest: The authors declare that they have no conflict of interest.
Additional information
This work was supported by the National Natural Science Foundation of China under Grant Nos. U22B2049 and 62332010.
Zhong-Zheng Peng is currently pursuing his Ph.D. degree at the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing. His research interests include image/video colorization, super-resolution, and other restoration tasks.
Yi-Xin Yang is currently pursuing his Ph.D. degree at the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing. He received his M.S. degree in electrical engineering from the University of California, Riverside, and his B.S. degree in electronic information engineering from the University of Electronic Science and Technology of China, Chengdu. His research interests include image/video colorization, super-resolution, and other restoration tasks.
Jin-Hui Tang received his B.E. and Ph.D. degrees from the University of Science and Technology of China, Hefei, in 2003 and 2008, respectively. He is currently a professor at Nanjing University of Science and Technology, Nanjing. His research interests include multimedia analysis and computer vision.
Jin-Shan Pan received his Ph.D. degree in computational mathematics from the Dalian University of Technology, Dalian, in 2017. He is a professor at the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing. His research interests include image deblurring, image/video analysis and enhancement, and related vision problems.
Cite this article
Peng, ZZ., Yang, YX., Tang, JH. et al. Video Colorization: A Survey. J. Comput. Sci. Technol. 39, 487–508 (2024). https://doi.org/10.1007/s11390-024-4143-z