Abstract
Handling diverse objects with different colours is a significant challenge for image colourisation techniques, and for complex real-world scenes the existing colourisation algorithms often fail to maintain colour consistency. In this work, we integrate textual descriptions as an auxiliary condition, alongside the greyscale image to be colourised, to improve the fidelity of the colourisation process. To this end, we propose a deep network that takes two inputs (the greyscale image and the corresponding encoded text description) and predicts the relevant colour components. In addition, we segment each object in the image and colourise it using its individual description, so that object-specific attributes are incorporated into the colourisation process. A fusion model then merges all the colourised objects (segments) to generate the final colourised image. Since the textual descriptions contain colour information about the objects in the image, the text encoding helps improve the overall quality of the predicted colours. The proposed method outperforms existing colourisation techniques on the LPIPS, PSNR and SSIM metrics.
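The abstract outlines a two-branch design: an image encoder for the greyscale input and a text condition injected before decoding. Below is a minimal PyTorch sketch of that idea; the module names, layer sizes, and the assumption of a 768-dimensional sentence embedding (e.g., a pooled BERT output) are illustrative guesses, not the authors' published architecture.

```python
# Minimal sketch of a text-conditioned colourisation network as described in
# the abstract: a greyscale image (L channel) plus an encoded text description
# are fused to predict the two chrominance (ab) channels.
# Layer sizes and the 768-d text embedding are illustrative assumptions.
import torch
import torch.nn as nn

class TextConditionedColorizer(nn.Module):
    def __init__(self, text_dim: int = 768, feat_dim: int = 256):
        super().__init__()
        # Convolutional encoder for the greyscale input (1 x H x W).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Project the sentence embedding so it can be broadcast over the
        # spatial feature map and concatenated channel-wise.
        self.text_proj = nn.Linear(text_dim, feat_dim)
        # Decoder upsamples the fused features back to full resolution and
        # predicts the two colour (ab) channels, normalised to [-1, 1].
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * feat_dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 2, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, grey: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        feats = self.image_encoder(grey)              # (B, C, h, w)
        txt = self.text_proj(text_emb)                # (B, C)
        txt = txt[:, :, None, None].expand_as(feats)  # broadcast spatially
        fused = torch.cat([feats, txt], dim=1)        # channel-wise fusion
        return self.decoder(fused)                    # predicted ab channels

# Usage: per the abstract, each segmented object would be colourised with its
# own description and the outputs merged by a fusion model (not shown here).
model = TextConditionedColorizer()
ab = model(torch.randn(1, 1, 256, 256), torch.randn(1, 768))
print(ab.shape)  # torch.Size([1, 2, 256, 256])
```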
*Figures 1–10: see the online version of the article (https://doi.org/10.1007/s11760-024-03650-y); captions are not reproduced here.*
Data availability
Data will be made available upon reasonable request.
Code availability
The code will be made available upon reasonable request.
Author information
Authors and Affiliations
Contributions
Subhankar Ghosh wrote the main manuscript text. Saumik Bhattacharya helped with the logical understanding and contributed to writing the manuscript. Prasun Roy, Umapada Pal, and Michael Blumenstein reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval and consent to participate
Not applicable
Consent for publication
Yes
Materials availability
Materials will be made available upon reasonable request.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ghosh, S., Bhattacharya, S., Roy, P. et al. MMC: Multi-modal colorization of images using textual description. SIViP 19, 107 (2025). https://doi.org/10.1007/s11760-024-03650-y
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11760-024-03650-y