
MMC: Multi-modal colorization of images using textual description

  • Original Paper
  • Published:
Signal, Image and Video Processing

Abstract

Handling multiple objects with different colours is a significant challenge for image colourisation techniques; for complex real-world scenes, existing colourisation algorithms often fail to maintain colour consistency. In this work, we integrate textual descriptions as an auxiliary condition, alongside the greyscale image to be colourised, to improve the fidelity of the colourisation process. To this end, we propose a deep network that takes two inputs (the greyscale image and the corresponding encoded text description) and predicts the relevant colour components. In addition, we detect each object in the image and colourise it with its individual description, incorporating object-specific attributes into the colourisation process. A fusion model then merges all the image objects (segments) to generate the final colourised image. Because the textual descriptions contain colour information about the objects in the image, the text encoding helps improve the overall quality of the predicted colours. The proposed method outperforms existing colourisation techniques on the LPIPS, PSNR and SSIM metrics.
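The per-object pipeline described above (segment the image, colourise each segment conditioned on its own text description, then fuse the segments) can be sketched as follows. This is a minimal, purely illustrative sketch: the functions `encode_text`, `colorize_segment`, and `fuse` are toy stand-ins for the learned components, not the authors' actual network.

```python
import numpy as np

def encode_text(description, dim=16):
    # Stand-in for a learned text encoder: hash words of the
    # description into a fixed-length unit vector.
    vec = np.zeros(dim)
    for word in description.split():
        vec[hash(word) % dim] += 1.0
    return vec / max(np.linalg.norm(vec), 1e-8)

def colorize_segment(gray, mask, text_vec):
    # Stand-in for the two-input colourisation network: predict the
    # two chrominance (ab) channels for one object from its greyscale
    # pixels and its encoded description.
    h, w = gray.shape
    ab = np.zeros((h, w, 2))
    ab[mask, 0] = text_vec[0] * gray[mask]  # toy "a" channel
    ab[mask, 1] = text_vec[1] * gray[mask]  # toy "b" channel
    return ab

def fuse(segments_ab):
    # Stand-in for the fusion model: merge the non-overlapping
    # per-object predictions by summation.
    return np.sum(segments_ab, axis=0)

# Usage: a 4x4 greyscale image with two object masks, each paired
# with its own (hypothetical) textual description.
gray = np.linspace(0.0, 1.0, 16).reshape(4, 4)
masks = [np.zeros((4, 4), bool), np.zeros((4, 4), bool)]
masks[0][:2] = True   # object 1 occupies the top half
masks[1][2:] = True   # object 2 occupies the bottom half
texts = ["a red ball", "green grass"]

segments = [colorize_segment(gray, m, encode_text(t))
            for m, t in zip(masks, texts)]
ab = fuse(segments)                 # fused chrominance map
image_lab = np.dstack([gray, ab])   # final Lab image: L plus predicted ab
print(image_lab.shape)              # (4, 4, 3)
```

The sketch keeps the key structural idea of the method: colour (ab) is predicted per object under a text condition, while the input greyscale image supplies the luminance (L) channel of the final Lab output.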



Data availability

Data will be made available upon reasonable request.

Code availability

The code will be made available upon reasonable request.


Author information

Authors and Affiliations

Authors

Contributions

Subhankar Ghosh wrote the main manuscript text. Saumik Bhattacharya helped with the logical understanding and contributed to writing the manuscript. Prasun Roy, Umapada Pal, and Michael Blumenstein reviewed the manuscript.

Corresponding author

Correspondence to Subhankar Ghosh.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval and consent to participate

Not applicable

Consent for publication

Yes

Materials availability

Materials will be made available upon reasonable request.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ghosh, S., Bhattacharya, S., Roy, P. et al. MMC: Multi-modal colorization of images using textual description. SIViP 19, 107 (2025). https://doi.org/10.1007/s11760-024-03650-y

