skip to main content
10.1145/3539618.3591979acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
short-paper

Evaluating Cross-modal Generative Models Using Retrieval Task

Published: 18 July 2023 Publication History

Abstract

Generative models have taken the world by storm -- image generative models such as Stable Diffusion and DALL-E generate photo-realistic images, whereas image captioning models such as BLIP, GIT, ClipCap, and ViT-GPT2 generate descriptive and informative captions. While it may be true that these models produce remarkable results, their systematic evaluation is missing, making it hard to advance the research further. Currently, heuristic metrics such as the Inception Score and the Fréchet Inception Distance are the most prevalent metrics for the image generation task, while BLEU, CIDEr, SPICE, METEOR, BERTScore, and CLIPScore are common for the image captioning task. Unfortunately, these are poorly interpretable and are not based on the solid user-behavior model that the Information Retrieval community has worked towards. In this paper, we present a novel cross-modal retrieval framework to evaluate the effectiveness of cross-modal (image-to-text and text-to-image) generative models using reference text and images. We propose the use of scoring models based on user-behavior, such as Normalized Discounted Cumulative Gain (nDCG'@K ) and Rank-Biased Precision (RBP'@K) adjusted for incomplete judgments. Experiments using ECCV Caption and Flickr8k-EXPERTS benchmark datasets demonstrate the effectiveness of various image captioning and image generation models for the proposed retrieval task. Results also indicate that the nDCG'@K and RBP'@K scores are consistent with heuristics-driven metrics, excluding CLIPScore, in model selection.

Supplemental Material

MP4 File
Presentation video

References

[1]
2022. BLIP Model. https://github.com/salesforce/BLIP.
[2]
2022. CLIPCAP Model. https://github.com/rmokady/CLIP_prefix_caption.
[3]
2022. GIT Model. https://huggingface.co/microsoft/git-base-coco.
[4]
2022. Stable Diffusion Model. https://huggingface.co/stabilityai/stable-diffusion-2--1.
[5]
2022. ViT-GPT2 Model. https://huggingface.co/nlpconnect/vit-gpt2-image-captioning.
[6]
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic Propositional Image Caption Evaluation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part V (Lecture Notes in Computer Science, Vol. 9909), Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.). Springer, 382--398. https://doi.org/10.1007/978--3--319--46454--1_24
[7]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare R. Voss (Eds.). Association for Computational Linguistics, 65--72. https://aclanthology.org/W05-0909/
[8]
Shane Barratt and Rishi Sharma. 2018. A Note on the Inception Score. arXiv:1801.01973 [stat.ML]
[9]
Eyal Betzalel, Coby Penso, Aviv Navon, and Ethan Fetaya. 2022. A Study on the Evaluation of Generative Models. CoRR abs/2206.10935 (2022). https://doi.org/10.48550/arXiv.2206.10935 arXiv:2206.10935
[10]
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. 2021. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. (2021), 3558--3568. https://doi.org/10.1109/CVPR46437.2021.00356
[11]
Sanghyuk Chun, Wonjae Kim, Song Park, Minsuk Chang, and Seong Joon Oh. 2022. ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part VIII (Lecture Notes in Computer Science, Vol. 13668), Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (Eds.). Springer, 1--19. https://doi.org/10.1007/978--3-031--20074--8_1
[12]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3--7, 2021. OpenReview.net. https://openreview.net/forum?id=YicbFdNTTy
[13]
Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. 2022. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. In Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XV (Lecture Notes in Computer Science, Vol. 13675), Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (Eds.). Springer, 89--106. https://doi.org/10.1007/978--3-031--19784-0_6
[14]
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8--13 2014, Montreal, Quebec, Canada, Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger (Eds.). 2672--2680. https://proceedings.neurips.cc/paper/2014/hash/5ca3e9b122f61f8f06494c97b1afccf3-Abstract.html
[15]
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. 2021. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7--11 November, 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, 7514--7528. https://doi.org/10.18653/v1/2021.emnlp-main.595
[16]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4--9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 6626--6637. https://proceedings.neurips.cc/paper/2017/hash/8a1d694707eb0fefe65871369074926d-Abstract.html
[17]
Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. J. Artif. Intell. Res. 47 (2013), 853--899. https://doi.org/10.1613/jair.3994
[18]
Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated Gain-Based Evaluation of IR Techniques. ACM Trans. Inf. Syst. 20, 4 (oct 2002), 422--446. https://doi.org/10.1145/582415.582418
[19]
Saehoon Kim, Sanghun Cho, Chiheon Kim, Doyup Lee, and Woonhyuk Baek. 2021. minDALL-E on Conceptual Captions. https://github.com/kakaobrain/minDALL-E.
[20]
Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 5583--5594. http://proceedings.mlr.press/v139/kim21k.html
[21]
Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14--16, 2014, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1312.6114
[22]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3--6, 2012, Lake Tahoe, Nevada, United States, Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger (Eds.). 1106--1114. https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
[23]
Hang Li, Jindong Gu, Rajat Koner, Sahand Sharifzadeh, and Volker Tresp. 2022. Do DALL-E and Flamingo Understand Each Other? CoRR abs/2212.12249 (2022). https://doi.org/10.48550/arXiv.2212.12249 arXiv:2212.12249
[24]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning, ICML 2022, 17--23 July 2022, Baltimore, Maryland, USA (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (Eds.). PMLR, 12888--12900. https://proceedings.mlr.press/v162/li22n.html
[25]
Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Annual Meeting of the Association for Computational Linguistics.
[26]
Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6--12, 2014, Proceedings, Part V (Lecture Notes in Computer Science, Vol. 8693), David J. Fleet, Tomás Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer, 740--755. https://doi.org/10.1007/978--3--319--10602--1_48
[27]
Oscar Mañas, Pau Rodríguez, Saba Ahmadi, Aida Nematzadeh, Yash Goyal, and Aishwarya Agrawal. 2022. MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting. CoRR abs/2210.07179 (2022). https://doi.org/10.48550/arXiv.2210.07179 arXiv:2210.07179
[28]
Alistair Moffat and Justin Zobel. 2008. Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst. 27, 1 (2008), 2:1--2:27. https://doi.org/10.1145/1416950.1416952
[29]
Ron Mokady, Amir Hertz, and Amit H. Bermano. 2021. ClipCap: CLIP Prefix for Image Captioning. CoRR abs/2111.09734 (2021). arXiv:2111.09734 https://arxiv.org/abs/2111.09734
[30]
Kevin Musgrave, Serge J. Belongie, and Ser-Nam Lim. 2020. A Metric Learning Reality Check. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXV (Lecture Notes in Computer Science, Vol. 12370), Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer, 681--699. https://doi.org/10.1007/978--3-030--58595--2_41
[31]
Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In International Conference on Machine Learning, ICML 2022, 17--23 July 2022, Baltimore, Maryland, USA (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (Eds.). PMLR, 16784--16804. https://proceedings.mlr.press/v162/nichol22a.html
[32]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6--12, 2002, Philadelphia, PA, USA. ACL, 311--318. https://doi.org/10.3115/1073083.1073135
[33]
David Picard. 2021. Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision. CoRR abs/2109.08203 (2021). arXiv:2109.08203 https://arxiv.org/abs/2109.08203
[34]
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 8748--8763. http://proceedings.mlr.press/v139/radford21a.html
[35]
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.
[36]
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. CoRR abs/2204.06125 (2022). https://doi.org/10.48550/arXiv.2204.06125 arXiv:2204.06125
[37]
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18--24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 8821--8831. http://proceedings.mlr.press/v139/ramesh21a.html
[38]
Danilo Jimenez Rezende and Shakir Mohamed. 2015. Variational Inference with Normalizing Flows. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6--11 July 2015 (JMLR Workshop and Conference Proceedings, Vol. 37), Francis R. Bach and David M. Blei (Eds.). JMLR.org, 1530--1538. http://proceedings.mlr.press/v37/rezende15.html
[39]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18--24, 2022. IEEE, 10674--10685. https://doi.org/10.1109/CVPR52688.2022.01042
[40]
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. CoRR abs/2205.11487 (2022). https://doi.org/10.48550/arXiv.2205.11487 arXiv:2205.11487
[41]
Tetsuya Sakai. 2007. Alternatives to Bpref. In SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23--27, 2007, Wessel Kraaij, Arjen P. de Vries, Charles L. A. Clarke, Norbert Fuhr, and Noriko Kando (Eds.). ACM, 71--78. https://doi.org/10.1145/1277741.1277756
[42]
Tetsuya Sakai. 2021. On Fuhr's Guideline for IR Evaluation. SIGIR Forum 54, 1, Article 12 (feb 2021), 8 pages. https://doi.org/10.1145/3451964.3451976
[43]
Tetsuya Sakai and Noriko Kando. 2008. On information retrieval metrics designed for evaluation with incomplete relevance assessments. Inf. Retr. 11, 5 (2008), 447--470. https://doi.org/10.1007/s10791-008--9059--7
[44]
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. CoRR abs/2210.08402 (2022). https://doi.org/10.48550/arXiv.2210.08402 arXiv:2210.08402
[45]
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15--20, 2018, Volume 1: Long Papers, Iryna Gurevych and Yusuke Miyao (Eds.). Association for Computational Linguistics, 2556--2565. https://doi.org/10.18653/v1/P18--1238
[46]
Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal Few-Shot Learning with Frozen Language Models. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6--14, 2021, virtual, Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (Eds.). 200--212. https://proceedings.neurips.cc/paper/2021/hash/01b7575c38dac42f3cfb7d500438b875-Abstract.html
[47]
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7--12, 2015. IEEE Computer Society, 4566--4575. https://doi.org/10.1109/CVPR.2015.7299087
[48]
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. 2022. GIT: A Generative Image-to-text Transformer for Vision and Language. https://doi.org/10.48550/ARXIV.2205.14100
[49]
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. 2021. Florence: A New Foundation Model for Computer Vision. CoRR abs/2111.11432 (2021). arXiv:2111.11432 https://arxiv.org/abs/2111.11432
[50]
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26--30, 2020. OpenReview.net. https://openreview.net/forum?id=SkeHuCVFDr

Cited By

View all
  • (2024)Uncertainty in Visual Generative AIAlgorithms10.3390/a1704013617:4(136)Online publication date: 27-Mar-2024
  • (2024)Report on the 1st Workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024ACM SIGIR Forum10.1145/3722449.372246158:2(1-12)Online publication date: 1-Dec-2024
  • (2024)Visual-linguistic Cross-domain Feature Learning with Group Attention and Gamma-correct Gated Fusion for Extracting Commonsense KnowledgeProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3680820(4650-4659)Online publication date: 28-Oct-2024
  • Show More Cited By

Index Terms

  1. Evaluating Cross-modal Generative Models Using Retrieval Task

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2023
    3567 pages
    ISBN:9781450394086
    DOI:10.1145/3539618
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 18 July 2023

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. cross-modal generative modal
    2. cross-modal retrieval
    3. evaluation method

    Qualifiers

    • Short-paper

    Conference

    SIGIR '23
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)79
    • Downloads (Last 6 weeks)10
    Reflects downloads up to 05 Mar 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Uncertainty in Visual Generative AIAlgorithms10.3390/a1704013617:4(136)Online publication date: 27-Mar-2024
    • (2024)Report on the 1st Workshop on Large Language Model for Evaluation in Information Retrieval (LLM4Eval 2024) at SIGIR 2024ACM SIGIR Forum10.1145/3722449.372246158:2(1-12)Online publication date: 1-Dec-2024
    • (2024)Visual-linguistic Cross-domain Feature Learning with Group Attention and Gamma-correct Gated Fusion for Extracting Commonsense KnowledgeProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3680820(4650-4659)Online publication date: 28-Oct-2024
    • (2024)Image captioning by diffusion modelsEngineering Applications of Artificial Intelligence10.1016/j.engappai.2024.109288138:PAOnline publication date: 1-Dec-2024

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media