DOI: 10.1145/3607541.3616818
Research Article

Nonword-to-Image Generation Considering Perceptual Association of Phonetically Similar Words

Published: 29 October 2023

ABSTRACT

Text-to-Image (T2I) generation has long been a popular field of multimedia processing. Recent advances in large-scale vision and language pretraining have produced models capable of very high-quality T2I generation. However, these models are reported to generate unexpected images when users input words that have no definition in a language (nonwords), including coined words and pseudo-words. To make the behavior of T2I generation models on nonwords more intuitive, we propose a method that considers the phonetic information of text inputs. Phonetic similarity is adopted so that the images generated from a nonword contain the concepts of its phonetically similar words. This design is based on the psycholinguistic finding that humans also associate a nonword with its phonetically similar words when they perceive its sound. Our evaluations confirm that the images generated by the proposed method agree better with both phonetic relationships and human expectations than those of a conventional T2I generation model. A cross-lingual comparison of the images generated for a nonword highlights differences in language-specific nonword-imagery correspondences. These results provide insight into the usefulness of the proposed method in applications such as brand naming and language learning.
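The core idea of relating a nonword to its phonetically similar vocabulary words can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy IPA lexicon and uses a plain normalized Levenshtein distance over phoneme strings as the similarity measure; the retrieved neighbors would then supply the concepts conditioning image generation.

```python
# Hedged sketch (not the paper's actual method): rank vocabulary words by
# phonetic similarity to a nonword, using normalized edit distance over
# IPA phoneme strings.

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance with a rolling row
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def phonetic_similarity(x, y):
    # 1.0 for identical phoneme strings, 0.0 for maximally different ones
    return 1.0 - edit_distance(x, y) / max(len(x), len(y))

# toy pronunciation lexicon; a real system would use a full IPA dictionary
lexicon = {"cat": "kæt", "hat": "hæt", "dog": "dɔɡ"}

def closest_words(nonword_ipa, k=2):
    # return the k vocabulary words whose pronunciations are closest
    ranked = sorted(lexicon, key=lambda w: -phonetic_similarity(nonword_ipa, lexicon[w]))
    return ranked[:k]

print(closest_words("kæp"))  # the nonword /kæp/ retrieves "cat" and "hat"
```

In the paper's setting, such phonetic neighbors would be blended into the conditioning signal so that the generated image reflects the concepts of the retrieved words rather than an arbitrary tokenization of the nonword.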


Published in: McGE '23: Proceedings of the 1st International Workshop on Multimedia Content Generation and Evaluation: New Methods and Practice
October 2023, 151 pages
ISBN: 9798400702785
DOI: 10.1145/3607541
General Chairs: Cheng Jin, Liang He, Mingli Song, Rui Wang

Copyright © 2023 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
