Nonword-to-Image Generation Considering Perceptual Association of Phonetically Similar Words

ABSTRACT
Text-to-Image (T2I) generation has long been a popular area of multimedia processing. Recent advances in large-scale vision-and-language pretraining have produced models capable of very high-quality T2I generation. However, these models are reported to generate unexpected images when users input words that have no definition in a language (nonwords), including coined words and pseudo-words. To make the behavior of T2I generation models on nonword inputs more intuitive, we propose a method that considers the phonetic information of text inputs. Phonetic similarity is adopted so that the images generated from a nonword contain the concepts of its phonetically similar words. This design is based on the psycholinguistic finding that humans likewise associate a nonword with phonetically similar words when they perceive its sound. Our evaluations confirm that the images generated by the proposed method agree better with both phonetic relationships and human expectations than those of a conventional T2I generation model. A cross-lingual comparison of the images generated for a nonword highlights language-specific differences in nonword-imagery correspondence. These results suggest the usefulness of the proposed method for applications such as brand naming and language learning.
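The core idea described above — relating a nonword to the known words it sounds most like — can be sketched minimally. The snippet below is an illustrative toy, not the paper's implementation: the tiny pronunciation lexicon, the IPA-like phoneme strings, and the character-level string-ratio similarity are all assumptions chosen for demonstration; a real system would use a grapheme-to-phoneme model, a full dictionary, and a learned phonetic embedding.

```python
from difflib import SequenceMatcher

# Toy pronunciation lexicon mapping words to IPA-like phoneme strings.
# (Hypothetical entries for illustration only.)
LEXICON = {
    "cat": "kæt",
    "dog": "dɔɡ",
    "ship": "ʃɪp",
    "sheep": "ʃiːp",
}

def phonetic_similarity(a: str, b: str) -> float:
    """Similarity of two phoneme strings, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def nearest_word(nonword_phonemes: str) -> str:
    """Return the lexicon word whose pronunciation is closest to the nonword."""
    return max(LEXICON, key=lambda w: phonetic_similarity(LEXICON[w], nonword_phonemes))

# A nonword pronounced like /ʃæp/ is phonetically closer to "ship"
# than to "cat", "dog", or "sheep" under this toy metric.
print(nearest_word("ʃæp"))
```

In the proposed method, the concepts of such phonetically nearby words would then guide the image generation for the nonword, so that the output reflects what a human hearing the nonword might picture.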