Nonword-to-Image Generation Considering Perceptual Association of Phonetically Similar Words

ABSTRACT
Text-to-Image (T2I) generation has long been a popular area of multimedia processing. Recent advances in large-scale vision-and-language pretraining have produced models capable of very high-quality T2I generation. However, these models are reported to generate unexpected images when users input words that have no definition in a language (nonwords), including coined words and pseudo-words. To make the behavior of T2I generation models on nonword inputs more intuitive, we propose a method that considers the phonetic information of text inputs. Phonetic similarity is adopted so that the images generated from a nonword contain the concepts of its phonetically similar words. This design is based on the psycholinguistic finding that humans likewise associate a nonword with phonetically similar words when they perceive its sound. Our evaluations confirm that the images generated by the proposed method agree better with both phonetic relationships and human expectations than those of a conventional T2I generation model. A cross-lingual comparison of the images generated for a nonword highlights language-specific differences in nonword-imagery correspondence. These results suggest the usefulness of the proposed method for applications such as brand naming and language learning.
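The core idea described above — relating a nonword to the known words it sounds most like — can be sketched minimally. The snippet below is an illustrative toy, not the paper's implementation: the tiny pronunciation lexicon, the IPA-like phoneme strings, and the character-level string-ratio similarity are all assumptions chosen for demonstration; a real system would use a grapheme-to-phoneme model, a full dictionary, and a learned phonetic embedding.

```python
from difflib import SequenceMatcher

# Toy pronunciation lexicon mapping words to IPA-like phoneme strings.
# (Hypothetical entries for illustration only.)
LEXICON = {
    "cat": "kæt",
    "dog": "dɔɡ",
    "ship": "ʃɪp",
    "sheep": "ʃiːp",
}

def phonetic_similarity(a: str, b: str) -> float:
    """Similarity of two phoneme strings, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def nearest_word(nonword_phonemes: str) -> str:
    """Return the lexicon word whose pronunciation is closest to the nonword."""
    return max(LEXICON, key=lambda w: phonetic_similarity(LEXICON[w], nonword_phonemes))

# A nonword pronounced like /ʃæp/ is phonetically closer to "ship"
# than to "cat", "dog", or "sheep" under this toy metric.
print(nearest_word("ʃæp"))
```

In the proposed method, the concepts of such phonetically nearby words would then guide the image generation for the nonword, so that the output reflects what a human hearing the nonword might picture.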