
Estimating the visual variety of concepts by referring to Web popularity

  • Published in: Multimedia Tools and Applications

Abstract

Increasingly sophisticated methods for data processing demand knowledge of the semantic relationship between language and vision. Emerging fields such as Explainable AI call for stepping away from black-box approaches and for understanding how the underlying semantics of datasets and AI models work. Advances in psycholinguistics suggest that language perception is related to how language production and sentence creation work. In this paper, a method to measure the visual variety of concepts is proposed in order to quantify the semantic gap between vision and language. For this, an image corpus is recomposed using ImageNet and Web data. Web-based metrics measuring the popularity of sub-concepts are used as weights to ensure that the image composition of the dataset is as natural as possible. Using clustering methods, a score describing the visual variety of each concept is determined. A crowd-sourced survey is conducted to create ground-truth values applicable to this research. The evaluations show that the recomposed image corpus substantially improves the measured variety compared to previous datasets. The results are promising and provide additional knowledge about the relationship between language and vision.
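The following is a minimal sketch of the idea described above, not the authors' implementation: images of a concept's sub-concepts are sampled in proportion to their Web popularity, the pooled image features are clustered, and the cluster distribution is summarized as a visual-variety score. All names (the popularity counts, the toy feature vectors, the entropy-based score) are hypothetical stand-ins for illustration.

```python
# Sketch only: popularity-weighted corpus composition and a clustering-based
# visual-variety score, assuming precomputed image feature vectors.
import numpy as np
from sklearn.cluster import MeanShift

def compose_weighted_corpus(features_by_subconcept, popularity, n_samples, seed=None):
    """Sample feature vectors per sub-concept in proportion to Web popularity."""
    rng = np.random.default_rng(seed)
    total = sum(popularity.values())
    pooled = []
    for name, feats in features_by_subconcept.items():
        feats = np.asarray(feats)
        k = max(1, round(n_samples * popularity[name] / total))
        idx = rng.choice(len(feats), size=min(k, len(feats)), replace=False)
        pooled.append(feats[idx])
    return np.vstack(pooled)

def visual_variety_score(features):
    """Cluster the pooled features; a broader cluster distribution means more variety."""
    labels = MeanShift().fit_predict(features)
    counts = np.bincount(labels)
    probs = counts / counts.sum()
    # Entropy of the cluster distribution as a simple variety measure.
    return float(-(probs * np.log(probs)).sum())

# Toy usage with random vectors standing in for image descriptors.
rng = np.random.default_rng(0)
features_by_subconcept = {
    "poodle": rng.normal(0, 1, (200, 16)),
    "bulldog": rng.normal(3, 1, (200, 16)),
}
popularity = {"poodle": 900_000, "bulldog": 300_000}  # e.g. Web hit counts
corpus = compose_weighted_corpus(features_by_subconcept, popularity, n_samples=300, seed=1)
print("visual variety score:", visual_variety_score(corpus))
```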



Acknowledgements

We are grateful to Dr. Kazuaki Nakamura of Osaka University, who provided expertise that greatly assisted this research.

Author information


Corresponding author

Correspondence to Marc A. Kastner.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Parts of this research were supported by the Ministry of Education, Culture, Sports, Science and Technology, Grant-in-Aid for Scientific Research, and a joint research project with NII, Japan.


About this article


Cite this article

Kastner, M.A., Ide, I., Kawanishi, Y. et al. Estimating the visual variety of concepts by referring to Web popularity. Multimed Tools Appl 78, 9463–9488 (2019). https://doi.org/10.1007/s11042-018-6528-x


  • DOI: https://doi.org/10.1007/s11042-018-6528-x
