Uni- and multimodal methods for single- and multi-label recognition

Ishikawa, Satoru; Laaksonen, Jorma

doi:10.1007/s11042-017-4733-7

Published: 09 May 2017

Volume 76, pages 22405–22423, (2017)

Multimedia Tools and Applications

Abstract

The multimodal approach is becoming an increasingly attractive and common method in multimedia information retrieval and description. It often yields better content recognition results than unimodal methods alone, but, depending on the data used, this is not always the case. Most current multimodal media content classification methods still depend on unimodal recognition results. For both uni- and multimodal approaches, it is important to choose the best features and classification models. In addition, when unimodal models are used, the final multimodal recognition results still need to be produced with an appropriate late fusion technique. In this article, we study several multi- and unimodal recognition methods, their features, and techniques for combining them, in the application setting of concept detection in image–text data. We consider both single- and multi-label recognition tasks. As image features, we use GoogLeNet deep convolutional neural network (DCNN) activation features and semantic concept, or classeme, vectors. As text features, we use simple binary tag vectors and word2vec embedding vectors. The Multimodal Deep Boltzmann Machine (DBM) is used as the multimodal model, and the Support Vector Machine (SVM) with both linear and non-linear radial basis function (RBF) kernels as the unimodal one. The experiments are performed with the MIRFLICKR-1M and NUS-WIDE datasets. The results show that the two models perform equally well in the single-label recognition task on the former dataset, while the Multimodal DBM produces clearly better results in the multi-label task on the latter. Compared with results in the literature, we exceed the state of the art on both datasets, mostly owing to the use of DCNN features and the semantic concept vectors based on them.
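To make two of the ingredients named in the abstract concrete, the following minimal sketch builds a binary tag vector and combines per-concept scores from an image classifier and a text classifier by late fusion. The abstract does not specify which fusion rule the authors use, so the weighted average here is an assumption chosen for illustration. The function names, the toy vocabulary, the example scores, and the 0.5 decision threshold are all hypothetical.

```python
import numpy as np


def binary_tag_vector(tags, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in `tags`."""
    present = set(tags)
    return np.array([1.0 if word in present else 0.0 for word in vocabulary])


def late_fusion(image_scores, text_scores, image_weight=0.5):
    """Weighted average of per-concept scores from two unimodal classifiers.

    This is one simple late fusion rule (an assumption for this sketch),
    not necessarily the rule used in the article.
    """
    image_scores = np.asarray(image_scores, dtype=float)
    text_scores = np.asarray(text_scores, dtype=float)
    return image_weight * image_scores + (1.0 - image_weight) * text_scores


# Hypothetical tag vocabulary and the tags attached to one image.
vocabulary = ["sky", "dog", "beach", "night"]
tag_vec = binary_tag_vector(["beach", "sky"], vocabulary)

# Hypothetical per-concept scores in [0, 1] from the two unimodal models.
image_scores = [0.9, 0.2, 0.4]
text_scores = [0.7, 0.1, 0.8]

fused = late_fusion(image_scores, text_scores, image_weight=0.5)

# Multi-label decision: every concept whose fused score reaches the threshold.
predicted = fused >= 0.5
```

In a multi-label task each concept is thresholded independently, as in the last line; in a single-label task one would instead pick the concept with the highest fused score.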

Notes

http://mulan.sourceforge.net

Acknowledgment

This work has been funded by the grant 251170 of the Academy of Finland and the Data to Intelligence D2I DIGILE SHOK program. The calculations were performed using computer resources within the Aalto University School of Science “Science-IT” project.

Author information

Authors and Affiliations

Department of Computer Science, Aalto University School of Science, P.O.Box 15400, FI-00076, Aalto, Finland
Satoru Ishikawa & Jorma Laaksonen

Corresponding author

Correspondence to Satoru Ishikawa.

About this article

Cite this article

Ishikawa, S., Laaksonen, J. Uni- and multimodal methods for single- and multi-label recognition. Multimed Tools Appl 76, 22405–22423 (2017). https://doi.org/10.1007/s11042-017-4733-7

Received: 14 October 2016
Revised: 09 March 2017
Accepted: 18 April 2017
Published: 09 May 2017
Issue Date: November 2017
DOI: https://doi.org/10.1007/s11042-017-4733-7
