Multimodal feature generation framework for semantic image classification

Authors:
Amel Znaidia (CEA, LIST, Vision & Content Engineering Laboratory, Gif-sur-Yvette, France)
Aymen Shabou (CEA, LIST, Vision & Content Engineering Laboratory, Gif-sur-Yvette, France)
Adrian Popescu (CEA, LIST, Vision & Content Engineering Laboratory, Gif-sur-Yvette, France)
Hervé le Borgne (CEA, LIST, Vision & Content Engineering Laboratory, Gif-sur-Yvette, France)
Céline Hudelot (Applied Mathematics & Systems Laboratory, Antony, France)

ICMR '12: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, June 2012, Article No.: 38, Pages 1–8
https://doi.org/10.1145/2324796.2324842

Published: 5 June 2012

ABSTRACT

The automatic attribution of semantic labels to unlabeled or weakly labeled images has received considerable attention but, given the complexity of the problem, remains a hard research topic. Here we propose a unified classification framework that combines textual and visual information seamlessly. Unlike most recent work, we use computer vision techniques as inspiration for processing textual information. To do so, we consider two complementary types of tag similarity, computed respectively from a conceptual hierarchy and from data collected from a photo-sharing platform. Visual content is processed using recent techniques for bag-of-visual-words feature generation. A central contribution of our work is to perform the coding step of the general bag-of-words framework with these similarities and to aggregate the resulting tag codes by max-pooling into a single representative vector (signature). Final image annotations are obtained via late fusion, in which the three modalities (two text-based and one visual) are merged during the classification step. Experimental results on the Pascal VOC 2007 and MIR Flickr datasets show an improvement over state-of-the-art methods, while significantly decreasing the computational complexity of the learning system.
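
The tag-coding and pooling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the similarity table, the codebook, the function names, and the averaging rule used for late fusion are all hypothetical stand-ins chosen for illustration. The abstract only states that tag similarities come from a conceptual hierarchy and from photo-sharing data, and that the three modalities are merged at classification time.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy similarity table. In the paper the similarities are derived
# from a conceptual hierarchy and from photo-sharing data; neither is reproduced here.
TOY_SIMILARITY: Dict[Tuple[str, str], float] = {
    ("puppy", "dog"): 0.9,
    ("puppy", "cat"): 0.4,
    ("beach", "sea"): 0.8,
}


def similarity(tag: str, codeword: str) -> float:
    """Symmetric lookup: identical words score 1.0, unknown pairs score 0.0."""
    if tag == codeword:
        return 1.0
    return TOY_SIMILARITY.get(
        (tag, codeword), TOY_SIMILARITY.get((codeword, tag), 0.0)
    )


def code_tag(tag: str, codebook: Sequence[str]) -> List[float]:
    """Coding step: describe one tag by its similarity to every codeword."""
    return [similarity(tag, word) for word in codebook]


def tag_signature(tags: Sequence[str], codebook: Sequence[str]) -> List[float]:
    """Max-pool the tag codes of one image into a single signature vector."""
    codes = [code_tag(tag, codebook) for tag in tags]
    if not codes:
        # An image without tags gets an all-zero signature.
        return [0.0] * len(codebook)
    return [max(column) for column in zip(*codes)]


def late_fusion(scores: Sequence[float]) -> float:
    """Assumed fusion rule: average the per-modality classifier scores."""
    return sum(scores) / len(scores)


codebook = ["dog", "cat", "sea"]
signature = tag_signature(["puppy", "beach"], codebook)
fused_score = late_fusion([0.25, 0.5, 0.75])
```

Here each image tag plays the role that a local descriptor plays in a visual bag-of-words pipeline, and the codebook plays the role of the visual dictionary. Max-pooling keeps, for each codeword, the strongest response among the image's tags, so the signature length depends only on the codebook size and not on the number of tags.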

Index Terms

Computing methodologies
  Artificial intelligence
    Computer vision
      Computer vision problems
      Computer vision tasks
        Scene understanding

Published in
ICMR '12: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval
June 2012
489 pages
ISBN: 9781450313292
DOI: 10.1145/2324796
Conference Chairs: Horace H. S. Ip (City University of Hong Kong), Yong Rui (Microsoft, China)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
bags of words
classification
image annotation
multimedia fusion
Qualifiers
- research-article
Acceptance Rates
ICMR '12 Paper Acceptance Rate: 50 of 145 submissions, 34%
Overall Acceptance Rate: 254 of 830 submissions, 31%