ABSTRACT
The automatic attribution of semantic labels to unlabeled or weakly labeled images has received considerable attention but, given the complexity of the problem, remains a hard research topic. Here we propose a unified classification framework which mixes textual and visual information in a seamless manner. Unlike most recent work, we use computer vision techniques as inspiration to process textual information. To do so, we consider two types of complementary tag similarities, computed respectively from a conceptual hierarchy and from data collected from a photo sharing platform. Visual content is processed using recent techniques for bag-of-visual-words feature generation. A central contribution of our work is to carry out the coding step of the general bag-of-words framework using such similarities and to aggregate the resulting tag-codes by max-pooling to obtain a single representative vector (signature). Final image annotations are obtained via late fusion, where the three modalities (two text-based and one visual-based) are merged during the classification step. Experimental results on the Pascal VOC 2007 and MIR Flickr datasets show an improvement over state-of-the-art methods while significantly decreasing the computational complexity of the learning system.
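The coding, pooling, and fusion steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy codebook, the similarity values, and the weighted-average fusion rule are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical tag-to-codeword similarities (illustrative values only).
# In the paper these would come from a conceptual hierarchy or from
# statistics collected on a photo sharing platform.
TOY_SIMILARITY = {
    ("dog", "animal"): 0.9, ("dog", "vehicle"): 0.1,
    ("car", "animal"): 0.1, ("car", "vehicle"): 0.95,
}


def toy_similarity(tag: str, codeword: str) -> float:
    """Look up a similarity; unknown pairs default to 0.0 (assumption)."""
    return TOY_SIMILARITY.get((tag, codeword), 0.0)


def tag_signature(
    image_tags: Sequence[str],
    codebook: Sequence[str],
    similarity: Callable[[str, str], float],
) -> List[float]:
    """Code each tag against the codebook, then max-pool over all tags."""
    signature = [0.0] * len(codebook)
    for tag in image_tags:
        # Coding step: one similarity value per codeword for this tag.
        code = [similarity(tag, word) for word in codebook]
        # Max-pooling: keep the strongest response per codeword.
        signature = [max(s, c) for s, c in zip(signature, code)]
    return signature


def late_fusion(
    scores: Sequence[float], weights: Optional[Sequence[float]] = None
) -> float:
    """Merge per-modality classifier scores by a weighted average.

    The weighted average is an assumed fusion rule, chosen for simplicity.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


codebook = ["animal", "vehicle"]
signature = tag_signature(["dog", "car"], codebook, toy_similarity)

# Two text-based scores and one visual score for a single concept.
fused = late_fusion([0.8, 0.6, 0.4])
```

Here `signature` has one entry per codeword, so its length depends on the codebook size and not on the number of tags attached to the image; this is what lets images with different tag counts share one fixed-length representation.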
Index Terms
- Multimodal feature generation framework for semantic image classification