A unified framework for multimodal retrieval
Introduction
The continuously increasing amount of multimedia content on the Internet has created a pressing need for searching various online multimedia databases. Traditional text-based retrieval techniques fail to meet the requirements of searching this massive media content; therefore, research has focused on content-based multimedia retrieval methods. Searching for results similar to a query requires the automatic extraction of low-level features from the media; in the case of an image, these would be color, texture, shape, etc. Thus, several content-based techniques have been developed in the past, each performing retrieval within a single modality, such as 3D objects [11], [19], images [1], [31], video [12], [23] or audio [2], [30].
However, users who search for content are interested in finding results semantically similar to a query, regardless of its modality. Toward this aim, Yang et al. [37] proposed a method for connecting semantically similar media of different modalities. To handle the case where different modalities carry the same semantics, the concept of the Multimedia Document (MMD) was introduced. An example of an MMD is presented in Fig. 1: it describes the physical entity of an airplane and consists of its 3D representation, a real image and a sound.
Recently, multimedia search engines have evolved to allow combinations of queries of different modalities. Multimodal search lets users enter multiple query types and retrieve multiple types of media simultaneously in the form of MMDs. One approach to multimodal search was introduced by the I-SEARCH framework [3]. I-SEARCH is a real-world application that enables retrieval of several types of media (3D objects, 2D images, sound, video and text), using as query any of these types or their combinations in the form of MMDs. I-SEARCH made a significant step towards content-based multimedia retrieval, in which users can search and retrieve media of any modality through a single unified retrieval framework rather than a specialized system for each separate modality. Moreover, users of I-SEARCH can enter multiple queries simultaneously and thus retrieve more relevant results. However, handling media in the form of MMDs is a highly complicated process, since successful modeling of the low-level feature associations among the different modalities is required in order to perform multimodal retrieval.
Dimensionality reduction methods for monomodal retrieval
In content-based retrieval methods, media are usually represented by low-level features in the form of high-dimensional descriptor vectors, to which a distance metric (most often Euclidean-based) is applied to calculate similarity. However, since in most cases the extracted high-dimensional descriptor vectors suffer from the curse of dimensionality [6], such metrics are inappropriate for efficient large-scale retrieval. Therefore, nonlinear dimensionality reduction methods based on Manifold Learning have been adopted.
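The kind of manifold-learning embedding this section refers to can be sketched with Laplacian Eigenmaps, one of the methods appearing in the reference list. This is a minimal illustrative sketch, not the paper's implementation: the binary k-NN edge weights and the symmetric normalized Laplacian are assumptions made here for simplicity.

```python
import numpy as np

def laplacian_eigenmaps(X, k=5, d=2):
    """Embed high-dimensional descriptors X (n x D) into d dimensions by
    preserving each point's k-nearest-neighbor graph (Laplacian Eigenmaps,
    here via the symmetric normalized Laplacian)."""
    n = X.shape[0]
    # Pairwise Euclidean distances between descriptor vectors.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Symmetric k-NN adjacency with binary edge weights (an assumption;
    # heat-kernel weights are another common choice).
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(dists[i])[1:k + 1]  # index 0 is the point itself
        W[i, neighbors] = 1.0
    W = np.maximum(W, W.T)
    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    inv_sqrt_deg = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(n) - inv_sqrt_deg[:, None] * W * inv_sqrt_deg[None, :]
    # Eigenvectors of the smallest nonzero eigenvalues give the embedding;
    # the first eigenvector (eigenvalue ~0) is trivial and is skipped.
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, 1:d + 1]
```

The low-dimensional coordinates returned this way preserve local neighborhoods of the original descriptor space, so Euclidean distances in the embedding become meaningful for retrieval.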
The proposed method
The proposed method is divided into the following two steps: (a) construction of the multimodal semantic space of MMDs and (b) multimodal search and retrieval.
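The retrieval step (b) can be illustrated with a hypothetical sketch: once MMDs have been embedded in a common multimodal semantic space (step a), a query is answered by ranking database MMDs by distance to the query's position in that space. The function name and the plain Euclidean ranking are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def retrieve(query_vec, db_vecs, top_k=5):
    """Rank database MMDs by Euclidean distance to the query in an
    (assumed, already-constructed) multimodal semantic space.

    query_vec -- embedding of the MMD query, shape (d,)
    db_vecs   -- embeddings of the database MMDs, shape (n, d)
    Returns the indices of the top_k nearest MMDs, nearest first.
    """
    dists = np.linalg.norm(db_vecs - query_vec[None, :], axis=1)
    return np.argsort(dists)[:top_k]
```

Because all modalities share the same semantic space, the same ranking routine serves any query modality or combination of modalities.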
Datasets
Experimental evaluation is performed on five real multimodal datasets, denoted by DS1, DS2, DS3, DS4 and DS5. Further details are provided in Table 1.
Conclusions
In this paper, a unified framework for multimodal content-based search and retrieval is presented, which supports internal and external MMD queries. Following an innovative weighting strategy, all monomodal heterogeneous similarities are combined into a global MMD similarity by considering (a) the different nature of the monomodal similarities and (b) the availability of modalities per MMD in the database. As a result, the local neighborhood of each media modality is preserved in the multimodal semantic space.
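The weighting idea described above can be sketched as follows. Both the per-modality weights and the treatment of missing modalities are illustrative assumptions, not the paper's exact strategy: here, similarities are simply averaged with weights over only the modalities that are actually available.

```python
def mmd_similarity(sims, weights):
    """Combine monomodal similarities into one global MMD similarity.

    sims    -- dict mapping modality -> similarity in [0, 1], or None when
               the modality is missing in one of the two MMDs being compared
    weights -- dict mapping modality -> weight reflecting the different
               nature of each monomodal similarity (assumed given)
    """
    # Consider only modalities present in both MMDs.
    available = {m: s for m, s in sims.items() if s is not None}
    if not available:
        return 0.0
    total_weight = sum(weights[m] for m in available)
    return sum(weights[m] * s for m, s in available.items()) / total_weight
```

Renormalizing over the available modalities keeps the global similarity comparable across MMDs that carry different subsets of modalities, which is the role the availability term (b) plays in the framework.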
Conflict of interest statement
None declared.
Acknowledgment
This work was supported by the EC project 3DVIVANT, GA-248420.
References (39)
- et al., Robust shape similarity retrieval based on contour segmentation polygonal multiresolution and elastic matching, Pattern Recognition (2005)
- et al., A scale free distribution of false positives for a large class of audio similarity measures, Pattern Recognition (2008)
- et al., Incremental Laplacian Eigenmaps by preserving adjacent information between data points, Pattern Recognition Letters (2009)
- et al., 3D object retrieval using the 3D shape impact descriptor, Pattern Recognition (2009)
- et al., Real-time traffic sign recognition from video by class-specific discriminative features, Pattern Recognition (2010)
- et al., Manifold-ranking based retrieval using k-regular nearest neighbor graph, Pattern Recognition (2012)
- et al., A robust digital audio watermarking based on statistics characteristics, Pattern Recognition (2009)
- et al., Trademark image retrieval using synthetic features for describing global shape and interior structure, Pattern Recognition (2009)
- et al., Cross-media retrieval using query dependent search methods, Pattern Recognition (2010)
- A. Axenopoulos, P. Daras, S. Malassiotis, V. Croce, M. Lazzaro, J. Etzold, P. Grimm, A. Massari, A. Camurri, T. …
- The ISOMAP algorithm and topological stability, Science
- Laplacian Eigenmaps for dimensionality reduction and data representation, Neural Computation
- Adaptive Control Processes: A Guided Tour
- Search and retrieval of rich media objects supporting multiple multimodal queries, IEEE Transactions on Multimedia
- 3D model comparison using spatial structure circular descriptor, Pattern Recognition
- A survey of content-based video retrieval, Journal of Computer Science
D. Rafailidis was born in Larissa, Greece, in 1982. He received the Diploma in Informatics from the Computer Science Department, the M.Sc. degree in Information Systems and the Ph.D. degree in Music Information Retrieval from the Aristotle University of Thessaloniki, Greece in 2005, 2007 and 2011, respectively. His main research interests include data mining, information retrieval and recommender systems. Currently, he is a Postdoctoral Research Fellow at Information Technologies Institute of the Center for Research and Technology Hellas (CERTH).
S. Manolopoulou was born in Larissa, Greece, in 1984. She received the B.Sc. degree in informatics and the M.Sc. degree in digital media, both from the Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece, in 2007 and 2009, respectively. She has been a Research Assistant at the Information Technologies Institute (ITI) of the Centre for Research and Technology Hellas (CERTH) since January 2010. Her main research interests include digital processing of medical images, biomedical signal processing, and multimedia content-based search and retrieval.
P. Daras was born in Athens, Greece, in 1974. He received the Diploma degree in electrical and computer engineering, the M.Sc. degree in medical informatics, and the Ph.D. degree in electrical and computer engineering, all from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1999, 2002, and 2005, respectively. He is a Researcher Grade C at the Information Technologies Institute of the Centre for Research and Technology Hellas (CERTH). His main research interests include search, retrieval and recognition of 3D objects, 3D object processing, medical informatics applications, medical image processing, 3D object watermarking, and bioinformatics. He serves as a reviewer/evaluator of European projects.