Pattern Recognition

Volume 46, Issue 12, December 2013, Pages 3358-3370

A unified framework for multimodal retrieval

https://doi.org/10.1016/j.patcog.2013.05.023

Highlights

  • We propose an automatic weighting scheme for multimodal retrieval.

  • We handle the case of external multimodal queries.

  • Our experimental evaluation is performed on five multimodal datasets.

  • The method achieves high performance in terms of retrieval accuracy and computational efficiency.

Abstract

In this paper, a unified framework for multimodal content retrieval is presented. The proposed framework supports the retrieval of rich media objects as unified sets of different modalities (image, audio, 3D, video and text) by efficiently combining all monomodal heterogeneous similarities into a global one according to an automatic weighting scheme. A multimodal space is then constructed to capture the semantic correlations among the multiple modalities. In contrast to existing techniques, the proposed method is also able to handle external multimodal queries by embedding them into the already constructed multimodal space, following a space mapping procedure based on submanifold analysis. In experiments with five real multimodal datasets, we show the superiority of the proposed approach over competitive methods.

Introduction

The continuously increasing amount of multimedia content on the Internet has created an imperative need for searching various online multimedia databases. Traditional text-based retrieval techniques have failed to address the requirements of searching this massive media content; therefore, research has focused on content-based multimedia retrieval methods. Searching for results similar to a query requires the automatic extraction of low-level features from the media, e.g., in the case of an image these would be color, texture, shape, etc. Thus, several content-based techniques have been developed in the past, performing retrieval of a single modality, such as 3D objects [11], [19], images [1], [31], video [12], [23] or audio [2], [30].

However, users who search for content are interested in finding results that are semantically similar to a query, regardless of its modality. Towards this aim, Yang et al. [37] proposed a method for connecting semantically similar media of different modalities. In order to manage the case of different modalities carrying the same semantics, the concept of the Multimedia Document (MMD) was introduced. An example of an MMD is presented in Fig. 1: it describes the physical entity of an airplane and consists of its 3D representation, a real image and a sound.
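
Conceptually, an MMD groups the low-level descriptor vectors of all available modalities of a single entity, where any modality may be absent. The following minimal sketch (illustrative only, not code from the paper; the field names and descriptor dimensions are hypothetical) shows one way such a structure could be represented in Python:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

import numpy as np


@dataclass
class MMD:
    """A Multimedia Document: one physical entity described by several
    modalities, each carried as a low-level descriptor vector.
    A missing modality is stored as None."""
    entity_id: str
    descriptors: Dict[str, Optional[np.ndarray]] = field(default_factory=dict)


# The airplane MMD of Fig. 1: a 3D representation, a real image and a sound.
airplane = MMD(
    entity_id="airplane-001",
    descriptors={
        "3d": np.random.rand(128),     # hypothetical 3D shape descriptor
        "image": np.random.rand(144),  # hypothetical image descriptor (e.g. CEDD)
        "audio": np.random.rand(64),   # hypothetical audio feature vector
        "video": None,                 # modality unavailable for this MMD
        "text": None,
    },
)
```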

Recently, multimedia search engines have evolved to allow combinations of queries of different modalities. Multimodal search lets users enter multiple query types and retrieve multiple types of media simultaneously, in the form of MMDs. Such an approach has been introduced by the I-SEARCH framework [3]. I-SEARCH is a real-world application that enables retrieval of several types of media (3D objects, 2D images, sound, video and text) using as a query any of the above types, or combinations of them in the form of MMDs. I-SEARCH took a significant step towards content-based multimedia retrieval, in which users can search and retrieve media of any modality through a single unified retrieval framework rather than a specialized system for each separate modality. Moreover, users of I-SEARCH can enter multiple queries simultaneously and thus retrieve more relevant results. However, handling media in the form of MMDs is a highly complicated process, since multimodal retrieval requires successful modeling of the low-level feature associations among the different modalities.

Section snippets

Dimensionality reduction methods for monomodal retrieval

In content-based retrieval methods, media are usually represented by low-level features in the form of high-dimensional descriptor vectors, to which a distance metric (most often Euclidean-based) is applied to calculate similarity. However, since in most cases the extracted high-dimensional descriptor vectors suffer from the curse of dimensionality [6], such metrics are inappropriate for efficient large-scale retrieval. Therefore, nonlinear dimensionality reduction methods based on manifold learning have been adopted for monomodal retrieval.
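
As a rough illustration of this direction (a generic sketch under simplifying assumptions, not the paper's specific method), a manifold-learning technique such as ISOMAP can embed the high-dimensional descriptors of one modality into a low-dimensional space in which Euclidean ranking is more reliable:

```python
import numpy as np
from sklearn.manifold import Isomap

# Toy stand-in for one monomodal database: 200 media objects,
# each represented by a 128-dimensional low-level descriptor vector.
rng = np.random.default_rng(0)
descriptors = rng.random((200, 128))

# Nonlinear dimensionality reduction via ISOMAP: embed the descriptors
# into 8 dimensions, so that Euclidean distances better follow the
# intrinsic data manifold than in the raw 128-dimensional space.
embedding = Isomap(n_neighbors=10, n_components=8).fit_transform(descriptors)

# Monomodal retrieval for an internal query (here: database object 0)
# then ranks the database by Euclidean distance in the embedded space.
query = embedding[0]
ranking = np.argsort(np.linalg.norm(embedding - query, axis=1))
print(ranking[:10])  # indices of the ten nearest objects
```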

The proposed method

The proposed method is divided into the following two steps: (a) construction of the multimodal semantic space of MMDs and (b) multimodal search and retrieval.
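
The snippets do not include the formulas of the automatic weighting scheme. As a hedged sketch of the general shape of step (a), the hypothetical function below combines whichever monomodal similarities are available for a pair of MMDs into one global similarity, skipping missing modalities and renormalizing the weights; the placeholder similarity measure and the externally supplied weights merely stand in for the paper's automatic scheme:

```python
from typing import Dict, Optional

import numpy as np

Descriptors = Dict[str, Optional[np.ndarray]]  # modality -> vector or None


def global_similarity(q: Descriptors, d: Descriptors,
                      weights: Dict[str, float]) -> float:
    """Combine the available monomodal similarities of two MMDs into a
    single global similarity. Modalities missing from either MMD are
    skipped and the remaining weights are renormalized, so that MMDs
    with different modality availability stay comparable."""
    num = den = 0.0
    for modality, w in weights.items():
        qv, dv = q.get(modality), d.get(modality)
        if qv is None or dv is None:
            continue  # modality unavailable for this pair of MMDs
        # Placeholder monomodal similarity; the paper combines
        # heterogeneous, modality-specific similarity measures.
        sim = 1.0 / (1.0 + float(np.linalg.norm(qv - dv)))
        num += w * sim
        den += w
    return num / den if den > 0.0 else 0.0


# Hypothetical usage, given two descriptor dictionaries q and d:
# score = global_similarity(q, d, {"3d": 0.4, "image": 0.4, "audio": 0.2})
```

Step (b) would then rank all MMDs in the database by this global similarity to the query MMD.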

Datasets

Experimental evaluation is performed on five real multimodal datasets, denoted by DS1, DS2, DS3, DS4 and DS5. Further details are provided in Table 1.

Conclusions

In this paper, a unified framework for multimodal content-based search and retrieval is presented, which supports internal and external MMD queries. Following an innovative weighting strategy, all monomodal heterogeneous similarities are combined into a global MMD similarity by considering (a) the different nature of the monomodal similarities and (b) the availability of modalities per MMD in the database. As a result, the local neighborhood of each media modality is preserved in the constructed multimodal space.

Conflict of interest statement

None declared.

Acknowledgment

This work was supported by the EC project 3DVIVANT, GA-248420.

References (39)

  • M. Balasubramanian et al., The ISOMAP algorithm and topological stability, Science (2002)

  • M. Belkin et al., Laplacian Eigenmaps for dimensionality reduction and data representation, Neural Computation (2003)

  • R. Bellman, Adaptive Control Processes: A Guided Tour (1961)

  • S.A. Chatzichristofis, Y.S. Boutalis, CEDD: color and edge directivity descriptor—a compact descriptor for image...

  • P. Daras, A. Axenopoulos, A 3D Shape Retrieval Framework Supporting Multimodal Queries, vol. 89, no. 2–3, September...

  • P. Daras et al., Search and retrieval of rich media objects supporting multiple multimodal queries, IEEE Transactions on Multimedia (2012)

  • P. Ehlen, M. Johnston, Location grounding in multimodal local search, in: Proceedings of ICMI-MLMI, 2010, pp....

  • T. Gao et al., 3D model comparison using spatial structure circular descriptor, Pattern Recognition (2010)

  • P. Geetha et al., A survey of content-based video retrieval, Journal of Computer Science (2008)

D. Rafailidis was born in Larissa, Greece, in 1982. He received the Diploma in Informatics from the Computer Science Department, the M.Sc. degree in Information Systems and the Ph.D. degree in Music Information Retrieval from the Aristotle University of Thessaloniki, Greece in 2005, 2007 and 2011, respectively. His main research interests include data mining, information retrieval and recommender systems. Currently, he is a Postdoctoral Research Fellow at the Information Technologies Institute of the Centre for Research and Technology Hellas (CERTH).

S. Manolopoulou was born in Larissa, Greece, in 1984. She received the B.Sc. degree in informatics and the M.Sc. degree in digital media, both from the Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece, in 2007 and 2009, respectively. She has been a Research Assistant at the Information Technologies Institute (ITI) of the Centre for Research and Technology Hellas (CERTH) since January 2010. Her main research interests include digital processing of medical images, biomedical signal processing, and multimedia content-based search and retrieval.

P. Daras was born in Athens, Greece, in 1974. He received the Diploma degree in electrical and computer engineering, the M.Sc. degree in medical informatics, and the Ph.D. degree in electrical and computer engineering, all from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1999, 2002, and 2005, respectively. He is a Researcher Grade C at the Information Technologies Institute of the Centre for Research and Technology Hellas (CERTH). His main research interests include search, retrieval and recognition of 3D objects, 3D object processing, medical informatics applications, medical image processing, 3D object watermarking, and bioinformatics. He serves as a reviewer/evaluator of European projects.
