Learning shape retrieval from different modalities
Introduction
The rapid growth of the World Wide Web has raised the need for, and interest in, efficient tools that search large data collections for relevant information. While text, images and videos were, and still are, the dominant types of information commonly shared online, 3D model collections have started to become part of the web. This is due in part to the emergence of commodity acquisition devices and easy-to-use modeling tools, but also to the importance of 3D models in many fields of science, including engineering, architecture, medicine, biology and the digital entertainment industry. As a consequence, the need for efficient 3D search tools is growing. Such needs, however, may vary among different categories of users. Some, for example, may require access to 3D models using textual queries. Others may want to use sketches, images, or even other 3D models to query 3D data collections. Thus, a 3D search engine should provide a mechanism that enables users to search for relevant 3D models using (a combination of) different modalities as the query. Here, by modalities we mean 3D meshes, 2D images (such as those available on the internet) and hand-drawn sketches. This, however, requires narrowing the gap between the different modalities, or fusing these multimodal representations, which is still a very challenging and active topic in machine learning in general and in 3D shape analysis in particular.
In this paper, we address the use of multimodal representations for 3D shape classification and retrieval. Although several papers have proposed mechanisms for shape retrieval, most of them consider only one modality [1], [2], [10]. Recently, Li et al. [3] proposed a framework that addresses the joint combination of two different modalities. To do so, they first compute an embedding from 3D model similarities based on handcrafted features. They then project images into the embedding using a Convolutional Neural Network (CNN), which allows them to compute distances between images and 3D models.
In this paper, we propose the use of three types of modalities: 3D shapes, 2D images and sketches. We aim at discovering the explicit as well as the implicit relationships, which can be non-linear, between the various modalities used for shape retrieval. To this end, we project the three modalities into a common k-dimensional target space in such a way that similar entities, regardless of whether they are 3D models, images, or sketches, are mapped to points that lie close to each other. To construct the target space, we first compute shape signatures from a collection of 3D objects using a deep CNN. We then compute a mapping function P that maps the 3D shape signatures into the common space such that the dot product between mapped signatures is as close as possible to an original kernel representing the similarity between two shapes. Once the target space of the shape signatures is constructed, we project onto it, via a different CNN architecture, the two other modalities, namely images and sketches, such that similar objects, independently of whether they are represented as 3D models, images or sketches, can be easily identified.
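To make the role of the mapping P concrete, the sketch below builds a k-dimensional embedding from a precomputed kernel matrix of pairwise 3D shape similarities via a truncated eigendecomposition, so that dot products between embedded points approximate the kernel. This is a minimal illustration of the general idea under our own assumptions (function names, toy kernel), not the exact procedure of the paper.

```python
import numpy as np

def embed_from_kernel(K, k):
    """Map n shapes to a k-dimensional space whose dot products
    approximate the kernel matrix K (n x n, symmetric PSD)."""
    # Eigendecompose the kernel; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(K)
    # Keep the k largest eigenpairs; clamp tiny negatives caused by noise.
    idx = np.argsort(eigvals)[::-1][:k]
    lam = np.clip(eigvals[idx], 0.0, None)
    U = eigvecs[:, idx]
    # Rows of X are the embedded shapes: X @ X.T is a rank-k approximation of K.
    return U * np.sqrt(lam)

# Usage: K[i, j] holds the similarity between shapes i and j (toy values).
K = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
X = embed_from_kernel(K, k=2)
print(np.round(X @ X.T, 2))  # close to K
```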
We show that, with this framework, one can achieve good performance in multimodal 3D shape retrieval. We demonstrate the utility and performance of the proposed approach for retrieving 3D shapes using different modalities and validate it on various benchmarks, including large-scale SHREC 3D datasets.
In this paper, we propose a framework for embedding multiple entities into a common target space in which similarities both within the same entity type and across entity types can easily be computed. We study three different entities, namely hand-drawn sketches, 2D images and 3D mesh models. Several papers have studied this problem in the context of image and video retrieval using multiple modalities such as text, audio, images, and videos [4], [5], [6]. In this section, we survey the methods that are most related to our work, including 3D shape retrieval, image-based shape retrieval and sketch-based shape retrieval. Please refer to [7] for a review of the state of the art in image retrieval.
3D shape retrieval. 3D shape retrieval has been a very active field of research in the past decade. We refer the reader to some of the recent surveys on the topic [2], [8], [9], [10], [11], [12]. In this section, we do not aim to provide a full taxonomy of the field. We instead view the existing methods from the perspective of how their features are computed. Early methods used handcrafted descriptors to characterize the geometric and topological attributes of 3D shapes [13], [14], [15], [16], [17], [18], [19], [20], [21], [22]. Their performance, however, is often limited, particularly when dealing with medium or large databases, since they do not capture the essence or semantics of the shapes being indexed. More recent methods use learned descriptors involving either supervised or unsupervised feature extraction [23], [24], [25], [26]. Approaches based on Convolutional Neural Networks (CNNs) automatically discover efficient shape representations from voxelized [23] or depth [25] data.
While a large amount of research has been dedicated to designing efficient shape descriptors, little attention has been given to the way 3D databases are queried. Most existing techniques use 3D models as queries in order to retrieve other, similar 3D models. This is, however, not practical since users often want to retrieve 3D models for which they only have either an image or an abstract description, which can be transposed into a hand-drawn sketch.
Cross-modality 3D shape retrieval. Retrieving 3D shapes using 2D images or 2D/3D sketches as search queries requires a mechanism for comparing descriptors computed on images with descriptors computed on 3D models. This is not straightforward since images and 3D models have different representations and thus lie in two different spaces. Moreover, the features that can be extracted from 2D images often differ from those that can be computed from 3D models, so direct comparison is not feasible. Daras and Axenopoulos [27] used a global description method for comparing 3D shapes and 2D images. More recently, Li et al. [3] proposed a joint embedding (using CNNs) of images and 3D shapes into a common space in which a similarity between both entities can be computed. The embedding space in [3] is constructed from 3D shape similarities computed from handcrafted descriptors [28].
Sketch-based 3D shape retrieval methods aim at mapping 3D shapes into a space in which they can be compared with 2D sketches; see [2], [29] for a survey. Some techniques extract silhouettes from 3D models by projecting them into a binary image. The projected silhouettes are then compared with the 2D sketches. This technique has recently been used in works such as Kanai [30], Li et al. [31], and Aono and Iwabuchi [32]. The literature also includes the contour, or outline, feature view, which represents a shape as a series of points where the surface bends sharply and becomes invisible to the viewer [32], [33], [34], [35]. Other techniques use suggestive contour feature views [33] to build sketch-based 3D model retrieval systems [31], [36], [37], [38], [39]. Suggestive contours are contours in nearby views; that is, they become contours after the model is rotated slightly. Wang et al. [40] have recently proposed to learn feature representations using CNNs and suggestive contour views for sketch-based shape retrieval.
Fig. 1 overviews the approach proposed in this paper. The contributions of this work are three-fold. First, we design and develop an efficient framework for multimodal shape querying that involves 3D shapes (3D meshes), 2D images and hand-drawn sketches. Second, we propose a novel 3D shape descriptor based on local CNN features encoded using vectors of locally aggregated descriptors (VLAD) instead of conventional global CNN features. Third, using a kernel function computed from 3D shape similarity, we build a target space into which images in the wild and sketches can be projected through two different CNNs learned from 3D shapes. Once the target space is constructed, matching can be performed in the common space between the same entities (sketch-sketch, image-image and 3D shape-3D shape) and, more importantly, across different entities (sketch-image, sketch-3D shape and image-3D shape). Note that the proposed framework can also be used to retrieve a single modality using queries from multiple modalities, as well as multiple modalities using queries from a single modality or from multiple ones.
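As an illustration of the second contribution, the following sketch shows VLAD aggregation of local descriptors: each local CNN feature is assigned to its nearest codebook centroid, residuals are accumulated per centroid, and the concatenated result is power- and L2-normalized. The codebook size, feature dimension and the use of scikit-learn's k-means are assumptions made for this example, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(features, codebook):
    """Aggregate local descriptors (n x d) into a single VLAD vector
    using a codebook of c centroids (c x d)."""
    c, d = codebook.shape
    # Assign each local feature to its nearest centroid.
    assign = np.argmin(
        ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
    vlad = np.zeros((c, d))
    for i in range(c):
        if np.any(assign == i):
            # Accumulate residuals between features and their centroid.
            vlad[i] = (features[assign == i] - codebook[i]).sum(axis=0)
    vlad = vlad.ravel()
    # Signed square-root (power) and L2 normalization, standard for VLAD.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    return vlad / (np.linalg.norm(vlad) + 1e-12)

# Codebook learned offline with k-means on local CNN features (sizes illustrative).
local_feats = np.random.randn(500, 128)
codebook = KMeans(n_clusters=16, n_init=10).fit(local_feats).cluster_centers_
descriptor = vlad_encode(local_feats, codebook)  # length 16 * 128
```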
3D shape description
Prior to feature extraction, we normalize the pose and scale of the 3D objects in order to ensure invariance to translation and scaling. We do not normalize for rotation because the locations of local features are completely ignored in our method. We first translate the 3D objects to their center of mass and then scale them so that their minimum bounding sphere has radius 1 [41]. We then represent a 3D object using a set of 2D views captured by virtual cameras.
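A minimal sketch of this normalization, and of placing virtual cameras around the normalized object, is given below. The center of mass is stood in for by the vertex centroid, the bounding-sphere radius is approximated by the largest centroid-to-vertex distance (the exact minimum bounding sphere would require, e.g., Welzl's algorithm), and the Fibonacci-spiral viewpoint layout and view count are our assumptions for illustration.

```python
import numpy as np

def normalize_mesh(vertices):
    """Translate a vertex array (n x 3) to its centroid (a common stand-in
    for the center of mass) and rescale so the bounding sphere has radius 1."""
    v = vertices - vertices.mean(axis=0)          # translation invariance
    return v / np.linalg.norm(v, axis=1).max()    # scale invariance

def camera_positions(n_views, distance=2.0):
    """Roughly uniform virtual-camera positions on a sphere around the
    object (Fibonacci spiral; n_views and distance are illustrative)."""
    i = np.arange(n_views)
    phi = np.arccos(1.0 - 2.0 * (i + 0.5) / n_views)  # polar angle
    theta = np.pi * (1.0 + 5.0 ** 0.5) * i            # golden-angle azimuth
    return distance * np.stack([np.sin(phi) * np.cos(theta),
                                np.sin(phi) * np.sin(theta),
                                np.cos(phi)], axis=1)

# Usage: normalize, then render one view per camera with any mesh renderer.
verts = np.random.rand(1000, 3) * 10.0
verts_n = normalize_mesh(verts)
cams = camera_positions(20)
```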
Target space construction
Using the similarities between the shapes in a collection of 3D models, our aim is to build a target space into which images in the wild and sketches can be projected. The target space allows discovering the explicit as well as implicit relationships between the different modalities, which can be non-linear. Below, we present the details of the construction of the target space.
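Once target-space coordinates are fixed for the 3D shapes (e.g., via a kernel embedding as sketched earlier), the other modalities can be projected onto them by training a CNN to regress those coordinates. The sketch below uses a ResNet-18 backbone and a mean-squared-error loss in PyTorch; the backbone choice, target dimension and loss are assumptions for illustration, not necessarily the architecture used in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

k = 128  # dimension of the target space (illustrative)

# CNN that maps an image or sketch to a point in the k-dimensional target space.
backbone = models.resnet18(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, k)  # regression head

optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)
criterion = nn.MSELoss()  # pull projections toward the shapes' embeddings

def train_step(images, target_points):
    """images: batch of sketches or photos (batch x 3 x H x W);
    target_points: precomputed target-space coordinates of the
    corresponding 3D shapes (batch x k)."""
    optimizer.zero_grad()
    pred = backbone(images)
    loss = criterion(pred, target_points)
    loss.backward()
    optimizer.step()
    return loss.item()
```

A separate network of the same form would be trained per modality (one for images, one for sketches), since the two input domains differ.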
Experimental results
The main benefit of the proposed framework is that, since 3D shapes, images and sketches are all projected into a common space, the user has high flexibility in the way a query can be specified. The user can also select the preferred retrieved modality. We carry out a set of experiments in which we present results from different choices of queries, including retrieval from the same modality (3D shape-3D shape), a single cross modality (image-3D shape as well as sketch-3D shape) and multiple cross modalities.
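Because every modality lives in the same space, retrieval reduces to nearest-neighbor search over embedded points, as the minimal sketch below shows; the gallery size, dimension and the choice of Euclidean distance are illustrative assumptions.

```python
import numpy as np

def retrieve(query_point, gallery_points, top=5):
    """Rank gallery entries (of any modality) by Euclidean distance to the
    query's point in the common space; smaller distance = more similar."""
    d = np.linalg.norm(gallery_points - query_point, axis=1)
    order = np.argsort(d)[:top]
    return order, d[order]

# The gallery and the query live in the same space, so the same call
# handles sketch-3D shape, image-3D shape, sketch-image, etc.
gallery = np.random.randn(1000, 128)   # embedded 3D shapes (illustrative)
query = np.random.randn(128)           # embedded sketch or image
idx, dist = retrieve(query, gallery)
```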
Conclusion
We addressed in this paper the problem of 3D shape retrieval using multiple modalities. We proposed a general framework for comparing objects from three different modalities. With this approach, images and 2D sketches are embedded, through their similarities with 3D models, into a common space with a Euclidean structure. By treating the diverse objects as points in the common embedding space, one can easily quantify their similarities using various types of distance measures. As a result, the proposed framework supports retrieval both within a single modality and across modalities.
References (56)
- et al., A comparison of 3D shape retrieval methods based on a large-scale benchmark supporting multimodal queries, Comput. Vis. Image Underst. (2015)
- et al., Joint embeddings of shapes and images via CNN image purification, ACM Trans. Graph. (TOG) (2015)
- et al., A novel feature representation for automatic 3D object recognition in cluttered scenes, Neurocomputing (2016)
- et al., Neural shape codes for 3D model retrieval, Pattern Recognit. Lett. (2015)
- et al., A comparison of methods for sketch-based 3D shape retrieval, Comput. Vis. Image Underst. (2014)
- et al., SHREC’11 track: shape retrieval on non-rigid 3D watertight meshes, in: Proceedings of the Fourth Eurographics Conference on 3D Object Retrieval (3DOR) (2011)
- et al., Multimodal deep learning, in: Proceedings of the Twenty-Eighth International Conference on Machine Learning (ICML-11) (2011)
- et al., Multimodal learning with deep Boltzmann machines, in: Advances in Neural Information Processing Systems (2012)
- et al., DeViSE: a deep visual-semantic embedding model, in: Advances in Neural Information Processing Systems (2013)
- et al., State-of-the-art in Content-based Image and Video Retrieval (2013)
- A survey of content based 3D shape retrieval methods, Multimed. Tools Appl.
- Content based 3D model retrieval: a survey, in: Proceedings of the 2008 International Workshop on Content-Based Multimedia Indexing (CBMI)
- SHREC’12 track: generic 3D shape retrieval, in: Proceedings of the Fifth Eurographics Conference on 3D Object Retrieval (3DOR)
- A survey on shape correspondence, Computer Graphics Forum
- A survey on partial retrieval of 3D shapes, J. Comput. Sci. Technol.
- 3D-shape retrieval using curves and HMM, in: Proceedings of the Twentieth International Conference on Pattern Recognition (ICPR)
- Covariance-based descriptors for efficient 3D shape matching, retrieval, and classification, IEEE Trans. Multimed.
- Shape Google: geometric words and expressions for invariant shape retrieval, ACM Trans. Graph.
- Covariance descriptors for 3D shape matching and retrieval, in: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition
- Visual vocabulary signature for 3D object retrieval and partial matching, in: Proceedings of the Second Eurographics Conference on 3D Object Retrieval (3DOR)
- Local visual patch for 3D shape retrieval, in: Proceedings of the ACM Workshop on 3D Object Retrieval (3DOR)
- A parts-based approach for automatic 3D shape categorization using belief functions, ACM Trans. Intell. Syst. Technol. (TIST)
- 3D shape matching via two layer coding, IEEE Trans. Pattern Anal. Mach. Intell.
- Shape vocabulary: a robust and efficient shape representation for shape matching, IEEE Trans. Image Process.
- 3D ShapeNets: a deep representation for volumetric shapes, in: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition
- Learning high-level feature by deep belief networks for 3-D model retrieval and recognition, IEEE Trans. Multimed.
- A 3D shape retrieval framework supporting multimodal queries, Int. J. Comput. Vis.
Hedi Tabia received the engineer degree from the Engineering School of Sfax (ENIS) in 2007 and the M.S. degree from the INSA of Rouen, a public school of engineers in France, in 2008, both in computer science. In 2011, he obtained the Ph.D. degree in computer science from the University of Lille. From October 2011 to August 2012, he held a postdoctoral research associate position at the IEF laboratory (University of Paris-Sud). Since September 2012, he has been an associate professor at the ENSEA in the ETIS Laboratory.
Hamid Laga received the M.Sc. and Ph.D. degrees in Computer Science from Tokyo Institute of Technology in 2003 and 2006, respectively. He is currently an associate professor at the School of Engineering and IT, Murdoch University (Australia). Prior to joining Murdoch, Hamid worked as a senior research fellow at the Phenomics and Bioinformatics Research Centre (PBRC) of the University of South Australia (2012–2016), Associate Professor at the Institut Telecom, Telecom Lille1, in France (2010–2012), Assistant Professor at Tokyo Institute of Technology (2006–2010), and postdoctoral fellow at Nara Institute of Science and Technology in Japan (2006). His research interests span various fields of computer vision, computer graphics, and pattern recognition, with a special focus on the 3D acquisition, modeling and analysis of static and deformable 3D objects. His contributions in these fields received the Best Paper Award at the IEEE International Conference on Shape Modeling (2006), the Best Paper Award at the NICOGRAPH Paper Contest (2007), the International Paper Grand Prix (Best Paper Award) from the Japan Society of Art and Science (2008), and the APRS/IAPR Best Paper Prize at the IEEE International Conference on Digital Image Computing: Techniques and Applications (DICTA) 2012.