Learning shape retrieval from different modalities
Introduction
The rapid growth of the World Wide Web has raised the need for, and interest in, efficient tools that search large data collections for relevant information. While text, images and videos were, and still are, the dominant types of information commonly shared online, 3D model collections have started to become part of the web. This is due in part to the emergence of commodity acquisition devices and easy-to-use modeling tools, but also to the importance of 3D models in many fields of science, including engineering, architecture, medicine, biology and the digital entertainment industry. As a consequence, the need for efficient 3D search tools is growing. Such needs, however, may vary among different categories of users. Some, for example, may require access to 3D models using textual queries. Others may want to use sketches, images, or even other 3D models to query 3D data collections. Thus, a 3D search engine should provide a mechanism that enables users to search for relevant 3D models using (a combination of) different modalities as the query. Here, by modalities we mean 3D meshes, 2D images (such as those available on the internet) and hand-drawn sketches. This, however, requires narrowing the gap between the different modalities, or fusing these multimodal representations, which is still a very challenging and active topic in machine learning in general and in 3D shape analysis in particular.
In this paper, we address the use of multimodal representations for 3D shape classification and retrieval. Although several papers have proposed mechanisms for shape retrieval, most of them consider only one modality [1], [2], [10]. Recently, Li et al. [3] proposed a framework that addresses the joint combination of two different modalities. To do so, they first compute an embedding from 3D model similarities based on handcrafted features. They then project images into the embedding using a Convolutional Neural Network (CNN), which allows them to compute distances between images and 3D models.
In this paper, we propose the use of three types of modalities: 3D shapes, 2D images and sketches. We aim at discovering the explicit as well as the implicit relationships, which can be non-linear, between the various modalities used for shape retrieval. To this end, we project the three modalities into a common k-dimensional target space in such a way that similar entities, regardless of whether they are 3D models, images, or sketches, are mapped to points that lie close to each other. To construct the target space, we first compute shape signatures from a collection of 3D objects using a deep CNN. We then compute a mapping function P that maps the 3D shape signatures into the common space such that the dot product between mapped signatures is as close as possible to an original kernel representing the similarity between two shapes. Once the target space of the shape signatures is constructed, we project onto it, via a different CNN architecture, the two other modalities, namely images and sketches, such that similar objects, independently of whether they are represented as 3D models, images or sketches, can be easily identified.
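To make the role of the mapping P concrete, the sketch below builds a k-dimensional embedding from a precomputed kernel matrix of pairwise 3D shape similarities via a truncated eigendecomposition, so that dot products between embedded points approximate the kernel. This is a minimal illustration of the general idea under our own assumptions (function names, toy kernel), not the exact procedure of the paper.

```python
import numpy as np

def embed_from_kernel(K, k):
    """Map n shapes to a k-dimensional space whose dot products
    approximate the kernel matrix K (n x n, symmetric PSD)."""
    # Eigendecompose the kernel; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(K)
    # Keep the k largest eigenpairs; clamp tiny negatives caused by noise.
    idx = np.argsort(eigvals)[::-1][:k]
    lam = np.clip(eigvals[idx], 0.0, None)
    U = eigvecs[:, idx]
    # Rows of X are the embedded shapes: X @ X.T is a rank-k approximation of K.
    return U * np.sqrt(lam)

# Usage: K[i, j] holds the similarity between shapes i and j (toy values).
K = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
X = embed_from_kernel(K, k=2)
print(np.round(X @ X.T, 2))  # close to K
```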
We show that, with this framework, one can achieve good performance in multimodal 3D shape retrieval. We demonstrate the utility and performance of the proposed approach for retrieving 3D shapes using different modalities and validate it on various benchmarks, including large-scale SHREC 3D datasets.
In this paper, we propose a framework for embedding multiple entities into a common target space in which similarities both within the same entity type and across entity types can easily be computed. We study three different entities, namely hand-drawn sketches, 2D images and 3D mesh models. Several papers have studied this problem in the context of image and video retrieval using multiple modalities such as text, audio, images, and videos [4], [5], [6]. In this section, we survey the methods that are most related to our work, including 3D shape retrieval, image-based shape retrieval and sketch-based shape retrieval. Please refer to [7] for a review of the state of the art in image retrieval.
3D shape retrieval. 3D shape retrieval has been a very active field of research in the past decade. We refer the reader to some of the recent surveys on the topic [2], [8], [9], [10], [11], [12]. In this section, we do not aim to provide a full taxonomy of the field. We instead view the existing methods from the perspective of how their features are computed. Early methods used handcrafted descriptors to characterize the geometric and topological attributes of 3D shapes [13], [14], [15], [16], [17], [18], [19], [20], [21], [22]. Their performance, however, is often limited, particularly when dealing with medium or large databases, since they do not capture the essence or semantics of the shapes being indexed. More recent methods use learned descriptors involving either supervised or unsupervised feature extraction [23], [24], [25], [26]. Approaches based on Convolutional Neural Networks (CNNs) automatically discover efficient shape representations from voxelized [23] or depth [25] data.
While a large amount of research has been dedicated to designing efficient shape descriptors, little attention has been given to the way 3D databases are queried. Most existing techniques use 3D models as queries in order to retrieve other, similar 3D models. This is, however, not practical since users often want to retrieve 3D models for which they only have either an image or an abstract description, which can be transposed into a hand-drawn sketch.
Cross-modality 3D shape retrieval. Retrieving 3D shapes using 2D images or 2D/3D sketches as search queries requires a mechanism for comparing descriptors computed on images with descriptors computed on 3D models. This is not straightforward since images and 3D models have different representations and thus lie in two different spaces. Moreover, the features that can be extracted from 2D images often differ from those that can be computed from 3D models, so direct comparison is not feasible. Daras and Axenopoulos [27] used a global description method for comparing 3D shapes and 2D images. More recently, Li et al. [3] proposed a joint embedding (using CNNs) of images and 3D shapes into a common space in which a similarity between both entities can be computed. The embedding space in [3] is constructed from 3D shape similarities computed from handcrafted descriptors [28].
Sketch-based 3D shape retrieval methods aim at mapping 3D shapes into a space in which they can be compared with 2D sketches; see [2], [29] for a survey. Some techniques extract silhouettes from 3D models by projecting them into a binary image. The projected silhouettes are then compared with the 2D sketches. This technique has recently been used in works such as Kanai [30], Li et al. [31], and Aono and Iwabuchi [32]. The literature also includes the contour, or outline, feature view, which represents a shape as a series of points where the surface bends sharply and becomes invisible to the viewer [32], [33], [34], [35]. Other techniques use suggestive contour feature views [33] to build sketch-based 3D model retrieval systems [31], [36], [37], [38], [39]. Suggestive contours are contours in nearby views; that is, they become contours after the model is rotated slightly. Wang et al. [40] have recently proposed to learn feature representations using CNNs and suggestive contour views for sketch-based shape retrieval.
Fig. 1 overviews the approach proposed in this paper. The contributions of this work are three-fold. First, we design and develop an efficient framework for multimodal shape querying that involves 3D shapes (3D meshes), 2D images and hand-drawn sketches. Second, we propose a novel 3D shape descriptor based on local CNN features encoded using vectors of locally aggregated descriptors (VLAD) instead of conventional global CNN features. Third, using a kernel function computed from 3D shape similarity, we build a target space into which images in the wild and sketches can be projected through two different CNNs learned from 3D shapes. Once the target space is constructed, matching can be performed in the common space between the same entities (sketch-sketch, image-image and 3D shape-3D shape) and, more importantly, across different entities (sketch-image, sketch-3D shape and image-3D shape). Note that the proposed framework can also be used to retrieve a single modality using queries from multiple modalities, as well as multiple modalities using queries from a single modality or from multiple ones.
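As an illustration of the second contribution, the following sketch shows VLAD aggregation of local descriptors: each local CNN feature is assigned to its nearest codebook centroid, residuals are accumulated per centroid, and the concatenated result is power- and L2-normalized. The codebook size, feature dimension and the use of scikit-learn's k-means are assumptions made for this example, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(features, codebook):
    """Aggregate local descriptors (n x d) into a single VLAD vector
    using a codebook of c centroids (c x d)."""
    c, d = codebook.shape
    # Assign each local feature to its nearest centroid.
    assign = np.argmin(
        ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
    vlad = np.zeros((c, d))
    for i in range(c):
        if np.any(assign == i):
            # Accumulate residuals between features and their centroid.
            vlad[i] = (features[assign == i] - codebook[i]).sum(axis=0)
    vlad = vlad.ravel()
    # Signed square-root (power) and L2 normalization, standard for VLAD.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    return vlad / (np.linalg.norm(vlad) + 1e-12)

# Codebook learned offline with k-means on local CNN features (sizes illustrative).
local_feats = np.random.randn(500, 128)
codebook = KMeans(n_clusters=16, n_init=10).fit(local_feats).cluster_centers_
descriptor = vlad_encode(local_feats, codebook)  # length 16 * 128
```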
3D shape description
Prior to feature extraction, we normalize the pose and scale of the 3D objects in order to ensure invariance to translation and scaling. We do not normalize for rotation because the locations of local features are completely ignored in our method. We first translate the 3D objects to their center of mass and then scale them so that their minimum bounding sphere has radius 1 [41]. We then represent a 3D object using a set of 2D views captured by virtual cameras.
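A minimal sketch of this normalization, and of placing virtual cameras around the normalized object, is given below. The center of mass is stood in for by the vertex centroid, the bounding-sphere radius is approximated by the largest centroid-to-vertex distance (the exact minimum bounding sphere would require, e.g., Welzl's algorithm), and the Fibonacci-spiral viewpoint layout and view count are our assumptions for illustration.

```python
import numpy as np

def normalize_mesh(vertices):
    """Translate a vertex array (n x 3) to its centroid (a common stand-in
    for the center of mass) and rescale so the bounding sphere has radius 1."""
    v = vertices - vertices.mean(axis=0)          # translation invariance
    return v / np.linalg.norm(v, axis=1).max()    # scale invariance

def camera_positions(n_views, distance=2.0):
    """Roughly uniform virtual-camera positions on a sphere around the
    object (Fibonacci spiral; n_views and distance are illustrative)."""
    i = np.arange(n_views)
    phi = np.arccos(1.0 - 2.0 * (i + 0.5) / n_views)  # polar angle
    theta = np.pi * (1.0 + 5.0 ** 0.5) * i            # golden-angle azimuth
    return distance * np.stack([np.sin(phi) * np.cos(theta),
                                np.sin(phi) * np.sin(theta),
                                np.cos(phi)], axis=1)

# Usage: normalize, then render one view per camera with any mesh renderer.
verts = np.random.rand(1000, 3) * 10.0
verts_n = normalize_mesh(verts)
cams = camera_positions(20)
```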
Target space construction
Using the similarities between the shapes in a collection of 3D models, our aim is to build a target space into which images in the wild and sketches can be projected. The target space allows discovering the explicit as well as implicit relationships between the different modalities, which can be non-linear. Below, we present the details of the construction of the target space.
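Once target-space coordinates are fixed for the 3D shapes (e.g., via a kernel embedding as sketched earlier), the other modalities can be projected onto them by training a CNN to regress those coordinates. The sketch below uses a ResNet-18 backbone and a mean-squared-error loss in PyTorch; the backbone choice, target dimension and loss are assumptions for illustration, not necessarily the architecture used in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

k = 128  # dimension of the target space (illustrative)

# CNN that maps an image or sketch to a point in the k-dimensional target space.
backbone = models.resnet18(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, k)  # regression head

optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)
criterion = nn.MSELoss()  # pull projections toward the shapes' embeddings

def train_step(images, target_points):
    """images: batch of sketches or photos (batch x 3 x H x W);
    target_points: precomputed target-space coordinates of the
    corresponding 3D shapes (batch x k)."""
    optimizer.zero_grad()
    pred = backbone(images)
    loss = criterion(pred, target_points)
    loss.backward()
    optimizer.step()
    return loss.item()
```

A separate network of the same form would be trained per modality (one for images, one for sketches), since the two input domains differ.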
Experimental results
The main benefit of the proposed framework is that, since 3D shapes, images and sketches are all projected into a common space, the user has high flexibility in the way a query can be specified. The user can also select the preferred retrieved modality. We carry out a set of experiments in which we present results from different choices of queries, including retrieval from the same modality (3D shape-3D shape), a single cross modality (image-3D shape as well as sketch-3D shape) and multiple cross modalities.
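Because every modality lives in the same space, retrieval reduces to nearest-neighbor search over embedded points, as the minimal sketch below shows; the gallery size, dimension and the choice of Euclidean distance are illustrative assumptions.

```python
import numpy as np

def retrieve(query_point, gallery_points, top=5):
    """Rank gallery entries (of any modality) by Euclidean distance to the
    query's point in the common space; smaller distance = more similar."""
    d = np.linalg.norm(gallery_points - query_point, axis=1)
    order = np.argsort(d)[:top]
    return order, d[order]

# The gallery and the query live in the same space, so the same call
# handles sketch-3D shape, image-3D shape, sketch-image, etc.
gallery = np.random.randn(1000, 128)   # embedded 3D shapes (illustrative)
query = np.random.randn(128)           # embedded sketch or image
idx, dist = retrieve(query, gallery)
```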
Conclusion
We addressed in this paper the problem of 3D shape retrieval using multiple modalities. We proposed a general framework for comparing objects from three different modalities. With this approach, images and 2D sketches are embedded, through their similarities with 3D models, into a common space with a Euclidean structure. By treating the diverse objects as points in the common embedding space, one can easily quantify their similarities using various types of distance measures. As a result, the proposed framework supports retrieval both within a single modality and across modalities.
References (56)
- et al., A comparison of 3D shape retrieval methods based on a large-scale benchmark supporting multimodal queries, Comput. Vis. Image Underst. (2015)
- et al., Joint embeddings of shapes and images via CNN image purification, ACM Trans. Graph. (TOG) (2015)
- et al., A novel feature representation for automatic 3D object recognition in cluttered scenes, Neurocomputing (2016)
- et al., Neural shape codes for 3D model retrieval, Pattern Recognit. Lett. (2015)
- et al., A comparison of methods for sketch-based 3D shape retrieval, Comput. Vis. Image Underst. (2014)
- et al., SHREC’11 track: shape retrieval on non-rigid 3D watertight meshes, in: Proceedings of the Fourth Eurographics Conference on 3D Object Retrieval (3DOR) (2011)
- et al., Multimodal deep learning, in: Proceedings of the Twenty-Eighth International Conference on Machine Learning (ICML-11) (2011)
- et al., Multimodal learning with deep Boltzmann machines, in: Advances in Neural Information Processing Systems (2012)
- et al., DeViSE: a deep visual-semantic embedding model, in: Advances in Neural Information Processing Systems (2013)
- et al., State-of-the-art in Content-based Image and Video Retrieval (2013)
- A survey of content based 3D shape retrieval methods, Multimed. Tools Appl.
- Content based 3D model retrieval: a survey, in: Proceedings of the 2008 International Workshop on Content-Based Multimedia Indexing (CBMI)
- SHREC’12 track: generic 3D shape retrieval, in: Proceedings of the Fifth Eurographics Conference on 3D Object Retrieval (3DOR)
- A survey on shape correspondence, Computer Graphics Forum
- A survey on partial retrieval of 3D shapes, J. Comput. Sci. Technol.
- 3D-shape retrieval using curves and HMM, in: Proceedings of the Twentieth International Conference on Pattern Recognition (ICPR)
- Covariance-based descriptors for efficient 3D shape matching, retrieval, and classification, IEEE Trans. Multimed.
- Shape Google: geometric words and expressions for invariant shape retrieval, ACM Trans. Graph.
- Covariance descriptors for 3D shape matching and retrieval, in: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition
- Visual vocabulary signature for 3D object retrieval and partial matching, in: Proceedings of the Second Eurographics Conference on 3D Object Retrieval (3DOR)
- Local visual patch for 3D shape retrieval, in: Proceedings of the ACM Workshop on 3D Object Retrieval (3DOR)
- A parts-based approach for automatic 3D shape categorization using belief functions, ACM Trans. Intell. Syst. Technol. (TIST)
- 3D shape matching via two layer coding, IEEE Trans. Pattern Anal. Mach. Intell.
- Shape vocabulary: a robust and efficient shape representation for shape matching, IEEE Trans. Image Process.
- 3D ShapeNets: a deep representation for volumetric shapes, in: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition
- Learning high-level feature by deep belief networks for 3-D model retrieval and recognition, IEEE Trans. Multimed.
- A 3D shape retrieval framework supporting multimodal queries, Int. J. Comput. Vis.
Hedi Tabia received the engineer degree from the Engineering School of Sfax (ENIS) in 2007 and the M.S. degree from the INSA of Rouen, a public school of engineers in France, in 2008, both in computer science. In 2011, he obtained the Ph.D. degree in computer science from the University of Lille. From October 2011 to August 2012, he held a postdoctoral research associate position at the IEF laboratory (University of Paris-Sud). Since September 2012, he has been an associate professor at the ENSEA in the ETIS Laboratory.
Hamid Laga received the M.Sc. and Ph.D. degrees in Computer Science from Tokyo Institute of Technology in 2003 and 2006, respectively. He is currently an associate professor at the School of Engineering and IT, Murdoch University (Australia). Prior to joining Murdoch, Hamid worked as a senior research fellow at the Phenomics and Bioinformatics Research Centre (PBRC) of the University of South Australia (2012–2016), Associate Professor at the Institut Telecom, Telecom Lille1, in France (2010–2012), Assistant Professor at Tokyo Institute of Technology (2006–2010), and postdoctoral fellow at Nara Institute of Science and Technology in Japan (2006). His research interests span various fields of computer vision, computer graphics, and pattern recognition, with a special focus on the 3D acquisition, modeling and analysis of static and deformable 3D objects. His contributions in these fields received the Best Paper Award at the IEEE International Conference on Shape Modeling (2006), the Best Paper Award at the NICOGRAPH Paper Contest (2007), the International Paper Grand Prix (Best Paper Award) from the Japan Society of Art and Science (2008), and the APRS/IAPR Best Paper Prize at the IEEE International Conference on Digital Image Computing: Techniques and Applications (DICTA) 2012.