JFLN: Joint Feature Learning Network for 2D sketch based 3D shape retrieval

https://doi.org/10.1016/j.jvcir.2022.103668

Abstract

Cross-modal retrieval attracts much research attention due to its wide applications in numerous search systems. Sketch-based 3D shape retrieval is a typical and challenging cross-modal retrieval task because of the huge divergence between the sketch modality and the 3D shape view modality. Existing approaches project the sketches and shapes into a common space for feature updating and data alignment. However, these methods have several disadvantages. First, most approaches ignore the modality-shared information for divergence compensation in the descriptor generation process. Second, traditional fusion methods for multi-view features introduce much redundancy, which decreases the discrimination of shape descriptors. Finally, most approaches focus only on cross-modal alignment and omit the modality-specific data relevance. To address these limitations, we propose a Joint Feature Learning Network (JFLN). First, we design a novel modality-shared feature extraction network that exploits both modality-specific characteristics and modality-shared information for descriptor generation. Subsequently, we introduce a hierarchical view attention module that gradually focuses on the effective information for multi-view feature updating and aggregation. Finally, we propose a novel cross-modal feature learning network that simultaneously contributes to the modality-specific data distribution and cross-modal data alignment. We conduct exhaustive experiments on three public databases, and the results validate the superiority of the proposed method. The full code is available at https://github.com/dlmuyy/JFLN.

Introduction

With the rapid development of multimedia and computer vision technologies, 3D shapes have been widely applied in many fields, such as computer-aided design, virtual reality, and 3D printing. Exploring effective shape retrieval algorithms is essential for managing the rapidly increasing number of 3D shapes. In many situations, traditional search methods do not work well. For example, text-based methods require a proper text description of a 3D shape as the query, which is challenging because some shapes contain many visual details. Example-based methods that take a 3D model as the query are straightforward but not convenient, because people often do not have access to 3D model data. Recently, sketch-based 3D shape retrieval has attracted much research attention due to its wide applications in fields such as industrial design and human–machine interaction (HMI). Exploring effective sketch-based 3D shape retrieval methods is therefore becoming increasingly significant.

However, implementing 3D shape retrieval with sketches as queries is a challenging task, since the divergence between sketch and 3D shape representations is significant. Recently, many methods have been proposed for the cross-modal retrieval task [1], [2], [3], [4], most of which separately extract modality-specific features and map them into a common space for further feature learning and sample similarity measurement. For example, Dai et al. [5] introduced a deep correlated holistic metric learning (DCHML) method for the cross-modal retrieval task. DCHML learns two transformations that map the features of the two modalities into a new feature space, and it employs correlation losses for discrepancy mitigation. Xie et al. [6] proposed a deep metric model for this task, in which a discriminative loss is applied to the Wasserstein barycenters of both 3D shapes and 2D sketches for cross-modal data alignment. Chen et al. [7] introduced a deep cross-modality adaptation model based on adversarial learning, which transfers sketch features to the 3D shape modality and uses the learned transformation to reduce the cross-modal divergence, improving the robustness and effectiveness of the network (see Fig. 1).
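To make the common-space paradigm shared by these methods concrete, the snippet below is a minimal, generic sketch (not any of the specific models cited above): two hypothetical linear heads project sketch and shape features into a joint embedding space, and shapes are ranked by cosine similarity to a query sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical projection heads mapping modality-specific features
# (e.g., CNN outputs) into a shared embedding space.
class CommonSpaceProjector(nn.Module):
    def __init__(self, sketch_dim=512, shape_dim=512, embed_dim=256):
        super().__init__()
        self.sketch_head = nn.Linear(sketch_dim, embed_dim)
        self.shape_head = nn.Linear(shape_dim, embed_dim)

    def forward(self, sketch_feat, shape_feat):
        # L2-normalize so cosine similarity reduces to a dot product.
        z_sketch = F.normalize(self.sketch_head(sketch_feat), dim=-1)
        z_shape = F.normalize(self.shape_head(shape_feat), dim=-1)
        return z_sketch, z_shape

# Toy retrieval: rank a gallery of shape embeddings against one sketch query.
proj = CommonSpaceProjector()
query = torch.randn(1, 512)        # one sketch feature
gallery = torch.randn(100, 512)    # 100 shape features
z_q, z_g = proj(query, gallery)
scores = z_q @ z_g.t()             # cosine similarities, shape (1, 100)
ranking = scores.argsort(dim=-1, descending=True)
```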

In general, traditional sketch-based approaches suffer from the following disadvantages:

(1) Most methods focus on extracting modality-specific characteristics but ignore the modality-shared information in the descriptor generation process. The huge divergence between separately extracted sketch and shape features limits the effectiveness of the subsequent feature learning process. Exploiting the modality-shared information could significantly reduce the difference between sketch and shape descriptors, which contributes to discrepancy mitigation.

(2) Extracting informative and descriptive 3D shape descriptors is one of the key issues in improving the effectiveness of feature learning in the 3D modality. Traditional fusion methods for multi-view features introduce much redundancy into the shape representation, which decreases the discrimination of shape descriptors. Exploring an effective feature aggregation mechanism that keeps as much descriptive information as possible while removing redundancy plays an important role in the feature learning task.

(3) Most cross-modal feature learning approaches mainly address the cross-modal data alignment problem, while the modality-specific data relevance, which markedly influences the data distribution within each modality, is omitted. Therefore, it is necessary to consider the data distribution both within and between modalities when learning cross-modal representations.

To address these problems, in this paper, we propose a novel Joint Feature Learning Network (JFLN) for sketch-based 3D shape retrieval. The motivation of JFLN is shown in Fig. 1. First, a novel modality-shared feature extraction module is designed for visual characteristic exploration, comprehensively utilizing both modality-specific and modality-shared information. Then, for the 3D shape modality, a hierarchical view attention module is introduced for view feature updating; it considers and explores the view-wise correlations to focus on the effective information and reduce multi-view redundancy, which significantly improves the effectiveness of multi-view feature aggregation. Meanwhile, we employ a balance module in the sketch modality to ensure that the learned sketch descriptors have the same dimension as the shape descriptors. Subsequently, we design a novel cross-modal feature learning module for global data alignment. This module contains three parts: intra-modal feature learning, cross-modal data alignment, and a global classifier. These components work together and simultaneously contribute to the modality-specific and cross-modal data distribution process, which effectively improves cross-modal learning performance. Finally, our model outputs representative sketch and shape descriptors for the cross-modal retrieval task.
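The exact loss formulation of the cross-modal feature learning module is not given in this excerpt; the following is only an illustrative sketch of how a three-part objective of this kind could be assembled, using a simple center-style surrogate for the intra-modal term, an MSE surrogate for cross-modal alignment, and a classifier shared by both modalities. All dimensions and weights are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def intra_modal_loss(embeddings, labels):
    # Pull same-class embeddings together within one modality
    # (a simple center-style surrogate, not the paper's exact term).
    loss = 0.0
    for c in labels.unique():
        members = embeddings[labels == c]
        loss = loss + ((members - members.mean(dim=0)) ** 2).sum(dim=1).mean()
    return loss / labels.unique().numel()

def cross_modal_alignment_loss(z_sketch, z_shape):
    # Align class-matched sketch/shape embeddings (assumes matched ordering).
    return F.mse_loss(z_sketch, z_shape)

# Global classifier shared by both modalities (class count is illustrative).
classifier = nn.Linear(256, 90)

def joint_loss(z_sketch, z_shape, labels, w_intra=0.1, w_align=1.0):
    cls = F.cross_entropy(classifier(z_sketch), labels) + \
          F.cross_entropy(classifier(z_shape), labels)
    intra = intra_modal_loss(z_sketch, labels) + intra_modal_loss(z_shape, labels)
    align = cross_modal_alignment_loss(z_sketch, z_shape)
    return cls + w_intra * intra + w_align * align
```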

Here, we summarize the contributions of this paper:

  • First, a novel modality-shared feature extraction architecture is proposed. It utilizes both modality-specific characteristics and modality-shared information for comprehensive feature extraction. In this way, the cross-modal data divergence can be effectively compensated during the visual feature extraction process.

  • Second, during the 3D shape feature learning process, we utilize the hierarchical view attention module to perform multi-view feature aggregation, which preserves the view dependencies and the visual content of the multiple views for a more effective representation (a simplified sketch of attention-weighted view aggregation is given after this list). As a result, the discrimination of the learned 3D shape descriptors is significantly improved.

  • Third, we propose a novel cross-modal feature learning module to further bridge the gap between different modalities. It not only controls the modality-specific feature distribution but also facilitates the cross-modal data alignment, which guarantees the final retrieval performance.
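As noted in the second contribution above, the view attention module weights the rendered views before aggregation. The hierarchical design is not detailed in this excerpt, so the snippet below is only a single-level, attention-weighted pooling sketch over per-view features; the dimensions and the two-layer scoring network are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ViewAttentionPooling(nn.Module):
    """Single-level attention pooling over per-view features
    (a simplified stand-in for a hierarchical attention module)."""
    def __init__(self, feat_dim=512, hidden_dim=128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, view_feats):                 # (batch, n_views, feat_dim)
        weights = torch.softmax(self.score(view_feats), dim=1)  # (batch, n_views, 1)
        return (weights * view_feats).sum(dim=1)                # (batch, feat_dim)

pool = ViewAttentionPooling()
views = torch.randn(4, 12, 512)   # 4 shapes, 12 rendered views each
shape_desc = pool(views)          # (4, 512) aggregated shape descriptors
```

The learned weights let informative views dominate the aggregated descriptor while near-duplicate views are suppressed, which is the redundancy-reduction effect the contribution describes.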

Moreover, we conduct comparative experiments on three public databases: SHREC’13, SHREC’14, and MI3DOR. Significant improvements over state-of-the-art methods are obtained. The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 presents the details of the proposed network. Section 4 presents the experiments that demonstrate the effectiveness of JFLN. Section 5 concludes the paper.

Section snippets

3D shape retrieval method

In general, existing 3D shape retrieval approaches can be roughly divided into two categories: model-based methods and view-based methods.

Model-based methods directly use the raw representation of 3D shapes, such as voxels [8], [9], [10], 3D meshes [11], [12], and point clouds [13], [14], [15], for the 3D shape retrieval task. Recently, the point cloud has become the most commonly utilized data format due to its effective representation of the structural characteristics of the 3D

Method

In this section, we introduce the details of our approach. The architecture is shown in Fig. 2. In general, it consists of three parts: (1) Shape representation: we render multiple views of the 3D shape from different angles in order to retain more visual and structural information; (2) Modality-shared feature extraction module: we design a novel architecture to extract features from sketch images and shapes, considering the modality-specific characteristics and the
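The excerpt above cuts off before detailing the modality-shared feature extraction module, so the following is only a hedged sketch of one plausible reading: modality-specific convolutional stems for sketches and rendered views, followed by a weight-shared trunk whose parameters are common to both modalities. Layer shapes and channel counts are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SharedSpecificExtractor(nn.Module):
    """Illustrative only: modality-specific stems plus a weight-shared trunk."""
    def __init__(self, feat_dim=512):
        super().__init__()
        # Modality-specific stems (sketches as 1-channel line drawings,
        # rendered views as 3-channel images).
        self.sketch_stem = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.view_stem = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        # Weight-shared trunk applied to both modalities.
        self.shared_trunk = nn.Sequential(
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, sketch, views):
        f_sketch = self.shared_trunk(self.sketch_stem(sketch))
        f_views = self.shared_trunk(self.view_stem(views))
        return f_sketch, f_views

extractor = SharedSpecificExtractor()
sketch = torch.randn(2, 1, 224, 224)       # a batch of 2 sketches
views = torch.randn(2 * 12, 3, 224, 224)   # 12 rendered views per shape, flattened
f_s, f_v = extractor(sketch, views)
```

Because the trunk weights are shared across modalities, both descriptors are computed in the same feature space, which is one way such a module could compensate for cross-modal divergence during feature extraction.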

Database

To demonstrate the effectiveness of the proposed method, we conduct comprehensive experiments on the SHREC’13 [42] and SHREC’14 [43] databases. As supplementary verification of the proposed network, we also validate JFLN on the MI3DOR database, which is used for the 2D image-based shape retrieval task. The databases are introduced in detail as follows:

SHREC’13 is a large-scale, widely used database built from the hand-drawn sketch database [32] and the Princeton Shape Benchmark (PSB) [44]

Conclusion

In this paper, we propose a novel Joint Feature Learning Network (JFLN) for the sketch-based 3D shape retrieval task. First, we employ a multi-view rendering method for 3D shape representation, with the purpose of retaining more visual and structural information. Subsequently, in order to compensate for the divergence between the sketch and 3D shape modalities, we introduce the modality-shared feature extraction architecture and the cross-modal feature learning network, which hierarchically reduce the

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (2020YFB1711704) and the National Natural Science Foundation of China (61872267, 61702471, 61772359). This work was also supported by the Tianjin Research Innovation Project for Postgraduate Students (Project Number: 2021YJSB154), Tianjin Municipal Education Commission.

References (53)

  • J. Chen, Y. Fang, Deep cross-modality adaptation via semantics preserving adversarial learning for sketch-based 3d...
  • Z. Liu et al., Point-voxel CNN for efficient 3D deep learning
  • Z. Han et al., Unsupervised learning of 3-D local features from raw voxels based on a novel permutation voxelization strategy, IEEE Trans. Cybern. (2017)
  • Y. Feng, Y. Feng, H. You, X. Zhao, Y. Gao, MeshNet: Mesh neural network for 3D shape representation, in: Proceedings of...
  • R. Klokov, V. Lempitsky, Escape from cells: Deep kd-networks for the recognition of 3d point cloud models, in:...
  • C.R. Qi, H. Su, K. Mo, L.J. Guibas, PointNet: Deep learning on point sets for 3d classification and segmentation, in:...
  • X. Liu, Z. Han, Y.-S. Liu, M. Zwicker, Point2Sequence: Learning the shape representation of 3d point clouds with an...
  • Y. Yang, C. Feng, Y. Shen, D. Tian, FoldingNet: Point cloud auto-encoder via deep grid deformation, in: Proceedings of...
  • Y. Shen, C. Feng, Y. Yang, D. Tian, Mining point cloud local structures by kernel correlation and graph pooling, in:...
  • A. Cheraghian et al., 3DCapsule: Extending the capsule architecture to classify 3D point clouds
  • Z. Han et al., 3D2SeqViews: Aggregating sequential views for 3D global feature learning by CNN with hierarchical attention aggregation, IEEE Trans. Image Process. (2019)
  • T. Yu, J. Meng, J. Yuan, Multi-view harmonized bilinear network for 3d object recognition, in: Proceedings of the IEEE...
  • A. Kanezaki, Y. Matsushita, Y. Nishida, RotationNet: Joint object categorization and pose estimation using multiviews...
  • B. Gong et al., Hamming embedding sensitivity guided fusion network for 3D shape representation, IEEE Trans. Image Process. (2020)
  • K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, in: International...
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with...
This paper has been recommended for acceptance by Zicheng Liu.