Bag of shape descriptor using unsupervised deep learning for non-rigid shape recognition

https://doi.org/10.1016/j.image.2021.116297

Highlights

  • Our method is specially designed to learn high-level and hierarchical shape features from multi-scale context structures.

  • An improved decomposition strategy generates valuable contour fragments, resulting in local-to-global feature learning.

  • An unsupervised learning framework based on the context structure and an SSAE (stacked sparse autoencoder) is also applied to each contour fragment to express its features.

Abstract

Highly discriminative feature expression for non-rigid shape recognition is an important and challenging task that requires shape descriptors that are both abstract and robust. However, most existing low-level descriptors are hand-crafted and are sensitive to local changes and large deformations. To address this issue, this paper proposes a bag of shape descriptor based on unsupervised deep learning and Bag of Words (BoW) for shape recognition. Unlike existing pipelines, our method is specially designed to learn high-level and hierarchical shape features from multi-scale context structures. It effectively overcomes obstacles such as irregular topology, orientation ambiguity, and rigid or non-rigid transformation in the hierarchical learning of contour fragments. Specifically, an improved decomposition strategy decomposes the shape into a series of valuable contour fragments, resulting in local-to-global feature learning. An unsupervised learning framework based on the context structure and an SSAE (stacked sparse autoencoder) is then applied to each contour fragment to express its features. For shape representation, a high-level shape dictionary is learned by K-clustering to achieve discriminative feature coding. In addition, to obtain a compact and simplified shape representation, SPM (Spatial Pyramid Matching) with max-pooling is adopted, which effectively incorporates the spatial layout of the given shape. Experiments demonstrate that the proposed method achieves state-of-the-art performance on several public shape datasets compared with the latest approaches. Our method also performs well under noise and occlusion.

Introduction

As high-level visual information, shape features are easily memorized and recognized by the human brain, even when objects lose color, brightness, and texture [1], [2], [3]. Because shape is such a discriminative and sparse cue, shape-based object recognition is a fundamental and important task with various applications, such as robot navigation [3], gesture recognition [4], pedestrian detection [5], and object tracking [6]. Shape descriptors under rigid transformation have been widely studied in computer vision, and most of them are geometric or spectrum-based methods. However, forming a discriminative descriptor under large non-rigid shape changes, noise, and occlusion remains difficult [7], [8], [9]. Moreover, light-weight deep learning for 2D shape recognition still demands further exploration. It is more challenging to overcome the structural obstacles between shape features and deep learning, such as irregular topology, orientation ambiguity, and rigid or non-rigid transformation [10], [11]. To tackle these issues, we focus on learning discriminative shape features for non-rigid shape recognition based on deep learning and BoW (Bag of Words).

Traditionally, shape recognition is considered a fundamental classification problem consisting of three steps: feature expression, evaluation metrics, and classification optimization. One of the most important and difficult parts is feature expression, which directly affects recognition efficiency and accuracy. Our work therefore also investigates this step. In the last decades, many local and global shape descriptors have been proposed to extract discriminative features [11], [12], [13], [14], [15], [16], [17]. Global shape descriptors encode geometric and spatial attributes of a model into a feature space and then perform matching. Although some studies have achieved encouraging performance, especially in shape retrieval, they hardly cope with complex conditions such as severe occlusion, large local deformation, and clutter. Hence, they do not perform well across different non-rigid shape classes [18], [19]. In contrast, local methods achieve patch-wise or point-wise correspondences among fragments by constructing local feature descriptors. Local descriptors are therefore more effective and robust for incomplete and occluded shapes. Nevertheless, existing local descriptors are hand-crafted or based on fixed geometry such as normal, curvature, and distance. Moreover, these descriptors are constructed on a single region, which makes them highly sensitive to deformations at different scales and limits their discriminability. From these perspectives, it is highly desirable to explore feature learning models and multi-scale strategies to learn high-level feature patterns for shape recognition.

BoW was originally developed for NLP (natural language processing). In this framework, a highly discriminative feature expression is formed by encoding the context relationships among words. Based on these advantages, many researchers have applied BoW to shape recognition [1], [9], [19], [20]. A pioneering work, BoF (Bag of Features), was introduced first [2], where the shape is considered a document and represented by a set of shape words built from contour fragments. Through the shape words, the obtained dictionary is regarded as the basic primitive for shape representation. Finally, feature coding is used to achieve the final shape representation [21], [22], [23]. These methods are relatively stable and robust to small deformations, occlusion, and noise. Our method is therefore inspired by the success of the BoW framework. However, in all existing BoW methods the shape features are captured using low-level geometric descriptors, such as shape context [11], curvature [18], and skeleton paths [9], [24]. Furthermore, BoW discards the spatial information among the high-level shape features, which plays an essential role in enhancing the discriminability of feature representations. Unlike existing BoW proposals, we employ unsupervised deep learning to learn discriminative features from the contour fragments and to select correct shape correspondences by analyzing both intra-class and inter-class similarities and differences, especially when huge amounts of shapes are trained for public use.

In recent years, deep learning has attracted more and more research attention in feature representation. In addition, more intrinsic features of the training data can also be obtained in an unsupervised way, for example with GANs (generative adversarial networks) and autoencoders [25], [26], [27]. Our research is also inspired by deep learning and applies related ideas to shape recognition. Natural images and 3D grids are distributed on a regular grid with a clear parameterization, whereas an original 2D shape is not, so deep learning cannot learn features from it directly in the way an image CNN does. The significant challenges addressed in this paper are: (1) the topology mismatch between irregular shape features and regular deep learning models; (2) the multiple resolutions of contour fragments; (3) the permutation variance caused by the ambiguous orientation of shape features; and (4) the poor performance under rigid or non-rigid transformations of the shape. To tackle these issues, we adopt a context structure in terms of multi-views and local reference for shape feature learning. This structure does not introduce large information loss and retains the raw spatial distribution and geometric attributes of the given shape. It generalizes better than traditional local feature descriptors such as SIFT [23], HOG [24], and LBP [25]. Note that the regular and ordered context structure also enables the light-weight SSAE (stacked sparse autoencoder) to learn directly from contour fragments.
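The excerpt does not spell out the SSAE objective, so the following is only a minimal sketch of the cost of one sparse autoencoder layer in its common textbook form: reconstruction error, weight decay, and a KL-divergence sparsity penalty. The function name, the sigmoid activations, and the values of `rho`, `beta`, and `lam` are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_loss(X, W1, b1, W2, b2, rho=0.05, beta=3.0, lam=1e-4):
    """Cost of one sparse autoencoder layer (hypothetical helper).

    X  : (n_samples, n_inputs) context features, assumed scaled to [0, 1]
    W1 : (n_inputs, n_hidden) encoder weights, b1 : (n_hidden,)
    W2 : (n_hidden, n_inputs) decoder weights, b2 : (n_inputs,)
    rho, beta, lam are assumed hyperparameters, not values from the paper.
    """
    n = X.shape[0]
    H = sigmoid(X @ W1 + b1)        # hidden activations, the learned feature
    X_hat = sigmoid(H @ W2 + b2)    # reconstruction of the input

    # Average squared reconstruction error.
    recon = 0.5 * np.sum((X_hat - X) ** 2) / n
    # L2 weight decay (the regularization term).
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    # KL divergence between the target sparsity rho and the mean activation
    # of each hidden unit (the sparsity term).
    rho_hat = np.clip(H.mean(axis=0), 1e-8, 1.0 - 1e-8)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return recon + decay + beta * kl, H
```

In a stacked setting, a second layer of the same form would be trained on the hidden activations `H` returned by the first, which is one way to obtain the hierarchical features the text refers to.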

In this paper, we propose a novel shape descriptor framework to extract highly discriminative features from multi-scale context structures. Our key innovation is to enable irregular and multi-scale contour fragments to be learned effectively by combining the traditional BoW framework with deep learning. To obtain valuable and sufficient feature primitives, we redesign the shape decomposition strategy using geodesic distance. Sparsity and regularization terms are added to the objective function to decompose all context features into robust and compact elements. The high-level shape feature is then mapped into a new space via LLC (locality-constrained linear coding). Moreover, because BoW lacks spatial information among high-level shape features, we adopt SPM (Spatial Pyramid Matching) with max-pooling to incorporate the spatial correlations of a given shape [26], so that the final shape representation encodes not only the multi-scale structural features but also their spatial dependencies. Few papers on deep learning for 2D shape recognition are available. Our paper makes the following main contributions.

  • We design an improved shape decomposition method. The method takes sufficient and complementary contour fragments as the basic primitives for high-level feature expression. Compared with traditional decomposition methods, the improved contour fragments capture more sufficient and discriminative information.

  • We propose a novel unsupervised learning framework based on the context structure and the SSAE for shape feature expression, which makes it possible to learn high-level and hierarchical shape features from contour fragments. It also effectively overcomes structural obstacles between shape features and deep learning, such as irregular topology, orientation ambiguity, and rigid or non-rigid transformation of the shape.

  • We verify the coding discriminability of the high-level shape feature based on the obtained high-level shape dictionary, LLC, and SPM. The final shape feature is represented by highly correlated patterns and spatial relations, capturing the sparsity of shape words and compact spatial information.
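The SPM max-pooling step mentioned above can be sketched as follows. This is an illustrative sketch only: it assumes each contour fragment has a coding vector and a representative 2D position (for instance its midpoint) normalized to [0, 1) within the shape's bounding box. The function name `spm_max_pool` and the pyramid levels are hypothetical choices, not details from the paper.

```python
import numpy as np

def spm_max_pool(codes, positions, levels=(1, 2, 4)):
    """Spatial pyramid max-pooling of fragment codes (hypothetical helper).

    codes     : (n_fragments, K) coding vector of each contour fragment
    positions : (n_fragments, 2) fragment positions, assumed in [0, 1)
    levels    : grid sizes of the pyramid (an assumed setting)
    Returns one vector of length K * sum(g * g for g in levels).
    """
    codes = np.asarray(codes, dtype=float)
    positions = np.asarray(positions, dtype=float)
    K = codes.shape[1]
    pooled = []
    for g in levels:
        # Assign each fragment to a cell of the g x g grid.
        cells = np.minimum((positions * g).astype(int), g - 1)
        cell_ids = cells[:, 1] * g + cells[:, 0]
        for c in range(g * g):
            mask = cell_ids == c
            # Max over the fragments in the cell; empty cells pool to zeros.
            pooled.append(codes[mask].max(axis=0) if mask.any() else np.zeros(K))
    return np.concatenate(pooled)
```

Concatenating the per-cell maxima keeps the coarse layout of where each shape word responds, which is the spatial information a plain BoW histogram discards.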

Our work is organized as follows. Section 2 reviews related work. Section 3 introduces the details of the proposed unsupervised shape feature learning framework. The high-level shape dictionary is learned in Section 4. Shape coding and pooling are detailed in Section 5. The experimental parameter setup and result analysis are presented in Section 6. Finally, Section 7 concludes the paper and proposes directions for further work.

Section snippets

Related work

Three lines of related work are briefly introduced in this section: (1) hand-crafted shape descriptors, (2) BoW for shape recognition, and (3) deep learning for shape descriptors.

Overview of the proposed descriptor

An overview of the proposed method is introduced in the following four steps and also illustrated in Fig. 1.

Contour fragment extraction. First, a set of potential training shapes is decomposed into valuable parts using improved contour fragments, as marked in Fig. 2(b). Unlike traditional BoW, where a text or document is represented as an occurrence-frequency histogram of monosyllabic words, the improved contour fragment contains both local and global shape information. Therefore, we take

High-level shape dictionary learning

The learned high-level shape feature $h = \{h_k \mid k \in [I, M]\}$ is generated from the proposed learning framework at each contour fragment, where $k$ is the index of the SSAE output. More specifically, all the $h_k$ extracted from the contour fragments are collected into a high-level feature set $H$, such that $H = \{h_k^j \mid k \in [I, M],\ j \in [I, MV_k]\}$, where $MV_k$ denotes the number of high-level shape features in the set. To learn the high-level shape feature dictionary, the $h_k^j$ are clustered into $K_H$ clusters, where each
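The clustering step described above can be sketched as follows. The excerpt does not name the clustering algorithm, so plain Lloyd's k-means is assumed here for illustration. The function name `learn_dictionary`, the iteration cap, and the random initialization are likewise hypothetical.

```python
import numpy as np

def learn_dictionary(H, n_words, n_iter=50, seed=0):
    """Cluster feature rows into a dictionary (hypothetical helper).

    H       : (n_features, dim) high-level features from all fragments
    n_words : dictionary size (the number of clusters)
    Assumes Lloyd's k-means; the paper's exact algorithm is not shown here.
    """
    H = np.asarray(H, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centers with distinct, randomly chosen features.
    centers = H[rng.choice(len(H), size=n_words, replace=False)].copy()

    def assign(c):
        # Index of the nearest center (squared Euclidean distance) per row.
        d2 = ((H[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = centers.copy()
        for k in range(n_words):
            members = H[labels == k]
            if len(members):  # an empty cluster keeps its previous center
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(centers)
```

Each returned center then plays the role of one shape word. In practice a library routine such as scikit-learn's `KMeans` would be a reasonable substitute for this loop.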

Shape encoding and pooling

In BoW, a contour fragment is encoded by mapping its high-level feature into a new space based on the local shape dictionary. In the new space, contour fragments with high-level shape features are expressed better than by the raw information, thanks to an informative shape coding. Inspired by the latest works in [1], [24], we adopt LLC for the encoding, as it has proved effective and robust for object classification. The LLC method is constructed by minimizing the
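Since the objective is cut off in this excerpt, the following is only a sketch of the widely used approximated LLC solution, not necessarily the exact formulation in the paper. A feature is reconstructed from its nearest dictionary words under a sum-to-one constraint. The function name `llc_encode`, the neighbourhood size `knn`, and the regularization weight `beta` are assumptions.

```python
import numpy as np

def llc_encode(x, B, knn=5, beta=1e-4):
    """Approximated LLC code of one feature (hypothetical helper).

    x    : (dim,) high-level feature of a contour fragment
    B    : (n_words, dim) shape dictionary
    knn  : number of nearest words used (assumed setting)
    beta : small regularization weight (assumed setting)
    Returns an (n_words,) code with at most knn non-zero entries.
    """
    x = np.asarray(x, dtype=float)
    B = np.asarray(B, dtype=float)
    knn = min(knn, len(B))
    # Locality: keep only the knn dictionary words nearest to x.
    idx = np.argsort(((B - x) ** 2).sum(axis=1))[:knn]
    # Solve the small constrained least-squares problem on those words.
    Z = B[idx] - x                  # local words shifted so x is the origin
    C = Z @ Z.T                     # local covariance, shape (knn, knn)
    C += beta * max(np.trace(C), 1e-12) * np.eye(knn)  # keep C invertible
    w = np.linalg.solve(C, np.ones(knn))
    w /= w.sum()                    # enforce the sum-to-one constraint
    code = np.zeros(len(B))
    code[idx] = w
    return code
```

Restricting the solve to a few nearby words is what makes the code both local and sparse, and it keeps the cost per fragment independent of the dictionary size apart from the neighbour search.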

Result and analysis

In this section, the parameter setup and performance analysis of the proposed method for shape recognition are presented. We first discuss the parameter setup process. By analyzing how these parameters affect the shape recognition performance in the experiments, the parameter tuning procedure is determined. Then, the proposed method is compared with state-of-the-art shape recognition approaches on a variety of shape datasets, including the MPEG-7 dataset, the Swedish leaf dataset, Animal

Conclusion

In this paper, a novel bag of shape descriptor based on unsupervised deep learning and BoW is proposed for learning discriminative and compact shape features. Specifically, the improved contour fragments provide abundant basic primitives for high-level shape representation, resulting in local-to-global learning. The low-level shape feature is constructed using the context structure to support high-level and hierarchical shape features. This strategy can effectively overcome the obstacles between shape feature

CRediT authorship contribution statement

Linjie Yang: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Investigation, Writing - review & editing. Luping Wang: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Investigation, Writing - review & editing. Yijing Su: Conceptualization, Methodology, Visualization, Investigation, Writing - review & editing. Yin Gao: Conceptualization, Methodology, Visualization, Investigation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was supported by the National Science Foundation for Young Scientists of China (Grant No. 61906178) and the Science and Technology Program of Quanzhou, China (No. 2019C009R).

References (61)

  • Passalis N. et al. Neural Bag-of-Features learning. Pattern Recognit. (2017)
  • Krestenitis M. et al. Recurrent bag-of-features for visual information analysis. Pattern Recognit. (2020)
  • Latecki L.J. et al. Convexity rule for shape decomposition based on discrete contour evolution. Comput. Vis. Image Underst. (1999)
  • Dong Z. et al. A novel binary shape context for 3D local surface description. ISPRS J. Photogramm. Remote Sens. (2017)
  • Attalla E. et al. Robust shape similarity retrieval based on contour segmentation polygonal multiresolution and elastic matching. Pattern Recognit. (2005)
  • Roy O. et al. Defect sizing in thin components
  • Daliri M.R. et al. Robust symbolic representation for shape recognition and retrieval. Pattern Recognit. (2008)
  • Ramesh B. et al. Shape classification using invariant features and contextual information in the bag-of-words model. Pattern Recognit. (2015)
  • Karagöz C.S. et al. Coordinated navigation of multiple independent disk-shaped robots. IEEE Trans. Robot. (2014)
  • Poularakis S. et al. Low-complexity hand gesture recognition system for continuous streams of digits and letters. IEEE Trans. Cybern. (2016)
  • Cao J. et al. Pedestrian detection inspired by appearance constancy and shape symmetry
  • Biswas S. et al. An efficient and robust algorithm for shape indexing and retrieval. IEEE Trans. Multimedia (2010)
  • Han Z. BoSCC: Bag of spatial context correlations for spatially enhanced 3D shape representation. IEEE Trans. Image Process. (2017)
  • Yang J. Modeling point clouds with self-attention and gumbel subset sampling
  • Wang Y. et al. Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. (2019)
  • Mori G. et al. Shape contexts enable efficient retrieval of similar shapes
  • Han Z. et al. Unsupervised 3D local feature learning by circle convolutional restricted Boltzmann machine. IEEE Trans. Image Process. (2016)
  • Han Z. et al. Unsupervised learning of 3-D local features from raw voxels based on a novel permutation voxelization strategy. IEEE Trans. Cybern. (2019)
  • Bai X. et al. Shape vocabulary: A robust and efficient shape representation for shape matching. IEEE Trans. Image Process. (2014)
  • Lazebnik S. et al. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories
