Abstract
In this paper, we propose a novel pooling approach for shape classification and recognition within the bag-of-words pipeline, based on topological persistence, a recent tool from Topological Data Analysis. Our technique extends standard max-pooling, which summarizes the distribution of a visual feature with a single number, thereby losing any notion of spatiality. Instead, we propose to use topological persistence, and the derived persistence diagrams, to provide significantly more informative and spatially sensitive characterizations of the feature functions, which can lead to better recognition performance. Unfortunately, despite their conceptual appeal, persistence diagrams are difficult to handle: they are not naturally represented as vectors in Euclidean space, and even the standard metric, the bottleneck distance, is not easy to compute. Furthermore, classical distances between diagrams, such as the bottleneck and Wasserstein distances, do not allow one to build the positive definite kernels needed for learning. To handle this issue, we provide a novel way to transform persistence diagrams into vectors, in which comparisons are trivial. Finally, we demonstrate the performance of our construction on the Non-Rigid 3D Human Models SHREC 2014 dataset, where we show that topological pooling can provide significant improvements over standard pooling methods for shape pose recognition within the bag-of-words pipeline.
1 Introduction
In recent years, databases of 3-dimensional objects have been growing larger and larger. Many algorithms relying on retrieval have been proposed to process these databases automatically; however, for certain tasks, classification techniques can be more efficient. Efficient classification pipelines have been proposed for images, and some of their components, such as bag-of-words methods [1] or feature learning using deep network architectures [2], have been used to perform retrieval and shape comparison. Traditionally, the bag-of-words method relies on extracting an unordered collection of descriptors from the shapes under consideration, which are then quantized into a set of vectors called “words”. The information produced by this quantization is then summarized by a pooling scheme, which yields a vector usable by standard learning algorithms. Ideally, every step of this framework should be robust to transformations of the shape: translations, rotations, changes of scale, etc. Modern bag-of-words approaches for 3D shapes usually rely on a pooling method called sum-pooling [1], which consists in averaging the value of each word across the shape.
Since its introduction for image processing in [3], the bag-of-words pipeline, which we present in Sect. 2, has been improved in various ways. Here, we focus on the pooling part of the framework. Apart from the traditional sum-pooling approach, a popular pooling method called max-pooling, introduced in [4], consists in taking the maximum of the value of each visual word. Several works have highlighted the improvement in accuracy obtained using this pooling scheme, as well as its compatibility with the linear kernel for learning purposes [4, 5]. The strength of max-pooling is due in part to its remarkable robustness properties. One of the main assumptions made in the bag-of-words approach is that the “word” values that compose the output of the encoding step are, for a given class and a given word, i.i.d. random variables. Refinements of the max-pooling scheme have been proposed under this assumption: for instance, [6] proposed to consider the k highest values of each word to estimate the probability of at least k features being present in the object. However, the independence assumption on the word functions is unrealistic: for 3D shapes, nearby vertices tend to have similar word function values, as illustrated in Fig. 1. Thus, in this example, the generalization proposed by [6] ends up capturing the same feature multiple times and providing multiple redundant values. On the other hand, pooling over different parts of an image [7] or a 3D shape [8, 9] has been proposed to take advantage of spatial information, an approach known as Spatial Pyramid Matching. This approach has drastically improved the performance of bag-of-words procedures on multiple datasets, although it contradicts the identically distributed assumption and lacks proper robustness guarantees.
In this work, we propose to see the word functions not as an unordered collection of random values but as a random function defined on the vertices of a graph (in our case, the mesh of the shape). Following this approach, we propose to use persistent homology to capture information about the global structure of the word functions that is not available to the traditional max-pooling approach.
Persistent homology was first introduced in the context of Topological Data Analysis under the name size theory [10]. It was later generalized to higher dimensions as persistent homology theory [11, 12]. The 0-dimensional persistent homology of the superlevel sets of a function encodes the prominence of the peaks of the function into a collection of points in the plane, called a persistence diagram. These diagrams enjoy strong robustness properties [13–15]. One option for comparing persistence diagrams is to use a distance between diagrams, such as the bottleneck distance, together with nearest-neighbor algorithms, as was done in [16]. In this work, however, we aim at using classification algorithms such as SVM or logistic regression, which require a Hilbert space structure that the space of persistence diagrams lacks. One approach to tackle this issue is to make use of the “kernel trick” by using a positive-definite kernel to map persistence diagrams into a Hilbert space. As recently shown by Reininghaus et al. [17], one cannot rely on natural distances such as the Wasserstein distance to build traditional distance-based kernels such as the Gaussian kernel. This led the authors to propose another kind of kernel. A major limitation of their approach, however, is that such kernels are non-linear, so the classification cost grows linearly with the size of the training set, which causes scalability issues. Another approach, directly embedding persistence diagrams into a Hilbert space, was proposed in [18]. However, this embedding is highly memory-consuming, as it maps a single diagram to a set of functions, and is therefore not appropriate for dealing with large datasets.
In this work, we propose to perform pooling by computing the persistence diagram of each word function. We then map these persistence diagrams into \(\mathbb {R}^d\) for some reasonable value of d (below 20 in practice) by considering the peaks with highest prominence. Since we provide a direct mapping of persistence diagrams into \(\mathbb {R}^d\), we can use it as the pooling stage of the bag-of-words procedure and achieve good classification performance. We call this pooling approach Topological Pooling. Since it relies on persistence diagrams, this method is stable with respect to most transformations the shape can undergo, translations, rotations, etc., as long as the descriptors used as input are also invariant to these transformations. Moreover, we show that this pooling approach is robust to perturbations of the descriptors. Finally, we demonstrate the validity of our approach compared to both sum-pooling and max-pooling by performing pose recognition on the SHREC 2014 dataset.
2 The Bag of Words Pipeline
The bag-of-words pipeline consists of three main steps: feature extraction, coding and pooling. Here we describe each step briefly, taking a functional point of view, and introduce the notation we will need to define our new pooling method. We will assume that the input to the pipeline is a set of M 3D shapes \(G_i\) represented as triangle meshes with vertices \(V_i\).
Feature extraction aims at deriving a meaningful representation of the shape: the feature function, denoted \(\mathcal {F}_i : V_i \rightarrow \mathbb {R}^N\). It is usually done by computing local descriptors (such as HKS [19], SIHKS [20], WKS [21], ShapeNet features [2], etc.) at each vertex of the mesh.
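As an illustration, here is a minimal sketch of one such descriptor, the HKS, under the assumption that the Laplace-Beltrami eigenpairs of the mesh have already been computed (the eigendecomposition itself, e.g. of a cotangent Laplacian, is omitted):

```python
import numpy as np

def heat_kernel_signature(evals, evecs, times):
    """HKS(v, t) = sum_k exp(-lambda_k * t) * phi_k(v)^2.

    evals: (K,) Laplace-Beltrami eigenvalues (assumed precomputed),
    evecs: (|V|, K) corresponding eigenfunctions, times: (T,) time scales.
    Returns a (|V|, T) matrix: one T-dimensional descriptor per vertex.
    """
    decay = np.exp(-np.outer(evals, times))  # (K, T): heat decay per eigenpair
    return (evecs ** 2) @ decay              # (|V|, T)

# e.g. descriptors = heat_kernel_signature(evals, evecs, np.logspace(-2, 0, 10))
```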
The purpose of coding is to decompose the values of \(\mathcal {F}_i\) by projecting them onto a set of points \(W=(w_k)_{k \in [|1,K|]} \subset \mathbb {R}^N\) called a codebook. This replaces each feature function by a family of functions \((C_{i} : V_i \rightarrow \mathbb {R}^K)_{i \in [1,M]}\), called the word functions. In other words, for a coding procedure Coding and a codebook W, the \(C_i\) are defined through

$$C_i(v) = \mathrm {Coding}(\mathcal {F}_i(v), W), \qquad v \in V_i.$$
There exist various coding methods, such as Vector Quantization [22], Sparse Coding [4], Locality-constrained Linear Coding [23], the Fisher Kernel [24] or Supervector coding [25]. The codebook is usually computed using K-means, but supervised codebook learning methods [5, 23] generally achieve better accuracy. In the Sparse Coding approach, the one we use in this paper, W and C are computed on the training set following

$$\min _{W, C} \; \sum _{i=1}^{M} \sum _{v \in V_i} \Vert \mathcal {F}_i(v) - W\, C_i(v)\Vert _2^2 + \lambda \Vert C_i(v)\Vert _1,$$

with constraints \(\Vert w_k\Vert \le 1\) and regularization parameter \(\lambda \). During the testing phase, the optimization is performed only on C, with the codebook already computed.
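For concreteness, the following sketch implements this coding step with scikit-learn; note that our experiments use the SPAMS toolbox [29] instead, and the parameter values here (n_words, lam, the solver choice) are illustrative, not those of the paper.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

def learn_codebook(train_features, n_words=64, lam=0.15):
    """Learn the codebook W from training descriptors (one row per vertex)."""
    dico = MiniBatchDictionaryLearning(n_components=n_words, alpha=lam)
    return dico.fit(train_features).components_  # (K, N): rows are the words w_k

def word_functions(features, codebook, lam=0.15):
    """Coding step: one K-dimensional sparse code per vertex of a shape."""
    return sparse_encode(features, codebook, algorithm='lasso_lars', alpha=lam)
```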
The pooling step aims at summarizing properties of the family \((C_{i})_{i \in [1,M]}\) and representing them through compact vectors \((\mathcal {P}_i)_{i \in [1,M]}\), which can then be used in standard learning algorithms such as the SVM (Support Vector Machine). Usually, the pooling method depends on the coding scheme. For Vector Quantization, one traditionally uses sum-pooling:

$$\mathcal {P}_i[k] = \frac{1}{|V_i|} \sum _{v \in V_i} C_i(v)[k], \qquad k \in [|1,K|].$$
Max-pooling was introduced along with the Sparse Coding scheme by Yang et al. [4]. With this pooling technique, we summarize a function by its maximum:

$$\mathcal {P}_i[k] = \max _{v \in V_i} C_i(v)[k], \qquad k \in [|1,K|].$$
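Once the word-function values of a shape are arranged in a matrix with one row per vertex, both pooling schemes are one-liners; a minimal sketch:

```python
import numpy as np

def sum_pooling(codes):
    """codes: (|V|, K) matrix of word-function values, one row per vertex.
    Sum-pooling averages each word across the shape."""
    return codes.mean(axis=0)

def max_pooling(codes):
    """Max-pooling keeps only the per-word maximum."""
    return codes.max(axis=0)
```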
It is interesting to note that max-pooling is more robust than sum-pooling: it is invariant to the usual transformations a shape can undergo (translations, rotations, changes of scale, etc.). However, it is still quite limited, as it summarizes a whole function by a single value. A natural idea is to not restrict ourselves to the global maximum of the function but to capture all local maxima. In this naive form, however, the method yields a very unstable pooling vector, since arbitrarily small perturbations of the word functions can create many local maxima, as shown in Fig. 2 and in the toy experiment below. Thus, a pooling approach consisting of taking the k highest local maxima is not stable. On the other hand, in the example of Fig. 2 we can see that, while the noisy function has many local maxima, both functions exhibit only two “prominent peaks”. These notions of “peak” and “prominence” are properly defined in the 0-dimensional persistent homology framework, which provides us with the tools to derive a robust pooling method.
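The following toy experiment (a hypothetical one-dimensional signal on a path graph) illustrates this instability: a perturbation of small amplitude preserves the two prominent peaks but creates many spurious local maxima.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 500)
f = np.sin(4 * np.pi * x)                    # two prominent peaks
g = f + 0.01 * rng.standard_normal(x.size)   # small additive noise

def count_local_maxima(v):
    # strict interior local maxima on a path graph
    return int(np.sum((v[1:-1] > v[:-2]) & (v[1:-1] > v[2:])))

print(count_local_maxima(f))  # 2: the two true peaks
print(count_local_maxima(g))  # typically dozens of spurious maxima
```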
3 Introducing 0-dimensional Persistent Homology
0-dimensional persistent homology provides a formal definition of prominence and measures the prominence of each peak of a function f, with the guarantee that the most prominent peaks are stable under small perturbations of f. We provide a brief overview of the computation of 0-dimensional persistent homology for the superlevel sets of a function defined on a graph, and invite the reader to consult [11] for a more general introduction.
Let f be a function defined on the vertices of a finite graph \(G = (V,E)\). In 0-dimensional persistent homology, one focuses on the evolution of the connectivity of the subgraphs \(F_\alpha \) of G induced by the superlevel sets of f: \(F_\alpha = (\{v \in V \mid f(v) \ge \alpha \}, \{ (u,v) \in E \mid \min (f(u), f(v)) \ge \alpha \})\), as \(\alpha \) decreases from \(+\infty \) to \(-\infty \), as shown in Fig. 3. A vertex v is a local maximum if, for any edge (v, u) in E, we have \(f(u) \le f(v)\). A peak p corresponds to a local maximum \(v_p\) of f with value \(b_p = f(v_p)\); we say that p is born at \(b_p\), see Fig. 3(b). For a local maximum \(v_p\), let \(C(v_p, \alpha )\) be the connected component of \(v_p\) in \(F_\alpha \), and let \(d_p\) be the largest value of \(\alpha \) such that the maximum of f over \(C(v_p, \alpha )\) is larger than \(b_p\); we say that p dies at \(d_p\). Intuitively, a peak dies when its connected component merges with that of another peak with a higher maximum. Thus, there exists a vertex \(u_p\) connecting the two components such that \(f(u_p) = d_p\); \(u_p\) is called a saddle, see Fig. 3(c). The prominence of p is then the difference \(b_p - d_p\). The peak corresponding to the global maximum of f dies when \(\alpha \) reaches the minimum value of f on G (Footnote 1). Thus, a peak of f can be described by the pair \((b_p, d_p)\). The set of such points (with multiplicity) in the plane is called a persistence diagram, denoted \(\Delta _f\), see Fig. 3(g).
Persistence diagrams are endowed with a natural metric called the bottleneck distance, whose definition involves the notion of partial matching. A partial matching M between two diagrams \(\Delta _1\) and \(\Delta _2\) is a subset of \(\Delta _1 \times \Delta _2\) in which each point of \(\Delta _1\) and \(\Delta _2\) appears at most once. The bottleneck cost C(M) of a partial matching M between \(\Delta _1\) and \(\Delta _2\) is the infimum of the values \(\delta \ge 0\) satisfying the following conditions:
-
For any \((p_1, p_2) \in M\), \(||p_1 - p_2||_\infty \le \delta \), and
-
For any point (b, d) of \(\Delta _1\) or \(\Delta _2\) that is not matched by M, \(b-d \le 2\delta \).
The bottleneck distance between two diagrams \(\Delta _1\) and \(\Delta _2\) is then defined as:

$$d_B(\Delta _1, \Delta _2) = \inf _{M} C(M),$$

where the infimum ranges over all partial matchings M between \(\Delta _1\) and \(\Delta _2\).
Intuitively, the bottleneck distance can be seen as the cost of a minimum perfect matching between persistence diagrams (with the possibility of matching points to the diagonal \(y=x\)), where the cost is the length of the longest edge of the matching, see Fig. 4. A remarkable property of persistence diagrams, proven in [13, 15], is their robustness with respect to perturbations of f. Given two functions f and g defined on the same graph G, we have:

$$d_B(\Delta _f, \Delta _g) \le \Vert f - g\Vert _\infty . \qquad \qquad (1)$$
In other words, if we compare the diagram of a function f with that of a noisy version \(\tilde{f}\), then each point \(p \in \Delta _{\tilde{f}}\) is either matched to a point of \(\Delta _f\) or has low prominence, see Fig. 4.
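To make the definition concrete, here is a brute-force computation of the bottleneck distance for very small diagrams; its cost is factorial in the number of points, so it is an illustration only (libraries such as GUDHI provide efficient implementations).

```python
import itertools

def _diag(p):
    m = (p[0] + p[1]) / 2.0
    return (m, m)

def bottleneck_naive(dgm1, dgm2):
    """Exact bottleneck distance between two *small* diagrams, given as
    lists of (birth, death) points with birth >= death. Each diagram is
    augmented with the diagonal projections of the other's points, and we
    brute-force over all perfect matchings of the augmented sets."""
    n1, n2 = len(dgm1), len(dgm2)
    A = [tuple(p) for p in dgm1] + [_diag(p) for p in dgm2]
    B = [tuple(p) for p in dgm2] + [_diag(p) for p in dgm1]
    best = float('inf')
    for perm in itertools.permutations(range(n1 + n2)):
        cost = 0.0
        for i, j in enumerate(perm):
            if i >= n1 and j >= n2:
                continue                     # diagonal matched to diagonal: free
            cost = max(cost, abs(A[i][0] - B[j][0]), abs(A[i][1] - B[j][1]))
        best = min(best, cost)
    return best

# A small perturbation moves the diagram by at most its amplitude (Eq. 1):
print(bottleneck_naive([(5.0, 1.0), (3.0, 2.0)], [(5.1, 1.0), (3.0, 2.2)]))  # ~0.2
```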
Computation. As 0-dimensional persistence encodes the evolution of the connectivity of the superlevel sets of a function, it can be computed using a simple variant of a union-find algorithm; in practice we use Algorithm 1 described by Chazal et al. [26], with parameter \(\tau \) set to infinity. This algorithm has near-linear complexity in the number of vertices of the mesh; more precisely, it runs in \(O(|V| \log (|V|) + |V|\alpha (|V|))\), where \(\alpha \) is the inverse Ackermann function.
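A minimal version of this computation could look as follows; this is a sketch in the spirit of the union-find approach, not a verbatim transcription of Algorithm 1 of [26], and it handles the global peak as described in Footnote 1.

```python
import numpy as np

def persistence_0d(values, edges):
    """0-dimensional persistence of the superlevel-set filtration of a
    function on a graph. values: (|V|,) array; edges: (u, v) index pairs.
    Returns a list of (birth, death) pairs, one per peak."""
    order = np.argsort(-values)              # vertices by decreasing value
    neighbors = [[] for _ in values]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    parent, peak = {}, {}                    # union-find forest; root -> birth value

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    diagram = []
    for v in order:
        roots = {find(u) for u in neighbors[v] if u in parent}
        if not roots:                        # no higher neighbor: a peak is born
            parent[v], peak[v] = v, values[v]
            continue
        winner = max(roots, key=lambda r: peak[r])
        parent[v] = winner                   # v joins the component with highest peak
        for r in roots - {winner}:
            diagram.append((peak[r], values[v]))  # the lower peak dies at this saddle
            parent[r] = winner
    # the global maximum dies at the global minimum (cf. Footnote 1)
    diagram.append((peak[find(order[0])], values[order[-1]]))
    return diagram

# Path graph: values 3, 1, 2.5, 0.5, 2 -> [(2.5, 1.0), (2.0, 0.5), (3.0, 0.5)]
```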
4 Using Persistence Diagrams for Pooling
As we mentioned at the end of Sect. 2, a simple idea to enhance the max-pooling approach is to consider the values of multiple local maxima. However, this can be highly unstable under small perturbations of the word functions. As we saw in Sect. 3, persistence diagrams let us deal with this issue. Given a persistence diagram \(\Delta \), we define the prominence p of a point \((b,d) \in \Delta \) by \( p = b - d\); in other words, the prominence corresponds to the lifespan of a peak during the computation of the persistence diagram. Given a function f on a graph G, we define the infinite-dimensional Topological Pooling vector of f, whose i-th coordinate is given by

$$\mathrm {TP}(f)[i] = p_i(\Delta _f),$$
where \(p_i(\Delta _f)\) is the i-th highest prominence of the points of \(\Delta _f\) if there are at least i points in \(\Delta _f\), and 0 otherwise. Since the stability of persistence diagrams given in Eq. 1 implies the stability of the prominences of the points of \(\Delta _f\), this construction yields some stability for our pooling scheme.
Proposition 1
Let G be a graph with vertex set V, and let f and g be two functions on G. Then, for any integer n and any \(0<k <n\),

$$\big |\, \mathrm {TP}(f)[k] - \mathrm {TP}(g)[k] \,\big | \;\le \; 2\, \Vert f - g \Vert _\infty .$$
Of course, in practice we cannot use an infinite-dimensional vector, so we simply truncate it, keeping its n first coordinates; we denote the resulting pooling vector “TopoPool-n”. Using the notation of Sect. 2, and writing \(C_i^k : V_i \rightarrow \mathbb {R}\) for the k-th word function of shape i, given some \(n > 0\), the pooling vectors \((\mathcal {P}_i)_{1 \le i \le M}\) we consider are

$$\mathcal {P}_i = \big ( \mathrm {TP}(C_i^k)[j] \big )_{1 \le k \le K,\, 1 \le j \le n} \in \mathbb {R}^{nK},$$

the concatenation, over the K words, of the n highest prominences of each word function.
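Combining this truncation with the persistence routine sketched in Sect. 3 yields the complete pooling step; persistence_0d below refers to that earlier sketch.

```python
import numpy as np

def topo_pool(codes, edges, n=4):
    """TopoPool-n: for each of the K word functions (columns of `codes`),
    keep the n highest prominences of its 0-dimensional persistence
    diagram, computed with persistence_0d above. Output size: n * K."""
    pooled = []
    for k in range(codes.shape[1]):
        proms = sorted((b - d for b, d in persistence_0d(codes[:, k], edges)),
                       reverse=True)
        pooled.extend((proms + [0.0] * n)[:n])   # zero-pad, then truncate to n
    return np.asarray(pooled)
```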
5 Experiments
In this section, we evaluate sum-pooling, max-pooling and our Topological Pooling approach on the SHREC 2014 dataset “Shape Retrieval of Non-Rigid 3D Human Models” [27], which we modify by applying a random rotation to each 3D shape. The dataset is composed of 400 meshes of 40 subjects taking 10 different poses (Fig. 5), and we wish to classify each mesh with respect to the pose taken by the subject. We consider both SIHKS features [20] and curvature-based features corresponding to the unary features from [28], composed of 64 values corresponding to the curvatures, the Gaussian curvature, the mean curvature ... The coding step is performed using Sparse Coding [4], and the computations are performed using the SPAMS toolbox [29]. The learning part is done using a Support Vector Machine; a sketch of the evaluation loop appears below. We use 3 shapes per class for the training set, 2 for the validation set and 5 for the testing set. We compare the traditional sum-pooling with our TopoPool-n for different values of n (note that \(n=1\) is equivalent to max-pooling) and different codebook sizes. As a baseline, we also display the results obtained using rigid Iterated Closest Point (ICP) registration [30] with 1-nearest-neighbour classification; ICP iteratively minimizes the distance between two point clouds through rigid deformations. In our case, this amounts to finding the correct rotation to align the shapes, since two shapes in a similar pose are close; however, the approach can fail if it gets stuck in a local minimum and cannot recover the correct rotation. We run the experiment one hundred times, selecting the training and testing sets at random, and display the mean accuracy over these runs in Table 1.
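For reference, here is a sketch of this repeated random-split evaluation using scikit-learn's linear SVM; the classifier settings are illustrative rather than the exact ones used in our experiments.

```python
import numpy as np
from sklearn.svm import LinearSVC

def evaluate(pooled, labels, n_runs=100, n_train=3, n_val=2, seed=0):
    """Mean accuracy over repeated random splits (3 train / 2 validation /
    5 test shapes per class). The SVM parameters here are illustrative;
    in practice they would be tuned on the validation shapes."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        train, test = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.where(labels == c)[0])
            train.extend(idx[:n_train])
            test.extend(idx[n_train + n_val:])
        clf = LinearSVC(C=1.0).fit(pooled[train], labels[train])
        accs.append(clf.score(pooled[test], labels[test]))
    return float(np.mean(accs))
```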
The most noticeable fact about our experiments is the overall better results obtained by our Topological Pooling scheme compared to max-pooling and sum-pooling for the SIHKS features. In the case of curvature features, Topological Pooling and sum-pooling give similar accuracy for large codebooks, but for smaller codebooks Topological Pooling gives much better results. It is interesting to notice that the gap between the different pooling schemes decreases as the size of the codebook increases: the smaller the codebook, the richer each word function is in terms of topology, and thus the richer each persistence diagram.
Regarding the running time of our experiment in the case of SIHKS features, online testing of a given shape with the bag-of-words procedure and the largest codebook takes around 40 seconds, most of which is devoted to computing the SIHKS. Performing ICP between two shapes takes 6 seconds, so the online testing time for a single shape with ICP is 6 seconds times the cardinality of our training set, in our case 5 minutes. On the other hand, the ICP approach requires no offline training, while the bag-of-words approach requires computing the codebook, running the whole pipeline on each training shape and training the SVM, which takes roughly 45 minutes. Overall, to classify the 350 test shapes, the bag-of-words approach requires four and a half hours, while the ICP approach requires more than a day.
6 Conclusion
In this paper, we proposed to use the canonical graph structure of shapes to capture neighborhood information between the different feature vectors. We thus built discrete “word functions” on this graph instead of following the traditional approach of considering a collection of independent “word” vectors. We then proposed new pooling features making use of this additional information, generalizing the classical max-pooling approach through the critical points of the “word functions”, and used 0-dimensional persistent homology to ensure the stability of a pooling output relying on these features. Finally, we designed a new pooling method based on these features and experimentally showed their effectiveness in a pooling context.
Notes
1. This point is slightly different from the traditional persistent homology framework. Usually, the death value of the peak corresponding to the global maximum is set to \(-\infty \).
References
Bronstein, A.M., Bronstein, M.M., Guibas, L.J., Ovsjanikov, M.: Shape Google: geometric words and expressions for invariant shape retrieval. ACM Trans. Graph. 30, 1–20 (2011)
Masci, J., Boscaini, D., Bronstein, M.M., Vandergheynst, P.: ShapeNet: convolutional neural networks on non-Euclidean manifolds (2015). http://arxiv.org/abs/1501.06297
Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 524–531. IEEE Computer Society, Washington, DC (2005). http://dx.doi.org/10.1109/CVPR.2005.16
Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
Boureau, Y.-L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. In: Proceedings of CVPR (2010)
Liu, L., Wang, L., Liu, X.: In defense of soft-assignment coding. In: ICCV 2011, pp. 2486–2493 (2011)
Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2006, vol. 2, pp. 2169–2178 (2006)
López-Sastre, R.J., García-Fuertes, A., Redondo-Cabrera, C., Acevedo-Rodríguez, F.J., Maldonado-Bascón, S.: Evaluating 3D spatial pyramids for classifying 3D shapes. Comput. Graph. 37, 473–483 (2013)
Li, C., Hamza, A.B.: Intrinsic spatial pyramid matching for deformable 3D shape retrieval. IJMIR 2, 261–271 (2013)
Verri, A., Uras, C., Frosini, P., Ferri, M.: On the use of size functions for shape analysis. Biol. Cybern. 70, 99–107 (1993)
Edelsbrunner, H., Harer, J.: Computational Topology - An Introduction. American Mathematical Society, New York (2010)
Zomorodian, A., Carlsson, G.: Computing persistent homology. Discrete Comput. Geom. 33, 249–274 (2005)
Cohen-Steiner, D., Edelsbrunner, H., Harer, J.: Stability of persistence diagrams. In: Proceedings of 21st ACM Symposium Computer Geometry, pp. 263–271 (2005)
Chazal, F., Cohen-Steiner, D., Guibas, L.J., Glisse, M., Oudot, S.Y.: Proximity of persistence modules and their diagrams. In: Proceedings of 25th ACM Symposium Computer Geometry (2009)
Chazal, F., de Silva, V., Glisse, M., Oudot, S.: The structure and stability of persistence modules (2012). http://arxiv.org/abs/1207.3674
Li, C., Ovsjanikov, M., Chazal, F.: Persistence-based structural recognition. In: CVPR, pp. 2003–2010 (2014)
Reininghaus, J., Huber, S., Bauer, U., Kwitt, R.: A stable multi-scale kernel for topological machine learning. In: CVPR (2015)
Bubenik, P.: Statistical topology using persistence landscapes. JMLR 16, 77–102 (2015)
Sun, J., Ovsjanikov, M., Guibas, L.: A concise, provably informative multi-scale signature based on heat diffusion. In: Proceedings of the Symposium on Geometry Processing, SGP 2009, pp. 1383–1392 (2009)
Bronstein, M.M., Kokkinos, I.: Scale-invariant heat kernel signatures for non-rigid shape recognition. In: Proceedings of CVPR (2010)
Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110, 346–359 (2008)
Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill Inc., New York (1986)
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2010)
Perronnin, F., Sánchez, J., Mensink, T.: Improving the fisher kernel for large-scale image classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 143–156. Springer, Heidelberg (2010)
Zhou, X., Yu, K., Zhang, T., Huang, T.S.: Image classification using super-vector coding of local image descriptors. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 141–154. Springer, Heidelberg (2010)
Chazal, F., Guibas, L.J., Oudot, S.Y., Skraba, P.: Persistence-based clustering in Riemannian manifolds. J. ACM 60, 41 (2013)
Pickup, D., et al.: SHREC 2014 track: Shape retrieval of non-rigid 3D human models, EG 3DOR 2014 (2014)
Kalogerakis, E., Hertzmann, A., Singh, K.: Learning 3D mesh segmentation and labeling. ACM Trans. Graph. 29, 102 (2010)
Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res. 11, 19–60 (2010)
Besl, P.J., McKay, N.D.: A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14, 239–256 (1992)
Acknowledgements
This work was supported by ANR project TopData ANR-13-BS01-0008. The first author was supported by the French Délégation Générale de l'Armement (DGA). The second author was supported by Marie-Curie CIG-334283-HRGP, a CNRS chaire d'excellence, a chaire Jean Marjoulet from École Polytechnique, and a Faculty Award from Google Inc.