Abstract
With the development of the Internet and other digital technologies, the speed of data generation has become considerably faster than the speed of data processing. Because big data typically contain a large amount of redundant information, it is possible to significantly simplify a big dataset while maintaining its key information. In this paper, we develop a high-dimensional (HD) dataset simplification method based on the eigenvalues and eigenfunctions of the Laplace-Beltrami operator (LBO). Specifically, given a dataset that can be considered as an unorganized point set in an HD space, a discrete LBO defined on the HD dataset is constructed, and its eigenvalues and eigenvectors are calculated. The local extremum points and saddle points of the eigenvectors are then taken as the feature points of the HD dataset, and together they constitute the simplified dataset. Moreover, we develop feature point detection methods for functions defined on an unorganized point set in HD space, and devise metrics for measuring the fidelity of the simplified dataset to the original one. Finally, examples and applications are presented to validate the efficiency and effectiveness of the proposed methods, showing that the developed HD dataset simplification method makes it feasible to process a dataset of maximal size with a limited data processing capability.
References
Albattah, W.: The role of sampling in big data analysis. In: Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–5 (2016)
Aoyama, H.: A study of the stratified random sampling. Ann. Inst. Stat. Math. 6(1), 1–36 (1954)
Barber, C.B., Dobkin, D.P., Huhdanpaa, H.: The quickhull algorithm for convex hulls. ACM Trans. Math. Softw. 22(4), 469–483 (1996)
Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15(6), 1373–1396 (2003)
Belkin, M., Sun, J., Wang, Y.: Constructing Laplace operator from point clouds in \(\mathbb{R}^d\). In: Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms, pp. 1031–1040. SIAM (2009)
Bengio, Y., Paiement, J.F., Vincent, P., Delalleau, O., Roux, N.L., Ouimet, M.: Out-of-sample extensions for LLE, isomap, MDS, eigenmaps, and spectral clustering. In: Advances in Neural Information Processing Systems, pp. 177–184 (2004)
Benko, K., Kothe, M., Semmler, K.D., Simon, U.: Eigenvalues of the Laplacian and curvature. In: Colloquium Mathematicum, vol. 42, pp. 19–31. Institute of Mathematics Polish Academy of Sciences (1979)
Carr, H., Duke, D.J.: Joint contour nets. IEEE Trans. Vis. Comput. Graph. 20(8), 1100–1113 (2014)
Carr, H., Snoeyink, J., Van De Panne, M.: Simplifying flexible isosurfaces using local geometric measures. In: IEEE Visualization 2004, pp. 497–504 (2004)
Castellani, U., Cristani, M., Fantoni, S., Murino, V.: Sparse points matching by combining 3d mesh saliency with statistical descriptors. Comput. Graph. Forum 27(2), 643–652 (2008)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res 16, 321–357 (2002)
Dong, S., Bremer, P.T., Garland, M., Pascucci, V., Hart, J.C.: Spectral surface quadrangulation. ACM Trans. Graph. 25(3), 1057–1066 (2006)
Dubuisson, M.P., Jain, A.K.: A modified Hausdorff distance for object matching. In: Proceedings of 12th International Conference on Pattern Recognition, vol. 1, pp. 566–568 (1994)
Edelsbrunner, H., Harer, J., Patel, A.K.: Reeb spaces of piecewise linear mappings. In: Proceedings of the Twenty-Fourth Annual Symposium on Computational Geometry, pp. 242–250 (2008)
Gyulassy, A., Bremer, P.T., Hamann, B., Pascucci, V.: A practical approach to Morse-Smale complex computation: scalability and generality. IEEE Trans. Vis. Comput. Graph. 14(6), 1619–1626 (2008)
Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Trans. Pattern Anal. Mach. Intell. 21(5), 433–449 (1999)
Körtgen, M., Park, G.J., Novotni, M., Klein, R.: 3d shape matching with 3d shape contexts. In: The 7th Central European Seminar on Computer Graphics, vol. 3, pp. 5–17. Budmerice (2003)
Liu, Z., Zhang, A.: Sampling for big data profiling: a survey. In: IEEE Access, p. 1 (2020)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)
Luo, P., Wang, X., Tang, X.: Hierarchical face parsing via deep learning. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2480–2487 (2012)
Mahmud, M., Huang, J., Salloum, S., Emara, T., Sadatdiynov, K.: A survey of data partitioning and sampling methods to support big data analysis. Big Data Min. Anal. 3, 85–101 (2020)
Nassiuma, D.K.: Survey Sampling: Theory and Methods. Nairobi University Press, Nairobi (2001)
Ovsjanikov, M., Mérigot, Q., Mémoli, F., Guibas, L.: One point isometric matching with the heat kernel. Comput. Graph. Forum 29(5), 1555–1564 (2010)
Peng, X., Feris, R.S., Wang, X., Metaxas, D.N.: RED-Net: a recurrent encoder-decoder network for video-based face alignment. Int. J. Comput. Vision 126(10), 1103–1119 (2018)
Pérez-Cruz, F.: Kullback-Leibler divergence estimation of continuous distributions. In: 2008 IEEE International Symposium on Information Theory, pp. 1666–1670 (2008)
Prechelt, L.: Automatic early stopping using cross validation: quantifying the criteria. Neural Netw. 11(4), 761–767 (1998)
Reeb, G.: Sur les points singuliers d'une forme de Pfaff complètement intégrable ou d'une fonction numérique. Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences 222, 847–849 (1946)
Reuter, M., Wolter, F.E., Peinecke, N.: Laplace-Beltrami spectra as ‘shape-DNA’ of surfaces and solids. Comput.-Aided Des. 38(4), 342–366 (2006)
Singh, G., Mémoli, F., Carlsson, G.E.: Topological methods for the analysis of high dimensional data sets and 3d object recognition. In: SPBG, pp. 91–100 (2007)
Sun, J., Ovsjanikov, M., Guibas, L.: A concise and provably informative multi-scale signature based on heat diffusion. Comput. Graph. Forum 28(5), 1383–1392 (2009)
Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3476–3483 (2013)
Tomek, I.: Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 6, 769–772 (1976)
Vallet, B., Lévy, B.: Spectral geometry processing with manifold harmonics. Comput. Graph. Forum 27(2), 251–260 (2008)
Wang, N., Gao, X., Tao, D., Yang, H., Li, X.: Facial feature point detection: a comprehensive survey. Neurocomputing 275, 50–65 (2018)
Wu, X., Zhu, X., Wu, G.Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2013)
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grant Nos. 61872316, 61932018, and the National Key R&D Plan of China under Grant No. 2020YFB1708900.
Appendices
Appendix 1. The Detailed Calculation of Metrics
KL-divergence-Based Metric. KL-divergence, also called relative entropy, is typically employed to measure the difference between two discrete probability distributions \(P = \{p_{i}\}\) and \(Q = \{q_{i}\}\), i.e., \( D_{KL}(P\Vert Q) = \sum _{i} p_{i} \log \left( \frac{p_{i}}{q_{i}}\right) .\) To define the KL-divergence-based metric, the original dataset \(\mathcal {H}\) and the simplified dataset \(\mathcal {S}\) are represented as matrices \(M_{\mathcal {H}} \in \mathbb {R}^{n \times d}\) and \(M_{\mathcal {S}}\in \mathbb {R}^{m \times d}\), where n and m \((n > m)\) are the numbers of data points in \(\mathcal {H}\) and \(\mathcal {S}\), respectively, and each row of the matrices stores a data point in \(\mathbb {R}^d\). First, we calculate the probability distribution \(P^{(k)} = \{p^{(k)}_i, i = 1,2,\cdots ,l\}\) for the \(k^{th}\) dimension of the data points in \(\mathcal {H}\), corresponding to the \(k^{th}\) column of matrix \(M_{\mathcal {H}}\). To do this, the \(k^{th}\) column elements of matrix \(M_{\mathcal {H}}\) are normalized into the interval [0, 1], which is divided into l subintervals (in our implementation, we set \(l=100\)). By counting the number of elements of the \(k^{th}\) column of matrix \(M_{\mathcal {H}}\) whose values lie in each of the l subintervals, the value distribution histogram is generated, which is then used as the probability distribution \(P^{(k)} = \{p^{(k)}_i, i=1,2,\cdots ,l\}\) of the \(k^{th}\) column of matrix \(M_{\mathcal {H}}\). Similarly, the probability distribution \(Q^{(k)} = \{q^{(k)}_i, i=1,2,\cdots ,l\}\) of the \(k^{th}\) column of matrix \(M_{\mathcal {S}}\) can be calculated. Then, the KL-divergence between the \(k^{th}\) columns of \(M_{\mathcal {H}}\) and \(M_{\mathcal {S}}\) is \(D_{KL}(P^{(k)}\Vert Q^{(k)}) = \sum _{i} p^{(k)}_{i} \log \left( \frac{p^{(k)}_{i}}{q^{(k)}_{i}}\right) \), and the KL-divergence-based metric is defined as
The KL-divergence-based metric (6) transforms the HD dataset into its probability distribution, and measures the difference between the probability distributions of the original and simplified datasets. The closer the simplified dataset \(\mathcal {S}\) is to the original dataset \(\mathcal {H}\), the smaller the KL-divergence-based metric \(d_{KL}\) becomes; when \(\mathcal {S} = \mathcal {H}\), \(d_{KL} = 0\).
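A minimal Python sketch of this histogram-based computation is given below. The \(\varepsilon\)-smoothing of empty histogram bins and the averaging of the per-dimension divergences over the d columns are assumptions of the sketch rather than details taken from Eq. (6).

```python
import numpy as np

def kl_metric(M_H, M_S, l=100, eps=1e-12):
    """Histogram-based KL-divergence metric between the original dataset
    M_H (n x d) and the simplified dataset M_S (m x d)."""
    d = M_H.shape[1]
    total = 0.0
    for k in range(d):
        # Normalize the k-th column of the original data to [0, 1] and apply
        # the same scaling to the simplified data.
        lo, hi = M_H[:, k].min(), M_H[:, k].max()
        span = max(hi - lo, eps)
        h = (M_H[:, k] - lo) / span
        s = np.clip((M_S[:, k] - lo) / span, 0.0, 1.0)
        # Value-distribution histograms over l subintervals of [0, 1].
        p, _ = np.histogram(h, bins=l, range=(0.0, 1.0))
        q, _ = np.histogram(s, bins=l, range=(0.0, 1.0))
        p = p / p.sum()
        q = q / q.sum()
        # eps avoids division by empty bins (an implementation choice not
        # specified in the paper).
        total += np.sum(p * np.log((p + eps) / (q + eps)))
    # Averaging over the d dimensions is an assumption of this sketch.
    return total / d
```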
Hausdorff Distance. The Hausdorff distance measures how far apart two sets are. Suppose \(\mathcal {X}\) and \(\mathcal {Y}\) are two datasets in d-dimensional space, and denote by \(D(\mathcal {X},\mathcal {Y}) = \sup _{x\in \mathcal {X}}\inf _{y\in \mathcal {Y}} d(x,y)\) the maximum distance from an arbitrary point in set \(\mathcal {X}\) to set \(\mathcal {Y}\), where d(x, y) is the Euclidean distance between two points x and y. The Hausdorff distance between \(\mathcal {H}\) and \(\mathcal {S}\) is defined as \( d_{H}(\mathcal {H},\mathcal {S}) = \max \{D(\mathcal {H},\mathcal {S}),D(\mathcal {S},\mathcal {H})\}\). Note that the simplified dataset \(\mathcal {S}\) is a subset of the original dataset \(\mathcal {H}\), so \(D(\mathcal {S},\mathcal {H}) = 0\). Therefore, the Hausdorff distance between \(\mathcal {H}\) and \(\mathcal {S}\) can be calculated by
$$ d_{H}(\mathcal {H},\mathcal {S}) = D(\mathcal {H},\mathcal {S}) = \sup _{x\in \mathcal {H}}\inf _{y\in \mathcal {S}} d(x,y). $$
That is, the Hausdorff distance is the maximum distance from an arbitrary point in the original dataset \(\mathcal {H}\) to the simplified dataset \(\mathcal {S}\).
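A compact way to evaluate this one-sided distance is to query a k-d tree built on the simplified set; the sketch below assumes the datasets are given as NumPy arrays.

```python
import numpy as np
from scipy.spatial import cKDTree

def hausdorff_H_to_S(H, S):
    """One-sided Hausdorff distance D(H, S): the maximum, over x in H, of the
    distance from x to its nearest neighbour in S. Because S is a subset of H,
    this equals the full Hausdorff distance d_H(H, S)."""
    tree = cKDTree(S)              # nearest-neighbour structure on the simplified set
    dists, _ = tree.query(H, k=1)  # distance from every point of H to S
    return float(dists.max())
```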
Volume of Convex Hull. As the third metric, we use the difference of the volumes of the convex hulls to measure the fidelity of \(\mathcal {S}\) to \(\mathcal {H}\), i.e., \(d_{V} = V(\mathcal {H}_{conv}) - V(\mathcal {S}_{conv})\), where \(\mathcal {H}_{conv}\) and \(\mathcal {S}_{conv}\) are the convex hulls of \(\mathcal {H}\) and \(\mathcal {S}\), and \(V(\mathcal {H}_{conv})\) and \(V(\mathcal {S}_{conv})\) are their volumes. Because the volume of a convex hull in an HD space is difficult to calculate, we first perform the DR operation [4] to project the datasets into 3D space. Afterwards, we employ the quickhull algorithm [3] to construct the convex hulls of the two datasets and compute their volumes.
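The sketch below follows this recipe, with PCA substituted for the Laplacian-eigenmaps DR step of [4] purely to keep the example short; SciPy's ConvexHull is backed by Qhull, an implementation of the quickhull algorithm [3].

```python
import numpy as np
from sklearn.decomposition import PCA   # substituted here for the DR step of [4]
from scipy.spatial import ConvexHull    # Qhull, an implementation of quickhull [3]

def convex_hull_volumes(H, S, dim=3):
    """Project both datasets to `dim` dimensions with a common map fitted on the
    original data, then return the volumes of their convex hulls."""
    pca = PCA(n_components=dim).fit(H)
    V_H = ConvexHull(pca.transform(H)).volume
    V_S = ConvexHull(pca.transform(S)).volume
    return V_H, V_S
```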
Appendix 2. The Simplification Procedures on the MNIST and Human Face Datasets
Fig. 6. Simplification of the MNIST dataset. (a–d): The simplified datasets obtained by detecting the feature points of the first 5, 10, 20, 30 LBO eigenvectors, consisting of 82, 120, 196, 280 data points, respectively. (e, f): The DR result (using t-SNE) of the original dataset, illustrated by colored digits (e) and the original images (f). (Color figure online)
MNIST: For clarity of the visual analysis, we selected 900 gray-scale images of the digits ‘0’ to ‘4’ from the MNIST dataset and resized each image to \(8 \times 8\). The selected images were projected onto the 2D plane using t-SNE. In Fig. 6(e), colored digits are placed at the projected 2D points, whereas in Fig. 6(f), the images themselves are located at the projected points. The simplified datasets are displayed in Fig. 6(a)–6(d); for illustration, their positions in the 2D plane are taken from the dimensionality-reduced dataset (Fig. 6(e)). In the simplified dataset in Fig. 6(a), which has 82 data points (approximately \(9\%\) of the points of the original dataset), all five categories of digits are represented. In the simplified dataset in Fig. 6(c), which has 196 data points (approximately \(22\%\) of the points of the original dataset), virtually all of the key features of the dataset are represented, including the subset of digit ‘2’ inside the class of digit ‘1’. In Fig. 6(d), the simplified dataset contains 280 data points (approximately \(31\%\) of the points of the original dataset). We can observe that the boundary of each class of data points has taken shape (Fig. 6(d)), and the geometric shape of each class is similar to that of the corresponding class in the original dataset (Fig. 6(e)).
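The visualization step of this experiment (not the LBO-based simplification itself) can be reproduced roughly as follows; for a self-contained sketch, scikit-learn's built-in \(8\times 8\) digits dataset is used as a stand-in for the resized MNIST images.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Stand-in data: 8x8 gray-scale digit images, restricted to the classes '0'-'4'.
digits = load_digits()
mask = digits.target <= 4
X, y = digits.data[mask], digits.target[mask]

# Keep roughly 900 images, as in the experiment above.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=min(900, len(X)), replace=False)
X, y = X[idx], y[idx]

# Project onto the 2D plane with t-SNE and plot the points colored by digit class.
emb = TSNE(n_components=2, random_state=0).fit_transform(X)
for c in range(5):
    pts = emb[y == c]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, label=str(c))
plt.legend(title="digit")
plt.show()
```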
Fig. 7. The simplified datasets of the human face dataset, after DR by PCA. Each simplified dataset comprises the images marked with red frames. (a–e) The simplified datasets obtained by detecting the feature points of the first 5, 10, 20, 30, 50 LBO eigenvectors, consisting of 16, 40, 95, 150, 270 data points, respectively. (f) The feature points of the first 5 eigenvectors. (Color figure online)
Human Face: The human face dataset contains 698 gray-scale images (\(64 \times 64\)) of a plaster head sculpture. The images were captured by varying three parameters: the up-down pose, the left-right pose, and the lighting direction. Therefore, the images in the dataset can be regarded as lying on a 3-dimensional manifold. As illustrated in Fig. 7, after the human face dataset is reduced to the 2D plane using PCA, it can be observed that the longitudinal axis represents the sculpture’s rotation in the left-right direction, and the horizontal axis indicates the change in lighting.
In Fig. 7, the images marked with red frames constitute the simplified dataset. In the simplified dataset in Fig. 7(a), the critical points at the boundary are captured. In Figs. 7(b)–7(d), increasingly more critical points at the boundary and in the interior are added to the simplified datasets. In Fig. 7(e), virtually all of the images at the boundary have been added to the simplified set. For clarity, we display the feature points of the first five LBO eigenvectors in Fig. 7(f); they indeed capture several ‘extreme’ images in the human face dataset. Specifically, the two feature points of the first eigenvector, \(\phi _1\), are the images with ‘leftmost pose + most lighting’ and ‘rightmost pose + most lighting’. In the feature point set of the second eigenvector, \(\phi _2\), an image with ‘leftmost pose + least lighting’ is added. Similarly, the feature point sets of the eigenvectors \(\phi _3, \phi _4\), and \(\phi _5\) contain more ‘extreme’ images in terms of pose and lighting (Fig. 7(f)).
Appendix 3. Additional Comparisons
To compare the two methods in more detail, the simplified datasets produced by the SRS method with the best performance among the 200 sampling runs are visualized in Fig. 8, after DR by PCA. Compared with the simplified datasets generated by the SRS method (Fig. 8), those generated by our method (Fig. 7) contain more data points on the boundaries. Since the boundary points of a dataset are generally more informative than the interior points, the simplified datasets generated by our method retain more important information than those produced by the SRS method.
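For context, the SRS baseline can be sketched as follows: draw 200 simple random samples of the same size as the simplified dataset and keep the best one. The original text does not state which fidelity metric ranks the samples, so the one-sided Hausdorff distance from Appendix 1 is used here purely as an illustrative choice.

```python
import numpy as np
from scipy.spatial.distance import cdist

def best_srs_sample(H, m, trials=200, seed=0):
    """Draw `trials` simple random samples of size m from H and return the one
    with the smallest one-sided Hausdorff distance D(H, S)."""
    rng = np.random.default_rng(seed)
    best_idx, best_d = None, np.inf
    for _ in range(trials):
        idx = rng.choice(len(H), size=m, replace=False)
        d = cdist(H, H[idx]).min(axis=1).max()   # D(H, S) for this sample
        if d < best_d:
            best_idx, best_d = idx, d
    return H[best_idx]
```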
Compared with other existing statistical sampling methods, the proposed method can be applied to a broader range of dataset types. Some sampling methods, such as stratified sampling [2], ‘Tomek links’ [32], and ‘SMOTE’ [11], require data classification labels as supplementary information; when these labels are missing, such methods cannot be applied. Compared with topological dataset simplification methods, such as Reeb spaces [14], joint contour nets [8], and Mapper [29], our method also has clear advantages. First, the size of the simplified dataset is easier to control with our method than with the topological methods. More importantly, in some cases the topological methods may generate new data points whose validity cannot be guaranteed. For example, a simplified human face dataset obtained by a topological method cannot guarantee that each data point is a valid photo of the plaster sculpture. Therefore, topological simplification methods are suitable for extracting the topological structure of a given dataset, but unsuitable for generating simplified datasets of a specified size.
Appendix 4. Additional Applications
Speedup of DR: DR algorithms for an HD dataset are typically computationally expensive. Given an original dataset \(\mathcal {H}\), if we first perform DR on the simplified dataset \(\mathcal {S}\) and then employ the ‘out-of-sample’ algorithm [6] to calculate the DR coordinates of the remaining data points, the computational cost can be reduced significantly. Because the simplified dataset \(\mathcal {S}\) captures the feature points of the original dataset, the result generated by DR on the simplified set + ‘out-of-sample’ is similar to that obtained by DR on the original set. In Table 2, the running times of the methods ‘DR on simplified set + out-of-sample’ and ‘DR on original set’ are listed; the running time is reduced by roughly one order of magnitude.
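A hedged sketch of this speedup is shown below: the DR model is fitted only on the feature-point subset and then extended to the remaining points. Here scikit-learn's Isomap and its transform() method stand in for the DR method and for the out-of-sample algorithm of [6], and `feature_idx` denotes the indices of the LBO feature points, assumed to be precomputed.

```python
import numpy as np
from sklearn.manifold import Isomap

def dr_via_simplified_set(H, feature_idx, n_components=2):
    """Fit the DR model on the simplified dataset S only, then extend the
    embedding to every point of the original dataset H."""
    S = H[feature_idx]                                  # simplified dataset
    iso = Isomap(n_components=n_components).fit(S)      # DR on the small set
    return iso.transform(H)                             # out-of-sample extension
```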
Training Data Simplification: As supervised learning relies on ever larger training datasets, the cost of manual annotation has become increasingly expensive. Moreover, repeated or ‘bad’ samples can find their way into the training dataset, adding confusion and complexity to the training data and hindering the convergence of training. Therefore, in supervised and active learning, simplifying the training dataset by retaining the feature points and deleting the repeated or ‘bad’ samples can lighten the burden of manual annotation, reduce the number of training epochs, and save storage space.
First, we studied the relationship between the simplification rate and the test accuracy on MNIST and CIFAR-10. In Fig. 9(a), the simplification rate \(r_i\) vs. test accuracy is plotted. For MNIST, when \(r_i = 0.177\), the test accuracy reaches \(97.4\%\), which is close to the test accuracy of \(98.98\%\) obtained with the original training data. For CIFAR-10, when \(r_i = 0.584\), the test accuracy is \(78.2\%\), whereas with the original training images the test accuracy is \(80.2\%\). Therefore, with the proposed simplification method, the simplified dataset is sufficient for training in these two image classification tasks.
Furthermore, with the simplified training set, the training process can terminate earlier than with the original training set, thus reducing the number of epochs. In Fig. 9(b), the number of epochs vs. test accuracy is plotted. The ‘early stopping’ criterion proposed in [26] was used to determine the stopping time (indicated by arrows). For MNIST, the training process with the original training set stopped at 12000 epochs, with a test accuracy of \(99.07\%\); with the simplified training set, the training process stopped as early as 6000 epochs, with a test accuracy of \(95.25\%\). For CIFAR-10, the training process stopped at 8000 epochs with the simplified training set, whereas with the original training set it stopped at 11500 epochs. Therefore, with the simplified training dataset, the number of training epochs can be reduced significantly.
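For reference, one of the criteria quantified in [26] is the generalization loss \(GL_{\alpha}\): stop once the current validation error exceeds the lowest validation error observed so far by more than \(\alpha\) percent. The sketch below implements this criterion; which specific criterion and threshold were used in the experiments is not stated above, so the choice of \(GL_{\alpha}\) with \(\alpha = 5\) is only illustrative.

```python
def should_stop(val_errors, alpha=5.0):
    """Generalization-loss criterion GL_alpha: stop when the current validation
    error is more than alpha percent above the best validation error so far."""
    e_opt = min(val_errors)                       # lowest validation error so far
    gl = 100.0 * (val_errors[-1] / e_opt - 1.0)   # generalization loss in percent
    return gl > alpha
```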
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Xu, C., Lin, H. (2021). High-Dimensional Dataset Simplification by Laplace-Beltrami Operator. In: Magnenat-Thalmann, N., et al. Advances in Computer Graphics. CGI 2021. Lecture Notes in Computer Science, vol. 13002. Springer, Cham. https://doi.org/10.1007/978-3-030-89029-2_43