
Pattern Recognition

Volume 46, Issue 6, June 2013, Pages 1638-1647

A global structure-based algorithm for detecting the principal graph from complex data

https://doi.org/10.1016/j.patcog.2012.11.015

Abstract

Principal curves, arising as an essential construct in dimensionality reduction and pattern recognition, have recently attracted much attention from both theoretical and practical perspectives. Existing methods usually employ the first principal component of the data as an initial estimate of principal curves. However, they may be ineffective when dealing with complex data exhibiting self-intersections, high curvature, and significant dispersion. In this paper, a new method based on global structure is proposed to detect the principal graph, a set of principal curves, from complex data. First, the global structure of the data, called an initial principal graph, is extracted using a thinning technique, which captures the approximate topological features of the complex data. A vertex-merge step and an improved fitting-and-smoothing phase, tailored to the characteristics of the data, are then proposed to control the deviation of the principal graph and to improve the process of optimizing it. Finally, the restructuring step introduced by Kégl is used to rectify imperfections of the principal graph. Using synthetic and real-world data sets, the proposed method is compared with other existing algorithms. Experimental results show the effectiveness of the global structure-based method.

Highlights

• Improve the quality of the initial estimate of complex data with the global structure.
• A vertex-merge step is presented to improve the algorithm's efficiency.
• The projection strategy is improved in the projection step.
• We redefine the objective function used in the fitting-and-smoothing phase.
• An effective method is proposed to detect the principal graph from complex data.

Introduction

Principal component analysis [1] is a well-known technique in multivariate analysis, used in dimensionality reduction, feature extraction, and image coding and enhancement. As a nonlinear generalization of principal component analysis, principal curves are defined as one-dimensional (1D) curves that pass through the "middle" of a set of p-dimensional data points, providing smooth and curvilinear summaries of p-dimensional data. These curves satisfy the self-consistency property, i.e., a point on the curve is the average of all data points that project onto it. Principal curves have received significant attention since Hastie and Stuetzle (hereafter HS) introduced the notion to address problems in traditional machine learning and multivariate data analysis [2]. Considerable work has been reported on applications of principal curves, such as high-dimensional data partitioning [3], shape detection [4], [5], image skeletonization [6], [7], speech recognition [8], noise-robustness improvement of time warping methods [9], feature extraction and bill recognition [10], [11], [12], intelligent transportation analysis [13], and regression analysis [14].

HS first proposed the concept of principal curves and developed an algorithm for constructing them. The HS principal curve algorithm (HSPC) finds principal curves by iterating between projecting the data onto the curve and estimating conditional expectations at the projections with a scatterplot smoother or a spline smoother [2]. Building on the HS algorithm, many researchers have offered improvements to the theory as well as algorithmic developments. To address the model bias of the HSPC algorithm, Tibshirani introduced a semi-parametric principal curve model (hereafter TPC), in which an EM algorithm is used to estimate principal curves [15]. In 2000, Kégl et al. defined principal curves as polygonal curves with $k$ segments and length $L$ to solve the convergence problem of the HSPC algorithm (KPC). The KPC algorithm seeks principal curves by starting with the shortest segment of the first principal component line $f_{1,n}$ that contains all of the projected data points; in each iteration, it increases the number of segments by one by adding a new vertex to the polygonal curve $f_{k,n}$ produced in the previous iteration. After a new vertex is added, the positions of all vertices are updated so that the value of a penalized distance function is minimized [16], [22]. In 2001, Delicado defined principal curves as curves of principal oriented points to correct bias (DPC). The DPC algorithm finds the principal oriented points one by one and links them in order to estimate principal curves [17], [18]. Verbeek et al. defined K-segment principal curves (VPC). The VPC algorithm estimates principal curves by incrementally combining local line segments into a polygonal line, pursuing an objective similar to Tibshirani's [19]. More recently, Einbeck et al. introduced local principal curves, based on a localization of principal component analysis (hereafter LPC). The LPC algorithm generates principal curves by connecting a series of local centers of mass of the data using interpolation or splines, in a manner similar to Delicado's algorithm [20]. In 2010, Zhang et al. proposed Riemannian principal curves to address the problem of non-constant data distributions (hereafter RPC). The RPC algorithm constructs principal curves by revisiting the projection of the samples onto the curve and incorporating Riemannian distances to reflect the middle of the data distribution [21]. In 2011, Ozertem and Erdogmus introduced principal curves and surfaces from a new point of view, expressing them in terms of the gradient and the Hessian of the probability density (hereafter OEPC). The OEPC algorithm generates principal curves and surfaces by using subspace constrained mean shift (SCMS) based on kernel density estimation and Gaussian mixture models [23]. Also in 2011, Gérard et al. studied parameter selection for principal curves: they considered the principal curve problem from an empirical risk minimization perspective and addressed parameter selection from the point of view of model selection via penalization [24].
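To make the HSPC-style alternation concrete, below is a minimal illustrative sketch, not the authors' implementation: it assigns points to the nearest polyline vertex (a coarser stand-in for the true projection onto curve segments) and uses a simple moving-average pass in place of the scatterplot or spline smoother. The 15 initial vertices placed along the first principal component are an arbitrary choice.

```python
import numpy as np

def hs_style_iteration(X, vertices, n_iter=20):
    """Simplified Hastie-Stuetzle-style loop: alternate projecting points
    onto the curve and averaging the points that project onto each part.
    Here projection is approximated by nearest-vertex assignment."""
    V = vertices.copy()
    for _ in range(n_iter):
        # projection step: index of the nearest curve vertex for each point
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        # expectation step: each vertex moves to the mean of its points
        for j in range(len(V)):
            pts = X[idx == j]
            if len(pts) > 0:
                V[j] = pts.mean(axis=0)
        # crude smoothing pass so the polyline stays a smooth summary
        V[1:-1] = 0.5 * V[1:-1] + 0.25 * (V[:-2] + V[2:])
    return V

# toy usage: noisy half-circle, initialized from the first principal component
rng = np.random.default_rng(0)
t = rng.uniform(0, np.pi, 500)
X = np.c_[np.cos(t), np.sin(t)] + rng.normal(scale=0.05, size=(500, 2))
pc1 = np.linalg.svd(X - X.mean(0), full_matrices=False)[2][0]
init = X.mean(0) + np.linspace(-1, 1, 15)[:, None] * pc1
curve = hs_style_iteration(X, init)
```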

Despite the progress reported in the development of principal curve algorithms, some issues remain unresolved. For instance, in existing principal curve algorithms the first principal component is often used as the initial estimate of the principal curve when prior knowledge is lacking. However, for complex data with high curvature, significant dispersion, and self-intersections, such as spiral-shaped data and fingerprint data, the first principal component cannot reflect the topological features of the data, so these algorithms may fail to achieve good results. Although some recent principal curve algorithms do not depend on initial estimates and can achieve good results on data with loops, self-intersections, and bifurcations, such as the OEPC algorithm proposed by Ozertem and Erdogmus in 2011 [23], a good initial estimate may help improve the performance of a principal curve algorithm, especially when processing the complex data mentioned above. Therefore, we draw on ideas from Granular Computing [25], [26], [27], [28], [29], [30] and propose a global principal graph method (referred to as GPG), which uses the global structure of the data to detect the principal graph, a set of principal curves, from complex data. Instead of starting with a simple topology such as the first principal component, GPG directly generates an initial principal graph that captures the approximate topological features of the complex data by using a thinning algorithm. However, this initial principal graph is not smooth and does not satisfy the self-consistency property. To remedy these drawbacks, we adopt the fitting-and-smoothing step introduced by Kégl [16] to optimize the principal graph by updating the positions of all vertices. During the optimization of the principal graph, we find that complex data may cause the principal graph to deviate and lower the efficiency of the algorithm. To address these problems, a vertex-merge step and an improved fitting-and-smoothing step are proposed. Finally, the restructuring step introduced by Kégl [16] is used to further rectify imperfections of the principal graph.
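The thinning-based initialization is detailed in Section 3. As a rough sketch of the idea only, the fragment below rasterizes a 2D point cloud onto a binary grid and thins it, using scikit-image's skeletonize as a stand-in thinning operator; the grid size of 128 and the single dilation pass are illustrative choices, not parameters from the paper.

```python
import numpy as np
from skimage.morphology import binary_dilation, skeletonize

def initial_graph_vertices(X, grid=128):
    """Sketch of a thinning-based initial estimate: rasterize the 2D
    point cloud onto a binary grid, thin it to a one-pixel-wide
    skeleton, and map the skeleton pixels (candidate vertices of the
    initial principal graph) back to data coordinates."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    img = np.zeros((grid, grid), dtype=bool)
    ij = np.clip(((X - lo) / (hi - lo) * (grid - 1)).astype(int), 0, grid - 1)
    img[ij[:, 1], ij[:, 0]] = True
    img = binary_dilation(img)        # bridge small gaps in a sparse cloud
    skel = skeletonize(img)           # thinning preserves the global topology
    r, c = np.nonzero(skel)
    return np.c_[c, r] / (grid - 1) * (hi - lo) + lo
```

Unlike a first-principal-component initialization, the skeleton of a spiral or a self-intersecting figure retains its loops and branch points, which is the property the GPG method exploits.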

The remainder of this paper is organized as follows. Section 2 gives a brief overview of the concept of principal curves. In Section 3, the global structure based principal curve algorithm is described in detail. Section 4 evaluates and analyzes the performance of the proposed algorithm on synthetic and real-world data sets. Finally, Section 5 provides some conclusions of this study.

Section snippets

Principal curves-some preliminaries

In this section, we review some basic concepts of principal curves. For more details, the reader is referred to [2].

Hastie and Stuetzle generalized the self-consistency property of principal components and introduced the notion of principal curves. Let $X$ denote a random vector in $\mathbb{R}^d$, and let $f(\lambda) = (f_1(\lambda), \ldots, f_d(\lambda))$ be a smooth curve in $\mathbb{R}^d$ parameterized by $\lambda \in \mathbb{R}$. For any $x \in \mathbb{R}^d$, let $\lambda_f(x)$ denote the largest parameter value $\lambda$ for which the distance between $x$ and $f(\lambda)$ is minimized. More formally, the
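The snippet above is cut off mid-sentence. For completeness, the standard formal statement from [2], which the truncated sentence introduces, reads as follows (reproduced from the textbook definition, not from the page's own continuation):

```latex
% Projection index: the largest parameter value among those achieving
% the minimum distance from x to the curve f.
\[
  \lambda_f(x) = \sup\{\lambda : \|x - f(\lambda)\| = \inf_{\tau} \|x - f(\tau)\|\}.
\]
% Self-consistency: f is a principal curve if every point on it is the
% conditional average of the observations projecting onto that point.
\[
  f(\lambda) = \mathbb{E}[X \mid \lambda_f(X) = \lambda].
\]
```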

The global principal graph algorithm

Assume that a set of data $X_n = \{x_1, \ldots, x_n\}$ is given. We determine the smooth principal graph that passes through the "middle" of this cloud of data. The principal graph is constructed following the strategy outlined below (a toy sketch of steps (2) and (3) follows the list):

  • (1) Initialize the data points to extract the global structure, called an initial principal graph $G_{V,S}^{0}$;

  • (2) Merge the adjacent vertices of $G_{V,S}^{0}$ to increase the ratio of the number of graph vertices $v_i$ to the number of data points $x_i$;

  • (3) Project the data points $X_n$ and partition them into "the nearest neighbor
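As referenced above, here is a toy sketch of steps (2) and (3) under simplifying assumptions: vertices are merged greedily within a hypothetical threshold `radius`, and points are partitioned by nearest vertex only (the paper's projection step also considers the graph's edges). None of the names below come from the paper.

```python
import numpy as np

def merge_close_vertices(V, radius):
    """Step (2) sketch: greedily replace each cluster of vertices lying
    within `radius` of a seed vertex by the cluster centroid."""
    merged = []
    used = np.zeros(len(V), dtype=bool)
    for i in range(len(V)):
        if used[i]:
            continue
        close = (np.linalg.norm(V - V[i], axis=1) < radius) & ~used
        merged.append(V[close].mean(axis=0))
        used |= close
    return np.asarray(merged)

def partition_by_nearest_vertex(X, V):
    """Step (3) sketch: assign each data point to its nearest graph
    vertex, yielding the nearest-neighbor sets that the subsequent
    fitting-and-smoothing phase averages over."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)
    return {j: X[idx == j] for j in range(len(V))}
```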

Experimental results and analysis

In this section, we first discuss the parameter settings of the GPG algorithm. Synthetic data sets and real images are then used to demonstrate the performance of the GPG algorithm and to contrast it with the results produced by some traditional methods. Finally, the performance of the GPG algorithm and the impact of noise on the performance of the algorithms are investigated.

Conclusions

Principal curves are nonlinear generalizations of principal components. Since the notion of principal curves was put forward, many methods for finding principal curves from data sets have been proposed. However, for complex data with self-intersections, high curvature, and dispersion, those existing methods may not perform well. Building on existing work on constructing principal curves, this paper proposed a novel method based on global structure to solve this problem. The

Acknowledgment

This work is supported by the National Natural Science Foundation of China (grant Nos. 61075056, 60970061, 61103067, and 61175054).


References (33)

  • D.C. Stanford et al., Finding curvilinear features in spatial point patterns: principal curve clustering with noise, IEEE Transactions on Pattern Analysis and Machine Intelligence (2000).
  • E. Bas et al., Principal curves as skeletons of tubular objects: locally characterizing the structures of axons, Neuroinformatics (2011).
  • U. Ozertem et al., Principal curve time warping, IEEE Transactions on Signal Processing (2009).
  • H.Y. Zhang, The Research of Off-line Handwritten Character Recognition Based on Principal Curves, Ph.D. Thesis, Tongji...
  • H.Y. Zhang et al., Modified principal curves based fingerprint minutiae extraction and pseudo minutiae detection, International Journal of Pattern Recognition and Artificial Intelligence (2011).
  • H.Y. Zhang et al., Analysis and extraction of structural features of off-line handwritten digits based on principal curves, Journal of Computer Research and Development (2005).

Hongyun Zhang is a Ph.D. holder and a lecturer in the Department of Computer Science and Technology at Tongji University, Shanghai, China. She is currently a visiting scholar in the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada. Her research interests include principal curves, data mining, pattern recognition, and granular computing.

Witold Pedrycz is a professor and Canada Research Chair in the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada. He is also with the Systems Research Institute of the Polish Academy of Sciences. He is actively pursuing research in computational intelligence, fuzzy modeling, pattern recognition, knowledge discovery, neural networks, granular computing, and software engineering. He has published extensively in these areas and is the author of 11 research monographs and numerous papers in highly reputable journals. Dr. Pedrycz has been a member of numerous program committees of international conferences in the areas of computational intelligence, granular computing, fuzzy sets, and neurocomputing. He currently serves as an Associate Editor of IEEE Transactions on Fuzzy Systems and IEEE Transactions on Neural Networks, and is on the editorial boards of over 10 international journals. Dr. Pedrycz is also Editor-in-Chief of Information Sciences and of IEEE Transactions on Systems, Man, and Cybernetics, Part A. He is the past president of IFSA and NAFIPS, and a Fellow of the IEEE.

Duoqian Miao is a professor in the Department of Computer Science and Technology at Tongji University, Shanghai, China. He has published more than 60 papers in international proceedings and journals. His research interests include soft computing, rough sets, pattern recognition, data mining, machine learning, and granular computing.

Caiming Zhong is currently pursuing his Ph.D. in Computer Science at Tongji University, Shanghai, China. His research interests include cluster analysis, manifold learning, and image segmentation.
