A repartition method improving visual quality for PCA image coding
Introduction
Numerous data analysis techniques, such as regression and principal component analysis (PCA), have high time or space complexity and are thus impractical for large datasets [6], [25]. Therefore, instead of applying such techniques directly to the entire dataset, researchers adopt cluster analysis and apply these techniques to each cluster, which consists of only a portion of the original data. Depending on the type of cluster analysis, the number of clusters, and the accuracy with which the clusters represent the data, the results can be comparable with those that would have been obtained by using all data. Cluster analysis techniques have recently been applied to microarray data, image analysis, and marketing science [13], [26].
Cluster analysis [11] is a core issue in data mining with innumerable applications spanning many fields. To identify clusters in a dataset mathematically, it is usually necessary to first define a measure of similarity or proximity, which establishes a rule for assigning patterns to a particular cluster. The measure of similarity is usually data dependent. Clustering aims to optimize a cost function that is defined over all possible groupings. Moreover, the cost function depends on the manner in which the data are decomposed and has little meaning for any single item [20]. In this technique, the collected information is divided into various clusters to show the system behavior patterns effectively. In other words, patterns in the same group are similar in some sense and patterns in different groups are dissimilar in the same sense [4], [5]. In terms of analysis of variance (ANOVA), the within-group variance is low and the between-group variance is high. Here, “variance” means the sample variance among all possible linear combinations of observations [8]. We will apply this property to the proposed method, in which PCA is employed as the data analysis technique for image coding. In this study, we adopt the K-means algorithm [10], [24], [30] proposed by MacQueen (1967) to minimize the sum of the distances from each data point to its cluster center. The K-means algorithm is a popular clustering method because of its capability to group huge datasets efficiently.
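The K-means procedure mentioned above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: it assumes the usual squared Euclidean distance, a random choice of initial centers, and the hypothetical function name `kmeans`.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd-style K-means sketch: returns (centers, labels)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumed initialisation: k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its members.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Each iteration cannot increase the within-cluster sum of squared distances, which is why the loop can stop as soon as the centers no longer move.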
Data reduction techniques aim to represent data efficiently [14], [15], [28]. One example is the Karhunen–Loeve Transform (KLT), in which a higher dimensional input space is mapped to a lower dimensional feature space through a linear transformation [19]. As an alternative approach to feature extraction in the n-dimensional space, PCA finds the m (m < n) basis components such that the projection onto the corresponding subspace possesses the largest variations [27]. In a similar fashion, PCA computes the covariance matrix of the zero-mean input data. After solving for the eigenvalues of the covariance matrix, PCA extracts the eigenvectors corresponding to the largest eigenvalues [7], [16]. Dimension reduction is achieved by using the eigenvectors with the most significant eigenvalues, which form an orthogonal basis for a low dimensional subspace. Every vector in the original space can be approximated by a corresponding vector in the subspace [9], [22]. Dimensionality reduction is frequently used as a pre-processing step in data mining. Selecting a smaller number of features plays a significant role in applications involving hundreds or thousands of features. Besides relevant features, there might be derogatory features, indifferent features, and redundant (dependent) ones. Removal of these features not only makes the learning task easier by reducing the computational burden, but also often improves the performance of the classifier [4], [5]. Such data reduction is applied to images to achieve image compression. In this work, we use PCA separately for each cluster, which consists of some specified block images, to reconstruct the original (or input) image [29].
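The PCA steps described above (zero-mean data, covariance matrix, leading eigenvectors, approximation in the subspace) can be sketched in a few lines. The function names `pca_fit`, `pca_encode`, and `pca_decode` are hypothetical, and dividing the covariance by the number of samples is an assumed normalization that does not affect the eigenvectors.

```python
import numpy as np

def pca_fit(X, m):
    """Return (mean, basis); basis holds the m leading eigenvectors as columns."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                            # centre the data (zero mean)
    cov = Xc.T @ Xc / len(X)                 # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:m]    # indices of the m largest eigenvalues
    return mean, eigvecs[:, order]

def pca_encode(X, mean, basis):
    """Project each vector onto the subspace: m coefficients per vector."""
    return (np.asarray(X, dtype=float) - mean) @ basis

def pca_decode(Y, mean, basis):
    """Map coefficients back to an approximation in the original space."""
    return Y @ basis.T + mean
```

Storing the m coefficients per vector, together with the mean and the basis, in place of the n original values is what yields the compression.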
The genetic algorithm (GA) [17], [18], [21], originally developed by Holland over the course of the 1960s and 1970s, is based on a biological analogy. In the selective breeding of plants or animals, for example, offspring are produced as a combination of the parent chromosomes according to certain characteristics that are determined at the genetic level. When the fitness landscape (or cost surface) of the problem is unclear or riddled with a large number of local optima, the GA usually has good searching capability because the candidate solutions are less likely to become stuck at local optima [23]. The GA has been successfully applied to many fields of science and engineering [12]. In the proposed algorithm, we partition the dataset into numerous clusters, in which the numbers of principal components used by PCA can vary. In this work, we use GA as a framework with three phases, namely, GA operation, repartition clustering, and clustering PCA for image coding. In repartition clustering, the clustering and the number of principal components for each cluster are determined progressively.
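To make the GA operation phase more concrete, the following is a generic sketch of a simple GA over integer chromosomes. It is not the authors' algorithm: the encoding (one integer gene per cluster, standing for its number of principal components), tournament selection, one-point crossover, resampling mutation, elitism, and the name `run_ga` are all assumptions made for illustration. The caller supplies the fitness function, which could, for example, penalize chromosomes that exceed a storage budget.

```python
import numpy as np

def run_ga(fitness, n_genes, low, high, pop_size=20, n_gen=50,
           p_mut=0.1, seed=0):
    """Maximise `fitness` over integer vectors with entries in [low, high]."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(low, high + 1, size=(pop_size, n_genes))
    best_history = []
    for _ in range(n_gen):
        scores = np.array([fitness(c) for c in pop])
        best_history.append(float(scores.max()))
        children = [pop[scores.argmax()].copy()]   # elitism: best survives unchanged
        while len(children) < pop_size:
            # Tournament selection: the better of two random individuals, twice.
            i, j = rng.integers(0, pop_size, size=2)
            p1 = pop[i] if scores[i] >= scores[j] else pop[j]
            i, j = rng.integers(0, pop_size, size=2)
            p2 = pop[i] if scores[i] >= scores[j] else pop[j]
            # One-point crossover (skipped for single-gene chromosomes).
            cut = int(rng.integers(1, n_genes)) if n_genes > 1 else 0
            child = np.concatenate([p1[:cut], p2[cut:]])
            # Mutation: resample each gene with probability p_mut.
            mask = rng.random(n_genes) < p_mut
            child[mask] = rng.integers(low, high + 1, size=int(mask.sum()))
            children.append(child)
        pop = np.array(children)
    scores = np.array([fitness(c) for c in pop])
    return pop[scores.argmax()], best_history
```

Because the best individual is always copied into the next generation, the recorded best fitness never decreases for a deterministic fitness function; the sketch makes no claim about reaching a global optimum.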
Some GA-based clustering algorithms such as stochastic clustering algorithms based on GA, Simple GA (SGA), Hybrid Niching GA (HNGA), and multi-objective GA are mentioned in [1]. In the latter study, these methods are considered only able to find compact hyperspherical, equisized, and convex clusters like those detected by the K-means algorithm [2]. If clusters of different geometric shapes are present in the same dataset, the above methods will not be able to find all of them perfectly [3]. This paper provides a preliminary study in this direction. Here, we apply PCA to the whole dataset obtained from an image to achieve image compression. To improve the reconstructed image quality, we use K-means to partition the dataset, and then apply PCA to each cluster separately. In this method, different numbers of principal components are allowed, and GA is used to identify the optimal number of principal components for each cluster. Finally, we propose the repartition clustering method to improve the image quality and visual effect.
The proposed method can improve the homogeneity in each cluster by increasing the within-group correlation corresponding to PCA image coding. Under the condition that the total number of variables to store is roughly the same, the proposed algorithm removes redundant variables in clusters with simple structures and increases the number of principal components to improve the reconstructed quality of certain clusters with complex structures. Experimental results show that the proposed method can effectively increase image quality and improve the visual effect.
Section snippets
PCA image coding
PCA is a variable reduction procedure that is useful when the data obtained on a number of variables (possibly a large number) contain some redundancy. In this case, redundancy indicates that there could be features whose presence in the dataset does not affect the performance of a classifier at all. There could even be a set of correlated features, and selecting just a few of them might be sufficient for the classifier. This redundancy facilitates the reduction of
PCA image coding with clustering
In this section, we partition the dataset scanned from the original image in Section 2 into K clusters and apply PCA to each cluster separately. The block diagram of the clustering method is shown in Fig. 2. To obtain the optimal number of principal components, GA is introduced. After decoding each cluster, we can reconstruct the image by merging.
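The per-cluster coding and merging step can be sketched as below, under stated assumptions: each row of `X` is one data vector, `labels` assigns every row to a cluster in `0..K-1`, and `m_per_cluster` gives the number of principal components kept in each cluster (in the paper this choice is made by the GA; here it is simply passed in). The function names are hypothetical.

```python
import numpy as np

def clusterwise_pca_reconstruct(X, labels, m_per_cluster):
    """Encode and decode each cluster with its own PCA basis, then merge."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    X_hat = np.full_like(X, np.nan)          # rows never assigned stay NaN
    for k, m in enumerate(m_per_cluster):
        idx = np.flatnonzero(labels == k)
        if idx.size == 0:
            continue                         # nothing to code in an empty cluster
        Xk = X[idx]
        mean = Xk.mean(axis=0)
        Xc = Xk - mean
        cov = Xc.T @ Xc / len(Xk)
        eigvals, eigvecs = np.linalg.eigh(cov)
        basis = eigvecs[:, np.argsort(eigvals)[::-1][:m]]  # m leading eigenvectors
        coeffs = Xc @ basis                  # encoded cluster
        X_hat[idx] = coeffs @ basis.T + mean # decoded rows written back in place
    return X_hat

def mse(X, X_hat):
    """Mean squared reconstruction error over all entries."""
    return float(np.mean((np.asarray(X, dtype=float) - X_hat) ** 2))
```

Writing the decoded rows back at their original indices is what performs the merge: the reconstructed dataset keeps the row order of the input regardless of how the rows were clustered.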
Proposed PCA method with repartition clustering
The proposed method imposes a repartition mechanism on the PCA clustering method. For a given dataset S, the approach is to partition S into K groups by minimizing the within-group MSE in Eq. (3.4) under some pre-specified number of variables to record.
Our goal is to approximate each data point using a representation that involves a restricted number m of variables (the number of principal components), with m < n, corresponding to a projection onto a lower dimensional subspace. The m
Experimental results
In clustering methods, there is always the question of how many clusters to use. In most cases, the answer depends on the dataset itself and the choice is usually heuristic. In our case, the types of visual information in image blocks that are important to humans are smooth regions, horizontal/vertical edges, diagonal/subdiagonal edges, and texture. We therefore adopt K = 4 clusters as the main setting for the experiments.
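For readers who want to experiment, one common way to obtain image blocks as data vectors is to cut a grayscale image into non-overlapping b x b tiles and flatten each tile into a row. This is only an assumed preprocessing sketch: the block size, the raster ordering, and the helper names are not taken from the paper, and the image height and width are assumed to be multiples of b.

```python
import numpy as np

def image_to_blocks(img, b):
    """Split an (h, w) image into non-overlapping b x b blocks, one per row."""
    h, w = img.shape
    return img.reshape(h // b, b, w // b, b).swapaxes(1, 2).reshape(-1, b * b)

def blocks_to_image(blocks, h, w, b):
    """Inverse of image_to_blocks: merge the block rows back into an image."""
    return blocks.reshape(h // b, w // b, b, b).swapaxes(1, 2).reshape(h, w)
```

The rows produced by `image_to_blocks` can then be clustered and coded, and `blocks_to_image` merges the decoded blocks back into an image of the original size.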
We partition the training set into K clusters and apply PCA to each cluster using the proposed
Conclusions
This study formulated a clustering method embedded in a GA framework to improve the performance of clustering PCA image coding. For cluster analysis, we proposed a repartition clustering algorithm that partitions the image blocks into groups such that individuals of the same group are homogeneous and individuals of different groups are heterogeneous. Furthermore, the homogeneity within a group favors the PCA subspace projection mechanism in terms of preserving most of the information. Thus, the proposed method can
Acknowledgment
This work has been supported by the National Science Council of Taiwan under grant NSC 102-2221-E-214-048.
References (30)
- et al. GAPS: a clustering method using a new point symmetry based distance measure. Pattern Recognit. (2007)
- et al. Feature selection with SVD entropy: some modification and extension. Inf. Sci. (2014)
- Image compression using principal component neural networks. Image Vis. Comput. (2001)
- et al. A new algorithm for initial cluster centers in K-means algorithm. Pattern Recognit. Lett. (2011)
- et al. Design of optimal residuals from partial principal component models for fault diagnosis in linear system. J. Process Control (2005)
- et al. Recursive PCA for adaptive process monitoring. J. Process Control (2000)
- et al. Grouping genetic algorithms: an efficient method to solve the cell formation problem. Math. Comput. Simul. (2000)
- et al. Performance evaluation and dynamic node generation criteria for ‘principal component analysis’ neural networks. Math. Comput. Simul. (2000)
- et al. Fuzzy symmetry based real-coded genetic clustering technique for automatic pixel classification in remote sensing imagery. Fundam. Inform. (2008)
- et al. A point symmetry based clustering technique for automatic evolution of clusters. IEEE Trans. Knowl. Data Eng. (2008)