Discernible visualization of high dimensional data using label information
Introduction
Our three-dimensional perspective limits our intuition about higher-dimensional spaces. Nevertheless, interaction with such spaces is becoming increasingly unavoidable. Advances in science and technology have led to substantial growth in data production, beyond what any human can comprehend directly. Billions of Web pages, vast geographical information, large amounts of biological data, and numerous business databases are just small portions of the available data.
This has been the main motivation for obtaining geometric models (such as graphs in 2 or 3 variables) of the multivariate relationships that arise when analyzing large sets of high-dimensional data. Consequently, numerous “visualization” approaches have been proposed that try to find the best map from k dimensions to 2 or 3 dimensions, one that the human brain can discern effortlessly.
The unprecedented growth of data production and the limited capacity of the human brain have made data visualization an active subject in computer science in recent years. As Card et al. described, visualization is “the use of computer-supported, interactive, visual representations of abstract data to amplify cognition.” Visualization is considered one of the most intuitive methods for cluster detection and validation, and it performs especially well for representing irregularly shaped clusters [27], [32].
Other approaches to overcoming the problems of high dimensionality are dimension reduction [4], [20] and feature selection [24]. Data sampling and data summarization can also help to cope with large numbers of data records [17], [28]. Scientists interested in these fields face a similar problem in exploratory analysis or visualization of multivariate data.
Star Coordinate is a visualization technique for mapping k-dimensional data into Cartesian coordinates, in which the coordinate axes are arranged on a circle in a two-dimensional plane with the origin at the center of the circle. It has been proved that, in this mapping technique, a cluster is always preserved as a point-cloud in the visual space through linear mappings. The main problem arises when these mapped point-clouds overlap one another, making their boundaries indistinguishable. The user is therefore given the ability to push, pull, or rotate the axes until the desired outcome is achieved. However, a useful adjustment is difficult or even impossible for a human to achieve when visualizing high-dimensional data. As a result, some researchers have proposed various dimension reduction methods as pre-processing steps before applying the Star Coordinate visualization technique.
In this paper, we focus on the problem of automatic axes adjustment in the Star Coordinate technique for improved visualization results. Our goal is to find the best possible projection that represents the original data topology of k-dimensional data, especially where k is greater than 50, which effectively makes manual axes adjustment impossible. The rest of the paper is organized as follows. Section 1.1 presents a discussion of related work. The main features of the Star Coordinate algorithm are briefly discussed in Section 1.2. Then, the proposed method is introduced in Section 2. In Section 3, we present the experimental results that validate the cost model. Section 4 presents a discussion of the experimental results. Finally, Section 5 concludes the presented approach.
Numerous approaches have been proposed for the visualization of multi-dimensional datasets, including the scatterplot matrix [9], parallel coordinates [21], and dimensional stacking [31]. Parallel coordinates (PC) [21] is a well-known method in which features are represented by parallel vertical axes, each linearly scaled within its data range. Each sample is represented by a polygonal line that intersects each axis at its respective attribute value. Parallel coordinates can be used to study the correlations among attributes by spotting the locations of the intersection points [44]. They are also useful for detecting data distributions and functional dependencies. The main challenge of the parallel coordinates approach is the limited space available for each parallel axis. Several extensions of parallel coordinates exist, such as Circular Parallel Coordinates [19] and Hierarchical Parallel Coordinates [13].
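As a minimal sketch of the parallel coordinates construction described above, the following Python computes the polyline vertices for each sample. The function name `parallel_coordinates_polylines` and the layout parameters `axis_gap` and `axis_height` are illustrative assumptions, not taken from the cited works.

```python
def parallel_coordinates_polylines(rows, axis_gap=1.0, axis_height=1.0):
    """Return one polyline (list of (x, y) vertices) per sample.

    Axis i is the vertical line x = i * axis_gap. Each feature is scaled
    linearly so its minimum maps to y = 0 and its maximum to y = axis_height.
    (Layout parameters are assumptions made for this sketch.)
    """
    k = len(rows[0])
    lows = [min(r[i] for r in rows) for i in range(k)]
    highs = [max(r[i] for r in rows) for i in range(k)]
    polylines = []
    for r in rows:
        vertices = []
        for i in range(k):
            span = highs[i] - lows[i]
            # A constant feature has no range; place it at the axis bottom.
            t = (r[i] - lows[i]) / span if span else 0.0
            vertices.append((i * axis_gap, t * axis_height))
        polylines.append(vertices)
    return polylines


samples = [[5.0, 10.0, 0.0], [7.0, 20.0, 4.0], [9.0, 30.0, 2.0]]
lines = parallel_coordinates_polylines(samples)
```

Because every additional feature needs its own vertical axis, the horizontal space per axis shrinks as k grows, which is the space limitation mentioned above.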
Ester et al. [10] proposed DBSCAN to discover arbitrarily shaped clusters. However, it may not handle data sets that contain clusters with different densities. The OPTICS method, derived from the DBSCAN algorithm, uses visualization for visual cluster analysis [1] and is useful for finding density-based clusters in spatial data. Like most clustering algorithms, OPTICS is a parametric approach. Yang et al. [39] proposed a visual hierarchical dimension reduction technique, which groups dimensions and visualizes data by using the subset of dimensions obtained from each group. In [2] and [36], some features that affect the quality of visualization are introduced, and some of the above systems are compared based on the listed features.
Another well-known approach to data visualization is Star Coordinate [25] and its extensions, such as VISTA [6]. The proposed method is based on the Star Coordinate technique. Star Coordinates arranges coordinate axes on a two-dimensional surface, where all axes share the same origin point. It uses a linear mapping to avoid breaking clusters when mapping from k-dimensional to 2D space (this has been proven mathematically in [8]). Several extensions of VISTA have been introduced so far. iVIBRATE [7] is a framework for visualizing large datasets using data sampling and the Star Coordinate model. In [37], an Enhanced VISTA is proposed which improves visualization and eases human-computer interaction. Experiments have shown that visual cluster rendering can improve the understanding of clusters, and validate and refine the algorithmic clustering result effectively [25].
VISTA is a very good interactive approach for the visualization of k-dimensional data where k < 50, and its efficiency has been shown in various articles. The main shortcoming of this method is that the number of dimensions must be less than 50. Since one coordinate axis is drawn for each dimension of the data, working with VISTA tools becomes exhausting when the number of dimensions exceeds 50, and its interactivity becomes practically useless. This problem becomes more serious when the number of dimensions is much greater than 50. However, many real-world datasets have a large number of features, e.g., textual data, image data, and bioinformatics data.
In this paper we propose a novel semi-supervised visualization method for high-dimensional data, where a fraction of the data is labeled. The visualization result achieved by applying this method is optimal in terms of discernibility by the user. This work extends the capabilities of Star Coordinates for working with high-dimensional datasets.
Star Coordinates is a visualization technique for mapping high-dimensional data into two dimensions. In this technique, a 2D plane is divided into k equal sectors ($\theta$, the angle of each sector, is set to $2\pi/k$ by default). There are therefore k coordinate axes of equal length, each representing one dimension of the data, and all sharing a common origin at the center of a circle in the 2D space (Fig. 1) [25]. Data values are scaled to the length of the axis so that the smallest value maps to the origin and the largest to the other end of the axis. Unit vectors on each coordinate axis are then calculated to allow this scaling of data values to the length of the coordinate axes.
The mapping of a point from k-dimensional space to a point in two-dimensional Cartesian coordinates is determined by the sum of all unit vectors ($\vec{u}_i$), each multiplied by the value of the data element for that coordinate, as shown in Formula (1):

$$P_j = \sum_{i=1}^{k} \vec{u}_i \,\bigl(d_{ji} - \min\nolimits_i\bigr) \tag{1}$$

where $D_j = (d_{j1}, \ldots, d_{jk})$ is a k-dimensional data element, $\min_i$ is the smallest value in dimension i, and $P_j$ is its two-dimensional projected point. The main idea of Star Coordinate is to arrange the coordinate axes on a two-dimensional plane, where the coordinate axes are not necessarily orthogonal to each other [25]. The projection of high-dimensional data to 2D space inevitably introduces overlapping, ambiguities, and even bias: multiple points in k-dimensional space may map to one point in Cartesian space. However, it has been shown that a cluster can always be preserved as a point-cloud in the visual space through linear mappings [8]. The only problem is that these point-clouds may overlap one another. To help the user find the best possible mapping, Star Coordinates and its extension VISTA [6] provide several visual adjustment mechanisms, such as axis scaling (α-adjustment in VISTA) and axis angle rotation. Both axis scaling and angle rotation are linear transformations. Since linear mapping does not break clusters, the clusters in the multi-dimensional space are still visualized as dense point-clouds (the “visual clusters”) in two-dimensional space, and the visible gaps between the visual clusters in two-dimensional visual space indicate real gaps between point-clouds in the original high-dimensional space [7]. These two transformations are described next.
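The mapping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from [25]: the function name `star_coordinates`, the `axis_length` parameter, and the choice to index axes from 0 at angle $2\pi i/k$ are all hypothetical.

```python
import math


def star_coordinates(rows, axis_length=1.0):
    """Project k-dimensional rows to 2D with evenly spaced axes (sketch)."""
    k = len(rows[0])
    lows = [min(r[i] for r in rows) for i in range(k)]
    highs = [max(r[i] for r in rows) for i in range(k)]
    # Axis i points at angle 2*pi*i/k. Its vector is scaled so that the
    # feature's minimum lands on the origin and its maximum on the axis end.
    axes = []
    for i in range(k):
        span = highs[i] - lows[i]
        scale = axis_length / span if span else 0.0
        theta = 2.0 * math.pi * i / k
        axes.append((scale * math.cos(theta), scale * math.sin(theta)))
    points = []
    for r in rows:
        x = sum(axes[i][0] * (r[i] - lows[i]) for i in range(k))
        y = sum(axes[i][1] * (r[i] - lows[i]) for i in range(k))
        points.append((x, y))
    return points


data = [
    [0.0, 0.0, 0.0, 0.0],  # every feature at its minimum
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 1.0, 2.0],  # every feature at its maximum
    [0.0, 2.0, 0.0, 0.0],
]
projected = star_coordinates(data)
```

With four evenly spaced axes, opposite axes cancel, so the all-minimum row and the all-maximum row both land (up to rounding) on the origin. This is a small instance of the ambiguity noted above, where distinct k-dimensional points map to the same 2D point.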
Rotating an axis modifies the direction of the axis's unit vector, and thus makes the corresponding feature (dimension) more or less correlated with other features. This can resolve the overlapping problem substantially, because it helps the user distinguish between clusters that may incorrectly overlap. Rotation can be modeled in the Star Coordinate using the Euler formula: $e^{i\theta} = \cos\theta + i\sin\theta$, where $z = x + iy$, and i is the imaginary unit. However, as experimental results have shown, adjusting the scaling transformation is enough to find a satisfactory visualization. Therefore we can keep $\theta_i$ constant at $2\pi i/k$, where i here denotes the axis index [8].
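The effect of rotating a single axis can be illustrated with complex arithmetic, treating each axis as the complex number $e^{i\theta}$. The sketch below assumes feature values that are already normalized, and the helper name `project_complex` and the example points are hypothetical.

```python
import cmath
import math


def project_complex(values, angles):
    """Sum of value * e^{i*theta}; real part is x, imaginary part is y."""
    return sum(v * cmath.exp(1j * t) for v, t in zip(values, angles))


k = 3
angles = [2.0 * math.pi * i / k for i in range(k)]  # default, evenly spaced

point_a = [1.0, 0.0, 0.0]
point_b = [2.0, 1.0, 1.0]

# With the default angles the two points coincide in the plane.
before_a = project_complex(point_a, angles)
before_b = project_complex(point_b, angles)

# Rotate only the second axis by 30 degrees and project again.
rotated = list(angles)
rotated[1] += math.pi / 6.0
after_a = project_complex(point_a, rotated)
after_b = project_complex(point_b, rotated)
```

Here `point_a` does not use the rotated axis, so its image stays put, while the image of `point_b` moves away from it and the overlap disappears.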
Scaling transformations allow users to change the length of an axis, thus increasing or decreasing the contribution of a particular data column (dimension or feature) to the resulting visualization [25]. Using axis scaling interactively, a user can observe the data distribution changing dynamically. This is done by adding α to Formula (1), as in iVIBRATE [7]:

$$P_j = \frac{c}{k}\sum_{i=1}^{k} \alpha_i\,\vec{u}_i \,\bigl(d_{ji} - \min\nolimits_i\bigr)$$

where $\vec{u}_i$ is the unit vector of axis i, $d_{ji}$ is the value of data element j in dimension i, and $\alpha_i$ ($i = 1, \ldots, k$, $\alpha_i \in [-1, 1]$) provides the visually adjustable parameters. As mentioned in [7], $\alpha_i \in [-1, 1]$ covers a considerable range of mapping functions, and this range, combined with the scaling factor c, is effective enough for finding a satisfactory visualization. It is known that linear mapping does not break clusters, but may cause cluster overlaps [26]. Fig. 2 shows the initial data distribution of the Iris dataset from the UCI machine learning repository (available online from http://www.ics.uci.edu/~mlearn/databases/). Iris is a four-dimensional dataset with 150 records and 3 clusters. Fig. 3(A) depicts the original data distribution of the Iris dataset together with the cluster indices obtained by applying the K-means clustering algorithm in VISTA, in which clusters overlap. Fig. 3(B) shows a better separated cluster distribution of Iris using α-adjustment performed interactively by an expert user.
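A small numeric sketch of α-adjustment follows. It assumes rows that are already min-max normalized to [0, 1], axes at fixed angles $2\pi i/k$ indexed from 0, and a scaling factor `c`; the function names `alpha_projection` and `distance` are hypothetical and this is not the VISTA or iVIBRATE code.

```python
import math


def alpha_projection(row, alphas, c=1.0):
    """Project one normalized row with per-axis weights alpha_i in [-1, 1]."""
    k = len(row)
    x = sum(a * v * math.cos(2.0 * math.pi * i / k)
            for i, (a, v) in enumerate(zip(alphas, row)))
    y = sum(a * v * math.sin(2.0 * math.pi * i / k)
            for i, (a, v) in enumerate(zip(alphas, row)))
    return (c * x / k, c * y / k)


def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


low = [0.0, 0.0, 0.0, 0.0]
high = [1.0, 1.0, 1.0, 1.0]

default = [1.0, 1.0, 1.0, 1.0]     # all axes at full positive weight
adjusted = [1.0, 1.0, -1.0, -1.0]  # flip the two axes that cancel the others

gap_default = distance(alpha_projection(low, default),
                       alpha_projection(high, default))
gap_adjusted = distance(alpha_projection(low, adjusted),
                        alpha_projection(high, adjusted))
```

With the default weights the two rows project to the same location, whereas flipping the sign of two weights pulls them apart. Both projections are linear, so neither can split a cluster; only the amount of overlap changes.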
As shown in the above figure, the Star Coordinate approach can effectively visualize data using user interaction. However, as mentioned in [7], Star Coordinate and its variants such as VISTA [6] are limited to visualizing data with a maximum of 50 dimensions. When the number of dimensions exceeds 50, visualization through user interaction is practically impossible. As a result, the cluster overlapping problem cannot be resolved in high-dimensional data visualization using conventional approaches. In the proposed method, this problem is resolved, provided that a fraction of the data (even if small) is labeled. Using this label information, k-dimensional data can be mapped onto the two-dimensional plane so that the clusters of the data are as recognizable as possible to the user.
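To make the idea of label-guided axis adjustment concrete, the sketch below searches for axis weights that separate a handful of labeled points. It is an illustrative stand-in only: the separation score (between-class over within-class scatter) and the random search are assumptions chosen for brevity, not the cost model or optimizer of the proposed method, which this excerpt does not show. All function names are hypothetical.

```python
import math
import random


def project(row, alphas):
    """Star Coordinate projection of a normalized row with axis weights."""
    k = len(row)
    x = sum(a * v * math.cos(2.0 * math.pi * i / k)
            for i, (a, v) in enumerate(zip(alphas, row))) / k
    y = sum(a * v * math.sin(2.0 * math.pi * i / k)
            for i, (a, v) in enumerate(zip(alphas, row))) / k
    return (x, y)


def separation_score(rows, labels, alphas):
    """Between-class over within-class scatter of projected labeled points."""
    points = [project(r, alphas) for r in rows]
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    centroids = {
        lab: (sum(p[0] for p in ps) / len(ps), sum(p[1] for p in ps) / len(ps))
        for lab, ps in groups.items()
    }
    within = sum(
        (p[0] - centroids[lab][0]) ** 2 + (p[1] - centroids[lab][1]) ** 2
        for p, lab in zip(points, labels)
    )
    labs = list(centroids)
    between = sum(
        (centroids[a][0] - centroids[b][0]) ** 2
        + (centroids[a][1] - centroids[b][1]) ** 2
        for n, a in enumerate(labs)
        for b in labs[n + 1:]
    )
    return between / (within + 1e-12)  # small constant avoids division by zero


def random_search(rows, labels, trials=500, seed=0):
    """Try random weight vectors in [-1, 1]^k and keep the best-scoring one."""
    rng = random.Random(seed)
    k = len(rows[0])
    best_alphas = [1.0] * k
    best_score = separation_score(rows, labels, best_alphas)
    for _ in range(trials):
        candidate = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        score = separation_score(rows, labels, candidate)
        if score > best_score:
            best_alphas, best_score = candidate, score
    return best_alphas, best_score


# Four labeled samples, two per class; with all weights equal to 1 they
# project to the same location because opposite axes cancel.
labeled_rows = [
    [0.1, 0.5, 0.1, 0.5],
    [0.2, 0.4, 0.2, 0.4],
    [0.9, 0.5, 0.9, 0.5],
    [0.8, 0.6, 0.8, 0.6],
]
labeled_classes = [0, 0, 1, 1]

baseline = separation_score(labeled_rows, labeled_classes, [1.0] * 4)
best_alphas, best_score = random_search(labeled_rows, labeled_classes)
```

In this toy data the classes differ mainly in the first and third features, so weight vectors that emphasize those axes score higher. Once weights are chosen from the labeled fraction, the same projection would be applied to the unlabeled samples.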
Section snippets
The proposed approach
In the proposed approach, label information of a fraction of the data is employed to enhance visualization results. In the field of pattern recognition, there is a similar subject named semi-supervised clustering [5] and in data mining it is known as domain knowledge based clustering. The effect of domain knowledge application in information visualization has been shown in the studies in these fields [6].
In order to find the best mapping for data visualization, the optimal configuration of axes
Experimental result
In this section, we present several experimental results that illustrate the effectiveness of the proposed method. We test our approach on four data sets, some properties of which are shown in Table 1. The proposed method was implemented in MATLAB running under Windows XP. The results of the experiments have been compared with those of VISTA and with a dataset visualized manually by an expert user. Moreover, since there is no comparable technique for extending Star Coordinate to
Effect of the number of labeled samples
In this section, the effect of the number of labeled samples on the proposed method is studied. As shown in Fig. 9, the results are satisfactory over a wide range of values but, as the number of labeled samples decreases, the overlapping between clusters increases.
As shown in Fig. 9A and B, when the number of labeled samples is too small (1 per cluster), the degree of cluster overlap is high. In Fig. 9F, 11 samples (less than 8% of all 150 data points) are used but cluster separation is
Conclusion and future work
In this paper, we presented an extension to the Star Coordinate method that enables the application of this method to the visualization of high-dimensional data and requires no manual axes adjustment by the user. Our approach addresses the main problem with the Star Coordinate approach, namely that when the number of data dimensions is large (about 50 and more) manual modification to visualization parameters is almost impossible to achieve. We showed that the best data visualization is achieved
References (33)
- et al. OPTICS: Ordering points to identify the clustering structure.
- et al. Quality metrics in high-dimensional data visualization: an overview and systematization.
- et al. Local dimensionality reduction: a new approach to indexing high dimensional spaces.
- et al. VISTA: Validating and refining clusters via visualization. J. Inform. Vis. (2004).
- et al. iVIBRATE: Interactive visualization-based framework for clustering large datasets. ACM Trans. Inform. Syst. (TOIS) (2006).
- et al. CloudVista: Visual cluster exploration for extreme scale data in the cloud.
- Visualizing Data (1993).
- et al. A density-based algorithm for discovering clusters in large spatial databases with noise.
- et al. Jacobi–Davidson style QR and QZ algorithms for the reduction of matrix pencils. SIAM J. Sci. Comput. (1998).
- Regularized discriminant analysis. J. Am. Stat. Assoc.
- Hierarchical parallel coordinates for exploration of large datasets.
- Introduction to Statistical Pattern Recognition.
- Principal Manifolds for Data Visualization and Dimension Reduction.
- A survey of text summarization extractive techniques. J. Emerg. Technol. Web Intell.
- Table Visualizations: A Formal Model and Its Applications.