Elsevier

Applied Soft Computing

Volume 27, February 2015, Pages 474-486
Discernible visualization of high dimensional data using label information

https://doi.org/10.1016/j.asoc.2014.09.026

Highlights

  • Visualization methods could significantly improve the outcome of automated knowledge discovery systems by involving human judgment.

  • Star coordinate is a visualization technique that maps k-dimensional data onto a circle using a set of axes sharing the same origin at the center of the circle.

  • We propose a novel method for automatic axes adjustment for high dimensional data in the Star Coordinate visualization method.

  • This method finds the best 2-dimensional view point (discernible visualization) that minimizes intra-cluster distances while keeping the inter-cluster distances as large as possible by using label information.

  • The label information could be provided by the user or could be the result of performing a conventional clustering method over the input data.

Abstract

Visualization methods could significantly improve the outcome of automated knowledge discovery systems by involving human judgment. Star coordinate is a visualization technique that maps k-dimensional data onto a circle using a set of axes sharing the same origin at the center of the circle. It lets users adjust this mapping by scaling and rotating the axes until no mapped point-clouds (clusters) overlap one another. In this state, similar groups of data are easily detectable. However, an effective adjustment can be difficult or even impossible for the user in high dimensions, especially when the input space has about 50 or more dimensions.

In this paper, we propose a novel method for automatic axes adjustment for high dimensional data in the Star Coordinate visualization method. This method finds the best two-dimensional view point that minimizes intra-cluster distances while keeping the inter-cluster distances as large as possible by using label information. We call this view point a discernible visualization, where clusters are easily detectable by the human eye. The label information could be provided by the user or could be the result of performing a conventional clustering method over the input data. The proposed approach optimizes the Star Coordinate representation by formulating the problem as a maximization of a Fisher discriminant. Therefore, the problem has a unique global solution and polynomial time complexity. We also prove that manipulating the scaling factor alone is effective enough for creating any given visualization mapping. Moreover, it is shown that k-dimensional data visualization can be modeled as an eigenvalue problem. Using this approach, an optimal axes adjustment in the Star Coordinate method for high dimensional data can be achieved without any user intervention. The experimental results demonstrate the effectiveness of the proposed approach in terms of accuracy and performance.

Introduction

Our three-dimensional perspective limits our conceptual experience of higher-dimensional spaces. Nevertheless, interaction with high-dimensional spaces is becoming increasingly unavoidable. The development of science and technology has led to substantial growth in data production, beyond any human being's capacity to comprehend. Billions of Web pages in cyberspace, huge volumes of geographical information, large amounts of biological data and numerous business databases are just small portions of the available data.

This has been the main motivation for obtaining geometric models (such as graphs of 2 or 3 variables) of the multivariate relationships that arise when analyzing large sets of high dimensional data. Consequently, numerous “visualization” approaches have been proposed that try to achieve the best map from k dimensions to 2 or 3 dimensions, one that the human brain can discern effortlessly.

The unprecedented growth of data production and the limited ability of the human brain have made data visualization an interesting subject in computer science in recent years. As Card et al. described, visualization is “the use of computer-supported, interactive, visual representations of abstract data to amplify cognition.” Visualization is considered one of the most intuitive methods for cluster detection and validation, and it performs especially well for the representation of irregularly shaped clusters [27], [32].

Other approaches to overcoming the problems of high dimensionality are dimension reduction [4], [20] and feature selection [24]. Data sampling and data summarization could also help to cope with large numbers of data records [17], [28]. Scientists interested in these fields face a similar problem in exploratory analysis or visualization of multivariate data.

Star Coordinate is a visualization technique for mapping k-dimensional data into Cartesian coordinates, in which the coordinate axes are arranged on a circle in a two-dimensional plane with the origin at the center of the circle. It has been proved that in this mapping technique, a cluster can always be preserved as a point-cloud (or cluster) in the visual space through linear mappings. The main problem arises when these mapped point-clouds overlap one another, making their boundaries indistinguishable. Therefore, the user is given the ability to push and pull or rotate the axes until the desired outcome is achieved. However, an advantageous adjustment is difficult or even impossible for a human to achieve when visualizing high dimensional data. As a result, some researchers have proposed various dimension reduction methods as pre-processing steps before applying the Star Coordinate visualization technique.

In this paper, we focus on the problem of automatic axes adjustment in the Star Coordinate technique for improved visualization results. Our goal is to find the best possible projection that can represent the original data topology of k-dimensional data, especially where k is greater than 50, which effectively makes manual axes adjustment impossible. The rest of the paper is organized as follows. Section 1.1 presents a discussion of related work. The main features of the Star Coordinate algorithm are briefly discussed in Section 1.2. Then, the proposed method is introduced in Section 2. In Section 3, we present the experimental results that validate the cost model. Section 4 presents a discussion of the experimental results. Finally, Section 5 concludes the presented approach.

Numerous approaches have been proposed for the visualization of multi-dimensional datasets. Scatterplot matrix [9], parallel coordinates [21] and dimensional stacking [31] have been developed to address this issue. Parallel coordinates (PC) [21] is a well-known method in which features are represented by parallel vertical axes linearly scaled within their data range. Each sample is represented by a polygonal line that intersects each axis at its respective attribute data value. Parallel coordinates can be used to study the correlations among various attributes by spotting the locations of the intersection points [44]. They are also useful for detecting data distributions and functional dependencies. The main challenge of the parallel coordinates approach is the limited space available for each parallel axis. There are several extensions of parallel coordinates, such as Circular Parallel Coordinates [19] and Hierarchical Parallel Coordinates [13].

Ester et al. [10] proposed DBSCAN to discover arbitrarily shaped clusters. DBSCAN may not handle data sets that contain clusters with different densities. The OPTICS method, derived from the DBSCAN algorithm, uses visualization for visual cluster analysis [1] and is useful for finding density-based clusters in spatial data. Like most clustering algorithms, OPTICS is a parametric approach. Yang et al. [39] proposed a visual hierarchical dimension reduction technique, which groups dimensions and visualizes data by using the subset of dimensions obtained from each group. In [2] and [36], some features that affect the quality of visualization have been introduced, and some of the above systems are compared based on the listed features.

Another well-known approach for data visualization is Star Coordinate [25] and its extensions, such as VISTA [6]. The proposed method is based on the Star Coordinate technique. Star Coordinates arranges coordinate axes on a two-dimensional surface, where each axis shares the same origin point. It uses a linear mapping to avoid breaking clusters when mapping from k-dimensional to 2D space (this has been proven mathematically in [8]). So far, several extensions for VISTA have been introduced. iVIBRATE [7] is a framework for visualizing large datasets using data sampling and the Star Coordinate model. In [37], an Enhanced VISTA is proposed which improves visualization and eases human-computer interaction. Experiments have shown that visual cluster rendering can improve the understanding of clusters, and validate and refine the algorithmic clustering result effectively [25].

VISTA is a very good interactive approach for visualization of k-dimensional data where k < 50, and its efficiency has been demonstrated in various articles. The main shortcoming of this method is that the dimension must be less than 50. Since one coordinate axis is drawn for each dimension of the data, working with VISTA tools becomes very exhausting for humans when the number of dimensions is more than 50, and its interactivity becomes practically useless. This problem becomes more serious when the number of dimensions is much greater than 50. However, there are many datasets with a large number of features, e.g., textual data, image data, bioinformatics data, etc.

In this paper, we propose a novel semi-supervised visualization method for high dimensional data, where a fraction of the data is labeled. The visualization result achieved by applying this method is optimal in terms of discernibility by the user. This work extends the capabilities of Star Coordinates in working with high-dimensional datasets.

Star Coordinates is a visualization technique for mapping high-dimensional data into two dimensions. In this technique a 2D plane is divided into k equal sectors ($\theta_i$, the angle of the i-th axis, is set to $2\pi i/k$ by default). Therefore there are k coordinate axes of the same length, with each axis representing one dimension of the data and all axes sharing their origin at the center of a circle in the 2D space (Fig. 1) [25]. Data values are scaled to the length of the axis, in such a way that the smallest value is mapped to the origin and the largest to the other end of the axis. Unit vectors on each coordinate axis are then calculated accordingly to allow scaling of data values to the length of the coordinate axes.

The mapping of a point from k-dimensional space to a point in two-dimensional Cartesian coordinates is determined by the sum of all unit vectors $(u_{xi}, u_{yi})$ on each coordinate, each multiplied by the value of the data element for that coordinate, as shown in Formula (1):

$$P_j(x, y) = \left( \sum_{i=1}^{k} u_{xi}\,(d_{ji} - \min_i),\ \sum_{i=1}^{k} u_{yi}\,(d_{ji} - \min_i) \right) \tag{1}$$

$$D_j = (d_{j0}, d_{j1}, \ldots, d_{ji}, \ldots, d_{jk}), \quad u_i = \frac{C_i}{\max_i - \min_i}, \quad \min_i = \min\{d_{ji},\ 0 \le j < |D|\}, \quad \max_i = \max\{d_{ji},\ 0 \le j < |D|\}$$

where $D_j$ is a k-dimensional data element and $P_j(x, y)$ is its two-dimensional projected point. The main idea of Star Coordinate is to arrange the coordinate axes on a two-dimensional plane, where the coordinate axes are not necessarily orthogonal to each other [25]. The projection of high dimensional data to 2D space inevitably introduces overlapping and ambiguities, and even bias: multiple points in k-dimensional space may map to one point in Cartesian space. However, it has been shown that a cluster can always be preserved as a point-cloud in the visual space through linear mappings [8]. The only problem is that these point-clouds may overlap one another. To help the user find the best possible mapping, Star Coordinates and its extension VISTA [6] provide several visual adjustment mechanisms, such as axis scaling (α-adjustment in VISTA) and axis angle rotation. Both axis scaling and angle rotation are linear transformations. Since linear mapping does not break clusters, the clusters in the multi-dimensional space are still visualized as dense point-clouds (the “visual clusters”) in two-dimensional space, and the visible gaps between the visual clusters in two-dimensional visual space indicate the real gaps between point-clouds in the original high dimensional space [7]. These two transformations are described in the following.
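
Formula (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `star_coordinates` is hypothetical, the axes are taken to have unit length with the default angles 2πi/k, and constant dimensions are skipped to avoid division by zero (a choice not discussed in the text).

```python
import math

def star_coordinates(data):
    """Project k-dimensional rows onto 2D with evenly spaced unit-length axes."""
    k = len(data[0])
    # Per-dimension minimum and maximum over the whole dataset.
    mins = [min(row[i] for row in data) for i in range(k)]
    maxs = [max(row[i] for row in data) for i in range(k)]
    points = []
    for row in data:
        x = y = 0.0
        for i in range(k):
            span = maxs[i] - mins[i]
            if span == 0:
                continue  # assumption: a constant dimension contributes nothing
            theta = 2 * math.pi * i / k          # default axis angle
            scaled = (row[i] - mins[i]) / span   # value rescaled to [0, 1]
            x += math.cos(theta) * scaled
            y += math.sin(theta) * scaled
        points.append((x, y))
    return points
```

With four dimensions the axes point right, up, left and down, so the rows `[0, 0, 0, 0]` and `[1, 1, 1, 1]` both land on the origin. This is a small instance of the ambiguity noted above, where distinct k-dimensional points map to the same 2D point.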

Rotating an axis modifies the direction of the axis's unit vector, thus making the corresponding data attribute (dimension) more or less correlated with the other attributes. This can resolve the overlapping problem substantially and helps the user distinguish between clusters that may incorrectly overlap. Rotation can be expressed by modeling the Star Coordinate using the Euler formula $e^{ix} = \cos x + i \sin x$, where $z = x + iy$ and $i$ is the imaginary unit. However, as experimental results have shown, adjusting the scaling transformation is enough to find a satisfactory visualization. Therefore we can keep $\theta_i$ constant at $\theta_i = 2\pi i/k$ [8].
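
The Euler-formula view of axis rotation can be illustrated with Python's built-in complex numbers. The sketch below is an assumption-laden illustration, not the VISTA implementation: the helper names `axis_vectors`, `rotate_axis` and `project` are hypothetical, and the input row is assumed to be already rescaled to [0, 1] per dimension.

```python
import cmath
import math

def axis_vectors(k):
    """Unit axis vectors as complex numbers e^{i*theta_i}, theta_i = 2*pi*i/k."""
    return [cmath.exp(1j * 2 * math.pi * i / k) for i in range(k)]

def rotate_axis(axes, index, phi):
    """Return a copy of axes with one axis rotated by phi radians."""
    rotated = list(axes)
    # Multiplying by e^{i*phi} turns the axis without changing its length.
    rotated[index] = axes[index] * cmath.exp(1j * phi)
    return rotated

def project(normalized_row, axes):
    """Sum value * axis; the real part is x and the imaginary part is y."""
    z = sum(v * a for v, a in zip(normalized_row, axes))
    return (z.real, z.imag)
```

For example, with `axes = axis_vectors(4)` the row `[1, 1, 1, 1]` projects to the origin, while `project([1, 1, 1, 1], rotate_axis(axes, 0, math.pi / 2))` moves it to approximately (-1, 1), since the first axis now points up instead of right.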

Scaling transformations allow users to change the length of an axis, thus increasing or decreasing the contribution of a particular data column (a particular dimension or feature) to the resulting visualization [25]. Using axis scaling interactively, a user can observe the data distribution changing dynamically. This is done by adding α to Formula (1) in iVIBRATE [7] as follows:

$$P_j(x, y) = \left( \frac{c}{k} \sum_{i=1}^{k} \alpha_i u_{xi}\,(d_{ji} - \min_i),\ \frac{c}{k} \sum_{i=1}^{k} \alpha_i u_{yi}\,(d_{ji} - \min_i) \right)$$

where $\alpha_i$ ($i = 1, \ldots, k$, $\alpha_i \in [-1, 1]$) provides the visually adjustable parameters. As mentioned in [7], $\alpha_i \in [-1, 1]$ covers a considerable range of mapping functions, and this range, combined with the scaling factor c, is effective enough for finding a satisfactory visualization. It is known that linear mapping does not break clusters, but may cause cluster overlaps [26]. Fig. 2 shows the initial data distribution of the Iris dataset from the UCI machine learning repository (available online from http://www.ics.uci.edu/~mlearn/databases/). Iris is a four-dimensional dataset with 150 records and 3 clusters. Fig. 3(A) depicts the original data distribution of the Iris dataset together with the cluster indices achieved by applying the K-means clustering algorithm in VISTA, in which clusters overlap. Fig. 3(B) shows a better separated cluster distribution of Iris using α-adjustment performed interactively by an expert user.
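
The α-scaled mapping can be sketched as follows. This is a minimal illustration of the formula above, not the iVIBRATE code: the function name `scaled_star_coordinates` is hypothetical, unit-length axes at the default angles 2πi/k are assumed, and constant dimensions are skipped.

```python
import math

def scaled_star_coordinates(data, alphas, c=1.0):
    """Star Coordinates projection with per-axis weights alpha_i in [-1, 1]."""
    k = len(data[0])
    if len(alphas) != k:
        raise ValueError("need one alpha per dimension")
    if any(a < -1.0 or a > 1.0 for a in alphas):
        raise ValueError("each alpha must lie in [-1, 1]")
    mins = [min(row[i] for row in data) for i in range(k)]
    maxs = [max(row[i] for row in data) for i in range(k)]
    points = []
    for row in data:
        x = y = 0.0
        for i in range(k):
            span = maxs[i] - mins[i]
            if span == 0:
                continue  # assumption: a constant dimension contributes nothing
            theta = 2 * math.pi * i / k
            # alpha_i stretches, shrinks or flips the contribution of axis i.
            weight = alphas[i] * (row[i] - mins[i]) / span
            x += math.cos(theta) * weight
            y += math.sin(theta) * weight
        points.append((c / k * x, c / k * y))
    return points
```

With `data = [[0, 0, 0, 0], [1, 1, 1, 1]]` and all α equal to 1, both rows fall on the origin. Setting `alphas = [1, 1, -1, -1]` with `c = 4.0` moves the second row to approximately (2, 2), which shows how α-adjustment alone can pull apart points that overlap in the default view.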

As shown in the above figure, the Star Coordinate approach can effectively visualize data using user interaction. However, as mentioned in [7], Star Coordinate and its variants such as VISTA [6] are limited to visualizing data with a maximum of 50 dimensions. When the number of dimensions is more than 50, visualization using user interaction is practically impossible. As a result, the cluster overlapping problem cannot be resolved in high-dimensional data visualization using conventional approaches. In the proposed method, this problem is resolved, provided that a fraction of the data (even a small one) is labeled. Using this label information, k-dimensional data can be mapped onto the two-dimensional plane, with the clusters of the data as recognizable as possible for the system user.
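
To make "as recognizable as possible" concrete, one can score a 2D projection with a Fisher-style ratio of between-cluster to within-cluster scatter, computed only over the labeled points. The sketch below uses the common trace-ratio form as an assumption; the paper's exact criterion is not reproduced in this excerpt, and the function name `fisher_score` and the use of `None` for unlabeled points are hypothetical conventions.

```python
def fisher_score(points, labels):
    """Between-class scatter divided by within-class scatter for 2D points.

    Points whose label is None are treated as unlabeled and ignored.
    """
    groups = {}
    for p, label in zip(points, labels):
        if label is not None:
            groups.setdefault(label, []).append(p)

    def mean(ps):
        n = len(ps)
        return (sum(p[0] for p in ps) / n, sum(p[1] for p in ps) / n)

    overall = mean([p for ps in groups.values() for p in ps])
    between = within = 0.0
    for ps in groups.values():
        m = mean(ps)
        # Squared distance of the cluster mean from the overall mean.
        between += len(ps) * ((m[0] - overall[0]) ** 2 + (m[1] - overall[1]) ** 2)
        # Squared distances of the cluster's points from their own mean.
        within += sum((p[0] - m[0]) ** 2 + (p[1] - m[1]) ** 2 for p in ps)
    return between / within if within > 0 else float("inf")
```

A higher score means tighter, better separated labeled clusters. Two candidate axis configurations could be compared by projecting the data with each and keeping the one with the larger score.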

The proposed approach

In the proposed approach, label information of a fraction of the data is employed to enhance visualization results. In the field of pattern recognition, there is a similar subject named semi-supervised clustering [5] and in data mining it is known as domain knowledge based clustering. The effect of domain knowledge application in information visualization has been shown in the studies in these fields [6].

In order to find the best mapping for data visualization, the optimal configuration of axes

Experimental result

In this section, we present several experimental results that illustrate the effectiveness of the proposed method. We test our approach on four data sets. Some properties of the data sets are shown in Table 1. The proposed method was implemented in MATLAB running under Windows XP. The results of the experiments have been compared with those of VISTA and also with a dataset visualized manually by an expert user. Moreover, since there is no comparable technique for extending Star Coordinate to

Effect of the number of labeled samples

In this section, the effect of the number of labeled samples on the proposed method is studied. As shown in Fig. 9, the results are satisfactory over a wide range of values but, as the number of labeled samples decreases, the overlapping between clusters increases.

As shown in Fig. 9A and B, when the number of labeled samples is too small (1 per cluster), the degree of cluster overlap is high. In Fig. 9F, 11 samples (less than 8% of all 150 data points) are used but cluster separation is

Conclusion and future work

In this paper, we presented an extension to the Star Coordinate method that enables the application of this method to the visualization of high-dimensional data and requires no manual axes adjustment by the user. Our approach addresses the main problem with the Star Coordinate approach, namely that when the number of data dimensions is large (about 50 or more), manual modification of the visualization parameters is almost impossible. We showed that the best data visualization is achieved

References (33)

  • M. Ankerst et al. OPTICS: Ordering points to identify the clustering structure

  • E. Bertini et al. Quality metrics in high-dimensional data visualization: an overview and systematization

  • K. Chakrabarti et al. Local dimensionality reduction: a new approach to indexing high dimensional spaces

  • K. Chen et al. VISTA: Validating and refining clusters via visualization. J. Inform. Vis. (2004)

  • K. Chen et al. iVIBRATE: Interactive visualization-based framework for clustering large datasets. ACM Trans. Inform. Syst. (TOIS) (2006)

  • K. Chen et al. CloudVista: Visual cluster exploration for extreme scale data in the cloud

  • W.S. Cleveland. Visualizing Data (1993)

  • M. Ester et al. A density-based algorithm for discovering clusters in large spatial databases with noise

  • D.R. Fokkema et al. Jacobi–Davidson style QR and QZ algorithms for the reduction of matrix pencils. SIAM J. Sci. Comput. (1998)

  • J.H. Friedman. Regularized discriminant analysis. J. Am. Stat. Assoc. (1989)

  • Y.-H. Fua et al. Hierarchical parallel coordinates for exploration of large datasets

  • K. Fukunaga. Introduction to Statistical Pattern Recognition (1990)

  • A.N. Gorban et al. Principal Manifolds for Data Visualization and Dimension Reduction (2007)

  • V. Gupta et al. A survey of text summarization extractive techniques. J. Emerg. Technol. Web Intell. (2010)

  • P.E. Hoffman. Table Visualizations: A Formal Model and Its Applications (1999)