Unsupervised dimensionality reduction for very large datasets: Are we going to the right direction?

https://doi.org/10.1016/j.knosys.2020.105777

Abstract

Given a set of millions or even billions of complex objects for descriptive data mining, how can we effectively reduce the data dimensionality? It must be done in an unsupervised way. Unsupervised dimensionality reduction is essential for analytical tasks like clustering and outlier detection because it helps to overcome the drawbacks of the “curse of high dimensionality”. The state-of-the-art approach is to preserve the data variance by means of well-known techniques, such as PCA, KPCA, SVD, and techniques derived from them, such as PUFS. But is this always the best strategy to follow? This paper presents an exploratory study that compares two distinct approaches: (a) the standard variance preservation, and (b) an alternative, rarely used Fractal-based solution, for which we propose a fast and scalable Spark-based algorithm with a novel feature-partitioning approach that allows it to tackle data of high dimensionality. Both strategies were evaluated by inserting into 11 real-world datasets, with up to 123.5 million elements and 518 attributes, at most 500 additional attributes formed by correlations of many kinds, such as linear, quadratic, logarithmic and exponential, and then verifying their abilities to remove this redundancy. The results indicate that, at least for large datasets with up to 1,000 attributes, our proposed Fractal-based algorithm is the best option. It accurately and efficiently removed the redundant attributes in nearly all cases, as opposed to the standard variance-preservation strategy, which presented considerably worse results even with KPCA, an approach designed for non-linear correlations.

Introduction

The volume and complexity of data generated in scientific and commercial applications have been increasing in multiple domains. This panorama has motivated the development of techniques that automatically help users to analyze, understand and extract knowledge from very large datasets, especially from complex data such as collections of images, audio, data streams, social network graphs, DNA sequences, and so on [1]. For example, given a set with millions or even billions of images from Flickr or Facebook, how can one identify patterns that describe user behavior to support targeted marketing? Although many real applications depend on the analysis of large datasets, like the aforementioned one, the best current algorithms tend to be inefficient or ineffective on data with many attributes [2].

The main challenge in analyzing data with many attributes is a phenomenon known as the “curse of high dimensionality”: increasing the number of attributes in data objects leads to fast degradation in the performance and accuracy of many analytical algorithms [3], [4]. Preprocessing by dimensionality reduction is the main technique applied in these cases; it aims to decrease the number of attributes, thus reducing both the effects of high dimensionality and the amount of data to be analyzed and stored, which is feasible because real-world data usually present non-uniform distributions and attribute correlations [5], [6]. Existing methods are either supervised or unsupervised.

Supervised dimensionality reduction uses external knowledge, e.g., label information, and aims at obtaining a subset of non-redundant attributes that are relevant to a very specific mining task, such as classification or regression [3]. Unsupervised methods, by contrast, do not use a priori knowledge about the output and aim at removing all redundant attributes. They can be used as preprocessing tools in a wide variety of descriptive data mining and machine learning tasks, such as clustering and outlier detection [3], [7]. To achieve this goal, the state-of-the-art methods often aim at preserving most of the data variance, using well-known strategies such as Principal Component Analysis (PCA), Kernel PCA (KPCA) and Singular Value Decomposition (SVD). Unfortunately, these techniques present a central drawback: they are either unable to identify and eliminate non-linear attribute correlations (the PCA-based and SVD-based approaches) or they cannot process data of high cardinality at a reasonable computational cost (the KPCA-based ones). Correlations of these types are very likely to exist in real data; in Biology, for example, the co-expression patterns of genes in a gene network can be non-linear, and in Physics, the pressure, volume and temperature of an ideal gas exhibit non-linear relationships [5]. Since such correlations abound and the amount of data keeps increasing, this drawback compromises the usability of unsupervised dimensionality reduction as a whole.
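As a toy illustration of this drawback, the sketch below (our own, built with scikit-learn on synthetic data; it is not part of the experimental setup described later, and all names and coefficients are arbitrary) creates a dataset with one linearly and one quadratically redundant attribute. PCA assigns nearly zero variance to the linear copy, but the quadratic copy still receives a full component of its own, so variance preservation alone cannot reveal that only two attributes are really needed.

    # Toy sketch (assumes scikit-learn and NumPy); synthetic data only.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(42)
    n = 10_000
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)

    lin = 3.0 * x1 + 1.0      # linearly redundant with x1
    quad = x2 ** 2            # non-linearly (quadratically) redundant with x2

    X = np.column_stack([x1, x2, lin, quad])
    X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize

    pca = PCA().fit(X)
    print(np.round(pca.explained_variance_ratio_, 3))
    # Roughly [0.5, 0.25, 0.25, 0.0]: one component vanishes (the linear copy),
    # but the quadratic redundancy still occupies a component of its own.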

But is variance preservation always the best strategy to follow? This paper investigates the question through an exploratory study that compares two distinct approaches for unsupervised dimensionality reduction: (a) the standard variance preservation, and (b) an alternative, rarely used Fractal-based solution, for which we propose a fast and scalable Spark-based algorithm with a novel feature-partitioning approach that allows it to tackle data of high dimensionality. We show that, at least for large datasets with up to 1,000 attributes, our proposed Fractal-based algorithm is the best option: it scales to millions or even billions of objects and is more accurate at spotting and eliminating correlations of many kinds, such as linear, quadratic, logarithmic and exponential ones. Its main limitation is the requirement of large volumes of data, which abound nowadays. As a consequence, we point out in this paper that the increasing volume of data available in the current era of Big Data enables the use of the Fractal Theory to spot relevant attributes in data of high dimensionality. Our main contributions are:

  • 1.

    Extensive exploratory evaluation: We report the results of a detailed exploratory study, using 11 datasets with up to 123.5 million elements and 518 attributes from the physics, finance, transportation, energy, electricity, image, audio, and climatic domains, systematically evaluating and validating the ability of the variance-preservation and Fractal-based approaches to remove many types of attribute correlations;

  • 2.

    Novel algorithm: We propose Fractal Redundancy Elimination (FReE), a new parallel and distributed dimensionality reduction algorithm that uses concepts from the Fractal Theory and Apache Spark to deal with data of high cardinality. FReE also implements a novel feature-partitioning strategy that we carefully developed to make it suited to processing high-dimensionality data.
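The full FReE algorithm is presented in Section 3. As a rough, simplified illustration of the kind of distributed computation it relies on (this sketch is not our actual implementation and omits the feature-partitioning strategy; the code layout is ours and purely illustrative), grid-cell occupancies can be counted with Spark at several grid resolutions and the correlation fractal dimension D2 obtained as the slope of the resulting log-log curve:

    # Simplified illustration only (assumes PySpark and NumPy): distributed
    # box counting of S2(r) = sum of squared cell occupancies, per resolution.
    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fractal-sketch").getOrCreate()
    sc = spark.sparkContext

    # Synthetic points in the unit cube; a real run would read a dataset.
    points = sc.parallelize(np.random.rand(100_000, 3).tolist()).cache()

    log_r, log_s2 = [], []
    for level in range(1, 6):                        # 2^level cells per axis
        cells = 2 ** level
        s2 = (points
              .map(lambda p, c=cells: (tuple(int(v * c) for v in p), 1))
              .reduceByKey(lambda a, b: a + b)       # occupancy of each cell
              .map(lambda kv: kv[1] ** 2)            # squared occupancy
              .sum())                                # S2(r), with r = 1 / cells
        log_r.append(np.log(1.0 / cells))
        log_s2.append(np.log(float(s2)))

    # D2 is the slope of log S2(r) versus log r, fitted in the scaling region.
    d2 = np.polyfit(log_r, log_s2, 1)[0]
    print("estimated D2:", d2)                       # close to 3 for this data

Section 3 details how the actual algorithm partitions the features and uses such fractal-dimension estimates to flag redundant attributes.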

Fig. 1 exemplifies our results on 4 of the real datasets, from distinct domains, that we studied. It compares the standard variance-preservation approach with our novel Fractal-based algorithm by: (a) increasing the number of redundant attributes, and (b) increasing the number of attributes involved in each correlation. As can be seen in Fig. 1a, our novel Fractal-based algorithm (FReE) detected redundant attributes with high accuracy, even when we inserted five hundred redundant attributes into each dataset, unlike the variance-preservation PCA, SVD, PUFS, and KPCA approaches. Here, each new attribute inserted depends exclusively on one original attribute. In Fig. 1b, we also show that, unlike PCA, SVD, PUFS, and KPCA, the Fractal-based approach detected redundant attributes even when each correlation involved more than one original attribute; FReE correctly removed each new attribute whose correlation depends on up to dozens of the original attributes. Details and additional results are given later in the paper; see Sections 4 (Proposed evaluation) and 5 (Experimental results).

Observation: for reproducibility, all code, detailed results and datasets used in this paper are freely available for download at: http://bit.ly/2k5QiQH.

The rest of this paper follows a traditional organization: background concepts and related works (Section 2), methodology (Section 3, Proposed method, and Section 4, Proposed evaluation), experiments (Section 5), and conclusions (Section 6).

Background concepts and related works

The “curse of high dimensionality” is a common problem in data mining and machine learning, and dimensionality reduction is the main technique applied to overcome it [3], [4], [7], [8]. It removes redundant information by mapping the original data space into another space of lower dimensionality [3]. Here, two concepts are fundamental [4], [9]: the Embedded Dimensionality E, which is the total number of attributes, and the Intrinsic Dimensionality D, which is the minimum number of attributes needed to represent the essential characteristics of the data.
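To make the difference between E and D concrete, the short sketch below (an illustration written for this discussion with NumPy, not code from our experiments) estimates the correlation fractal dimension of a synthetic dataset that is embedded in E = 3 attributes but whose points lie on a one-dimensional curve, so its intrinsic dimensionality is close to 1:

    # Illustration: E = 3 attributes, but the points lie on a 1-d curve (D ~ 1).
    import numpy as np

    rng = np.random.default_rng(0)
    t = rng.random(200_000)
    X = np.column_stack([t, np.sin(2 * np.pi * t), t ** 2])

    # Normalize every attribute to [0, 1).
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)

    log_r, log_s2 = [], []
    for level in range(2, 9):
        cells = 2 ** level
        idx = np.floor(X * cells).astype(np.int64)     # grid cell of each point
        _, counts = np.unique(idx, axis=0, return_counts=True)
        log_r.append(np.log(1.0 / cells))
        log_s2.append(np.log(np.sum(counts.astype(float) ** 2)))

    d2 = np.polyfit(log_r, log_s2, 1)[0]
    print(f"E = 3, estimated intrinsic dimensionality D2 ~ {d2:.2f}")   # ~1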

Proposed method

Fractal-based techniques assume that the intrinsic dimensionality of the object described by the dataset is smaller, generally much smaller, than that of the space where the object is represented, i.e., the embedded dimensionality. Since this assumption holds in the vast majority of real situations, these techniques are useful for indicating a target dimensionality to which the dataset can be reduced, with or without Fractal Theory techniques, thereby diminishing the dimensionality curse.
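The idea behind fractal-based attribute selection can be sketched as follows (a simplified, single-machine illustration in the spirit of the fractal feature-selection literature cited in Section 2, e.g., Traina Jr. et al.; it is not the distributed FReE algorithm, and the tolerance value is arbitrary): an attribute is redundant when removing it barely changes the correlation fractal dimension of the dataset.

    # Single-machine sketch of fractal-based attribute elimination; illustrative
    # only, not the FReE implementation. Tolerance and data are arbitrary.
    import numpy as np

    def d2(X, levels=range(2, 8)):
        """Estimate the correlation fractal dimension by box counting."""
        X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
        log_r, log_s2 = [], []
        for level in levels:
            cells = 2 ** level
            idx = np.floor(X * cells).astype(np.int64)
            _, counts = np.unique(idx, axis=0, return_counts=True)
            log_r.append(np.log(1.0 / cells))
            log_s2.append(np.log(np.sum(counts.astype(float) ** 2)))
        return np.polyfit(log_r, log_s2, 1)[0]

    def backward_elimination(X, tol=0.1):
        """Drop attributes whose removal changes the fractal dimension < tol."""
        keep = list(range(X.shape[1]))
        changed = True
        while changed and len(keep) > 1:
            changed = False
            base = d2(X[:, keep])
            for j in list(keep):
                rest = [k for k in keep if k != j]
                if abs(d2(X[:, rest]) - base) < tol:
                    keep = rest            # attribute j added no dimensionality
                    changed = True
                    break
        return keep

    # Toy usage: the third attribute is a quadratic function of the first one,
    # so only two attributes should remain after the elimination.
    rng = np.random.default_rng(1)
    a, b = rng.random(50_000), rng.random(50_000)
    X = np.column_stack([a, b, a ** 2])
    print("attributes kept:", backward_elimination(X))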

Proposed evaluation

This section describes the methodology that we propose to evaluate dimensionality reduction techniques. It includes the materials used and details about the systematic evaluation performed to answer the following questions:

  • Q1

    In practice, what types of attribute correlations is our Fractal-based algorithm capable of removing? What about PCA, SVD, PUFS, and KPCA?

  • Q2

    What is the influence of increasing the number of redundant attributes on the techniques studied?

  • Q3

    Are the techniques capable of
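The core of this evaluation is the injection of redundant attributes formed by correlations of several kinds (linear, quadratic, logarithmic and exponential) into each dataset, followed by checking whether the techniques remove them. The sketch below roughly illustrates that injection step (the actual generator, parameters and datasets are available in the repository referenced in Section 1; the function and coefficients here are illustrative only):

    # Rough illustration of redundancy injection (not the paper's generator).
    import numpy as np

    def inject_redundancy(X, n_new, n_involved=1, seed=0):
        """Append n_new redundant attributes, each a linear, quadratic,
        logarithmic or exponential function of n_involved original ones."""
        rng = np.random.default_rng(seed)
        n, e = X.shape
        new_cols = []
        for _ in range(n_new):
            cols = rng.choice(e, size=n_involved, replace=False)
            base = X[:, cols] @ rng.uniform(0.5, 2.0, size=n_involved)
            kind = rng.choice(["linear", "quadratic", "log", "exp"])
            if kind == "linear":
                new = 2.0 * base + 1.0
            elif kind == "quadratic":
                new = base ** 2
            elif kind == "log":
                new = np.log1p(np.abs(base))
            else:
                new = np.exp(np.clip(base, -10.0, 10.0))
            new_cols.append(new)
        return np.column_stack([X] + new_cols)

    # Example: add 5 redundant attributes, each involving 2 original ones.
    X = np.random.default_rng(1).random((10_000, 8))
    print(X.shape, "->", inject_redundancy(X, n_new=5, n_involved=2).shape)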

Experimental results

In this section, we discuss the experiments performed to answer the questions posed at the beginning of Section 4. The experiments used a Microsoft Azure cluster with 8 machines: two masters, each with 4 cores, 14 GB of RAM and 200 GB of disk; and 6 workers, each with 8 cores, 28 GB of RAM and 600 GB of disk. We configured the machines with GNU/Linux Ubuntu 16.04 server x64. We also used 6 machines from Amazon Web Services (r5a.4xlarge), each with 16 cores, 128 GB of RAM and 100 GB

Conclusion

Given a set of millions or even billions of complex objects for descriptive data mining, how can we effectively reduce the data dimensionality to overcome the drawbacks of the “curse of high dimensionality”? It must be done in an unsupervised way. The state-of-the-art approach is to preserve the data variance by means of well-known techniques, such as PCA, KPCA, SVD, and techniques derived from them, such as PUFS. But is this always the best strategy to follow? This

CRediT authorship contribution statement

Jadson Jose Monteiro Oliveira: Data curation, Conceptualization, Methodology, Software, Writing - original draft, Investigation, Validation, Writing - review & editing. Robson Leonardo Ferreira Cordeiro: Supervision, Conceptualization, Methodology, Software, Writing - original draft, Investigation, Validation, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was supported by FAPESP, Brazil (São Paulo State Research Foundation) [grants 2018/05714-5 and 2016/17078-0], CAPES, Brazil (Brazilian Coordination for Improvement of Higher Level Personnel) [grant 001], CNPq, Brazil (Brazilian National Council for Supporting Research) [grant 166887/2017-0], Microsoft Azure Research, Brazil, and Amazon Web Services Cloud Credits for Research, Brazil.

References (23)

  • Golay, J., et al. Unsupervised feature selection based on the Morisita estimator of intrinsic dimension. Knowl.-Based Syst. (2017)
  • Chen, H., et al. Parallel attribute reduction in dominance-based neighborhood rough set. Inform. Sci. (2016)
  • Cordeiro, R.L.F., et al. Data Mining in Large Sets of Complex Data (2013)
  • Sun, Z., et al. Data intensive parallel feature selection method study
  • Fraideinberze, A.C., et al. Effective and unsupervised fractal-based feature selection for very large datasets: Removing linear and non-linear attribute correlations
  • Tung, A.K.H., et al. CURLER: Finding and visualizing nonlinear correlation clusters
  • Faloutsos, C., et al. Beyond uniformity and independence: Analysis of R-trees using the concept of fractal dimension
  • Cheng, K., et al. Unsupervised feature selection in signed social networks
  • Zhang, C., et al. Feature selection method based on multi-fractal dimension and harmony search algorithm and its application. Internat. J. Systems Sci. (2016)
  • Traina Jr., C., et al. Fast feature selection using fractal dimension. JIDM (2010)
  • Palma-Mendoza, R.J., et al. Distributed ReliefF-based feature selection in Spark. Knowl. Inf. Syst. (2018)