Abstract
High-dimensional data sets pose important challenges, such as the curse of dimensionality and increased computational cost. Dimensionality reduction is therefore a crucial step for most data mining applications, and feature selection techniques are one way to achieve it. However, it is now common to deal with huge data sets, and most existing feature selection algorithms are designed to run in a centralized fashion, which makes them non-scalable. Moreover, some of them require the selection process to be validated against a target variable, which restricts their applicability to the supervised learning setting. In this paper we propose a parallel, scalable, exact implementation of an existing centralized, unsupervised feature selection algorithm on Spark, an efficient big data framework for large-scale distributed computation that outperforms MapReduce on multi-pass algorithms. We validate the efficiency of the implementation using 1 GB of real Internet traffic captured at a medium-sized ISP.
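The distributed pattern the abstract describes, i.e. computing per-partition statistics in parallel and combining them with an associative reduce, can be illustrated with a minimal sketch. This is not the paper's algorithm or the Spark API: the partition layout, the function names, and the variance-based ranking criterion are all illustrative assumptions, using only the Python standard library in place of Spark's `map`/`reduce` operations.

```python
# Hypothetical sketch of a map/reduce-style unsupervised feature
# selection pass. Each partition is mapped to local sufficient
# statistics, which are merged with an associative reduce -- the same
# shape of computation Spark distributes across a cluster.
from functools import reduce

def partition_stats(partition):
    # Per-partition row count, per-column sums, and sums of squares.
    n_features = len(partition[0])
    s = [0.0] * n_features
    sq = [0.0] * n_features
    for row in partition:
        for j, x in enumerate(row):
            s[j] += x
            sq[j] += x * x
    return len(partition), s, sq

def merge(a, b):
    # Associative, commutative combine step (the "reduce" side).
    n = a[0] + b[0]
    s = [x + y for x, y in zip(a[1], b[1])]
    sq = [x + y for x, y in zip(a[2], b[2])]
    return n, s, sq

def select_features(partitions, k):
    # Rank columns by variance (an illustrative unsupervised criterion;
    # the paper's actual selection criterion differs).
    n, s, sq = reduce(merge, (partition_stats(p) for p in partitions))
    variances = [sq[j] / n - (s[j] / n) ** 2 for j in range(len(s))]
    return sorted(range(len(s)), key=lambda j: variances[j], reverse=True)[:k]

data = [[[1.0, 10.0, 0.0], [2.0, 10.0, 0.0]],   # partition 1
        [[3.0, 10.0, 1.0], [4.0, 10.0, 1.0]]]   # partition 2
print(select_features(data, 2))  # column 1 is constant, so it is never picked
```

Because `merge` is associative and needs no target labels, the statistics for each partition can be computed independently on separate workers and combined in any order, which is what makes this pattern scale and remain unsupervised.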
The research leading to these results has been developed within the ONTIC project, which has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 619633.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Ordozgoiti, B., Gómez Canaval, S., Mozo, A. (2015). Massively Parallel Unsupervised Feature Selection on Spark. In: Morzy, T., Valduriez, P., Bellatreche, L. (eds) New Trends in Databases and Information Systems. ADBIS 2015. Communications in Computer and Information Science, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-319-23201-0_21
DOI: https://doi.org/10.1007/978-3-319-23201-0_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23200-3
Online ISBN: 978-3-319-23201-0