Abstract
Dimensionality reduction is often crucial for the application of machine learning and data mining. Feature selection methods can be employed for this purpose, with the advantage of preserving interpretability. There exist unsupervised feature selection methods based on matrix factorization algorithms, which can help choose the most informative features in terms of approximation error. Randomized methods have recently been proposed that provide better theoretical guarantees and better approximation errors than their deterministic counterparts, but their computational cost can be significant on large, high-dimensional data sets. Some existing randomized and deterministic approaches require computing the singular value decomposition in \(O(mn\min(m,n))\) time (for m samples and n features) to obtain leverage scores, which compromises their applicability to domains of even moderately high dimensionality. In this paper we propose the use of Probabilistic PCA to compute the leverage scores in \(O(mnk)\) time, enabling the application of some of these randomized methods to large, high-dimensional data sets. We show that this approach rapidly provides an approximation of the leverage scores that works well in this context. In addition, we offer a parallelized version built on the Resilient Distributed Datasets (RDD) paradigm of Apache Spark, making it horizontally scalable to enormous numbers of data instances. We validate the performance of our approach on data sets comprising real-world and synthetic data.
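The idea in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it uses the \(\sigma^2 \to 0\) EM limit of probabilistic PCA (Roweis-style EM for PCA) to estimate the top-k principal subspace in \(O(mnk)\) per iteration, then reads off approximate feature leverage scores as the squared row norms of an orthonormal basis of that subspace. The function name and parameters are illustrative.

```python
import numpy as np

def ppca_leverage_scores(X, k, n_iter=50, seed=0):
    """Approximate column (feature) leverage scores of X via EM iterations.

    Each iteration costs O(mnk), avoiding the O(mn min(m, n)) of a full
    SVD. The returned scores sum to k (one score per feature).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    Xc = X - X.mean(axis=0)           # centre the data
    Y = Xc.T                          # n x m: columns are data points in R^n
    W = rng.standard_normal((n, k))   # random initial subspace basis
    for _ in range(n_iter):
        # E-step: project the data onto the current subspace
        Z = np.linalg.solve(W.T @ W, W.T @ Y)     # k x m
        # M-step: re-estimate the basis from the projections
        W = Y @ Z.T @ np.linalg.inv(Z @ Z.T)      # n x k
    Q, _ = np.linalg.qr(W)            # orthonormal basis of the subspace
    return np.sum(Q**2, axis=1)       # squared row norms = leverage scores
```

With a clear spectral gap, a few dozen iterations are typically enough for the scores to match those computed from the exact top-k right singular vectors; the highest-scoring features are then the candidates for selection.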
The research leading to these results has received funding from the European Union under the FP7 grant agreement n. 619633 (project ONTIC) and H2020 grant agreement n. 671625 (project CogNet).
Notes
1. If the cluster has enough memory, we recommend caching this RDD, since it is used in two different operations.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Ordozgoiti, B., Canaval, S.G., Mozo, A. (2017). Probabilistic Leverage Scores for Parallelized Unsupervised Feature Selection. In: Rojas, I., Joya, G., Catala, A. (eds) Advances in Computational Intelligence. IWANN 2017. Lecture Notes in Computer Science(), vol 10306. Springer, Cham. https://doi.org/10.1007/978-3-319-59147-6_61
Print ISBN: 978-3-319-59146-9
Online ISBN: 978-3-319-59147-6