Abstract
Dimensionality reduction is often crucial for the application of machine learning and data mining. Feature selection methods can be employed for this purpose, with the advantage of preserving interpretability. There exist unsupervised feature selection methods based on matrix factorization algorithms, which can help choose the most informative features in terms of approximation error. Randomized methods have recently been proposed that provide better theoretical guarantees and better approximation errors than their deterministic counterparts, but their computational cost can be significant on large, high-dimensional data sets. Some existing randomized and deterministic approaches require computing the singular value decomposition in \(O(mn\min(m,n))\) time (for m samples and n features) to obtain leverage scores, which compromises their applicability to domains of even moderately high dimensionality. In this paper we propose the use of Probabilistic PCA to compute the leverage scores in \(O(mnk)\) time, enabling the application of some of these randomized methods to large, high-dimensional data sets. We show that this approach rapidly provides an approximation of the leverage scores that works well in this context. In addition, we offer a parallelized version built on the Resilient Distributed Datasets (RDD) paradigm of Apache Spark, making it horizontally scalable to enormous numbers of data instances. We validate the performance of our approach on data sets comprising real-world and synthetic data.
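The idea in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it uses the \(\sigma^2 \to 0\) EM limit of probabilistic PCA (Roweis-style EM for PCA) to estimate the top-k principal subspace in \(O(mnk)\) per iteration, then reads off approximate feature leverage scores as the squared row norms of an orthonormal basis of that subspace. The function name and parameters are illustrative.

```python
import numpy as np

def ppca_leverage_scores(X, k, n_iter=50, seed=0):
    """Approximate column (feature) leverage scores of X via EM iterations.

    Each iteration costs O(mnk), avoiding the O(mn min(m, n)) of a full
    SVD. The returned scores sum to k (one score per feature).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    Xc = X - X.mean(axis=0)           # centre the data
    Y = Xc.T                          # n x m: columns are data points in R^n
    W = rng.standard_normal((n, k))   # random initial subspace basis
    for _ in range(n_iter):
        # E-step: project the data onto the current subspace
        Z = np.linalg.solve(W.T @ W, W.T @ Y)     # k x m
        # M-step: re-estimate the basis from the projections
        W = Y @ Z.T @ np.linalg.inv(Z @ Z.T)      # n x k
    Q, _ = np.linalg.qr(W)            # orthonormal basis of the subspace
    return np.sum(Q**2, axis=1)       # squared row norms = leverage scores
```

With a clear spectral gap, a few dozen iterations are typically enough for the scores to match those computed from the exact top-k right singular vectors; the highest-scoring features are then the candidates for selection.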
The research leading to these results has received funding from the European Union under the FP7 grant agreement n. 619633 (project ONTIC) and H2020 grant agreement n. 671625 (project CogNet).
Notes
1. If the cluster has enough memory, we recommend caching this RDD, since it is used in two different operations.
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Ordozgoiti, B., Canaval, S.G., Mozo, A. (2017). Probabilistic Leverage Scores for Parallelized Unsupervised Feature Selection. In: Rojas, I., Joya, G., Catala, A. (eds) Advances in Computational Intelligence. IWANN 2017. Lecture Notes in Computer Science(), vol 10306. Springer, Cham. https://doi.org/10.1007/978-3-319-59147-6_61
Print ISBN: 978-3-319-59146-9
Online ISBN: 978-3-319-59147-6