Probabilistic Leverage Scores for Parallelized Unsupervised Feature Selection

  • Conference paper
  • Advances in Computational Intelligence (IWANN 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10306)

Abstract

Dimensionality reduction is often crucial for the application of machine learning and data mining. Feature selection methods can be employed for this purpose, with the advantage of preserving interpretability. There exist unsupervised feature selection methods based on matrix factorization algorithms, which can help choose the most informative features in terms of approximation error. Randomized methods have been proposed recently that provide better theoretical guarantees and better approximation errors than their deterministic counterparts, but their computational costs can be significant when dealing with large, high-dimensional data sets. Some existing randomized and deterministic approaches require the computation of the singular value decomposition in \(O(mn\min (m,n))\) time (for m samples and n features) to provide leverage scores. This compromises their applicability to domains of even moderately high dimensionality. In this paper we propose the use of Probabilistic PCA to compute the leverage scores in O(mnk) time, enabling the applicability of some of these randomized methods to large, high-dimensional data sets. We show that with this approach, we can rapidly provide an approximation of the leverage scores that works well in this context. In addition, we offer a parallelized version over the emerging Resilient Distributed Datasets (RDD) paradigm on Apache Spark, making it horizontally scalable to enormous numbers of data instances. We validate the performance of our approach on several data sets comprising real-world and synthetic data.

The research leading to these results has received funding from the European Union under the FP7 grant agreement n. 619633 (project ONTIC) and H2020 grant agreement n. 671625 (project CogNet).
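To illustrate the idea summarized in the abstract, the following NumPy sketch estimates the principal subspace with EM for probabilistic PCA (Tipping and Bishop), at O(mnk) cost per iteration rather than the O(mn min(m,n)) of a full SVD, and then reads approximate leverage scores off an orthonormal basis of the learned subspace. This is a hypothetical reconstruction, not the authors' implementation; the function name, initialization, and iteration count are assumptions.

```python
import numpy as np

def ppca_leverage_scores(X, k, n_iter=50, seed=0):
    """Approximate feature leverage scores via EM for probabilistic PCA.

    Each EM iteration costs O(mnk), versus O(mn*min(m,n)) for a full SVD.
    Hypothetical sketch; details may differ from the paper's method.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    Xc = X - X.mean(axis=0)            # PPCA assumes centered data
    W = rng.standard_normal((n, k))    # factor loadings, random init
    sigma2 = 1.0                       # isotropic noise variance
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables
        M = W.T @ W + sigma2 * np.eye(k)
        Minv = np.linalg.inv(M)
        Z = Xc @ W @ Minv                       # rows are E[z_i]
        Ezz = m * sigma2 * Minv + Z.T @ Z       # sum_i E[z_i z_i^T]
        # M-step: update loadings and noise variance
        W_new = (Xc.T @ Z) @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc**2)
                  - 2.0 * np.sum((Xc @ W_new) * Z)
                  + np.trace(Ezz @ (W_new.T @ W_new))) / (m * n)
        W = W_new
    # Leverage scores: squared row norms of an orthonormal basis of span(W)
    Q, _ = np.linalg.qr(W)
    return np.sum(Q**2, axis=1)
```

Since Q has k orthonormal columns, the scores sum to exactly k; features could then be sampled with probability proportional to their scores, as in leverage-score sampling schemes.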


Notes

  1. If the cluster has enough memory, we recommend caching this RDD, since it is used in two different operations.

  2. http://ict-ontic.eu/index.php/onts-data/onts-request-access.

  3. http://ict-ontic.eu/.


Author information

Correspondence to Bruno Ordozgoiti.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Ordozgoiti, B., Canaval, S.G., Mozo, A. (2017). Probabilistic Leverage Scores for Parallelized Unsupervised Feature Selection. In: Rojas, I., Joya, G., Catala, A. (eds.) Advances in Computational Intelligence. IWANN 2017. Lecture Notes in Computer Science, vol 10306. Springer, Cham. https://doi.org/10.1007/978-3-319-59147-6_61

  • DOI: https://doi.org/10.1007/978-3-319-59147-6_61

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59146-9

  • Online ISBN: 978-3-319-59147-6

  • eBook Packages: Computer Science (R0)
