Distributed Monte Carlo Feature Selection: Extracting Informative Features Out of Multidimensional Problems with Linear Speedup

  • Conference paper
Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery (BDAS 2015, BDAS 2016)

Abstract

Selecting informative features from the ever-growing results of high-throughput biological experiments requires specialized feature selection algorithms. One such method is Monte Carlo Feature Selection, which is straightforward yet computationally expensive. In this technical paper we present the architecture and performance of a development version of our distributed implementation of this algorithm, designed to run in multiprocessor as well as multihost computing environments, and potentially controllable through a web browser by non-IT staff. As a simple enhancement, our method is able to produce statistically interpretable output by means of permutation testing. Tested on the reference Golub et al. leukemia data, as well as on our own dataset of almost 2 million features, it has shown nearly linear speedup when executed with an increasing number of processors. Being platform independent, as well as open to extension, this application could become a valuable tool for researchers facing the challenge of ill-defined, high-dimensional feature selection problems.
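To make the idea in the abstract concrete, the following is a minimal sketch of Monte Carlo feature scoring combined with a permutation test. It is not the implementation described in the paper. The original Monte Carlo Feature Selection method builds decision trees on random feature subsets; here a nearest-centroid classifier is swapped in to keep the example short, and a feature's credit is assumed to be the test accuracy multiplied by its share of the separation between class centroids. That share is only meaningful if features are on comparable scales. All function names, parameters and the toy data are hypothetical.

```python
import random
from statistics import mean


def evaluate_subset(X, y, features, train_idx, test_idx):
    """Return test accuracy and each feature's share of the centroid separation."""
    classes = sorted({y[i] for i in train_idx})
    centroids = {
        c: [mean(X[i][f] for i in train_idx if y[i] == c) for f in features]
        for c in classes
    }
    hits = 0
    for i in test_idx:
        dist = {
            c: sum((X[i][f] - m) ** 2 for f, m in zip(features, centroids[c]))
            for c in classes
        }
        hits += min(dist, key=dist.get) == y[i]
    # Assumption: the spread of class centroids along a feature stands in for
    # the split-based importance that a decision tree would provide.
    spreads = [max(col) - min(col) for col in zip(*centroids.values())]
    total = sum(spreads)
    shares = [s / total if total else 1 / len(features) for s in spreads]
    return hits / len(test_idx), shares


def mcfs_scores(X, y, subset_size, n_subsets, seed):
    """Average credit per feature over many random feature subsets and splits."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    totals, counts = [0.0] * d, [0] * d
    for _ in range(n_subsets):
        features = rng.sample(range(d), subset_size)
        idx = rng.sample(range(n), n)  # random permutation of the sample indices
        cut = (2 * n) // 3             # two thirds for training, the rest for testing
        acc, shares = evaluate_subset(X, y, features, idx[:cut], idx[cut:])
        for f, share in zip(features, shares):
            totals[f] += acc * share
            counts[f] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def permutation_p_values(X, y, subset_size, n_subsets, n_permutations, seed=0):
    """Compare observed scores with scores obtained on shuffled class labels."""
    rng = random.Random(seed)
    observed = mcfs_scores(X, y, subset_size, n_subsets, seed)
    exceed = [0] * len(observed)
    for p in range(n_permutations):
        shuffled = y[:]
        rng.shuffle(shuffled)
        null = mcfs_scores(X, shuffled, subset_size, n_subsets, seed + p + 1)
        for f, (obs, nul) in enumerate(zip(observed, null)):
            exceed[f] += nul >= obs
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, [(e + 1) / (n_permutations + 1) for e in exceed]


def make_toy_data(n=60, d=6, seed=1):
    """Hypothetical data: only feature 0 depends on the class label."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(3.0 * label if f == 0 else 0.0, 1.0) for f in range(d)]
         for label in y]
    return X, y


X, y = make_toy_data()
scores, p_values = permutation_p_values(
    X, y, subset_size=2, n_subsets=100, n_permutations=19
)
for f, (s, p) in enumerate(zip(scores, p_values)):
    print(f"feature {f}: score={s:.3f} p={p:.2f}")
```

Each call to `mcfs_scores` depends only on its inputs and its seed, so the observed run and every permutation run are independent of one another. In a sketch like this they could be dispatched to separate workers, for example with `concurrent.futures.ProcessPoolExecutor`, which is one plausible reason why this class of algorithm can scale close to linearly with the number of processors.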

References

  1. Draminski, M., Rada-Iglesias, A., Enroth, S., Wadelius, C., Koronacki, J., Komorowski, J.: Monte Carlo feature selection for supervised classification. Bioinformatics 24, 110–117 (2008)

  2. Dramiński, M., Kierczak, M., Koronacki, J., Komorowski, J.: Monte Carlo feature selection and interdependency discovery in supervised classification. In: Koronacki, J., Raś, Z.W., Wierzchoń, S.T., Kacprzyk, J. (eds.) Advances in Machine Learning II. SCI, vol. 263, pp. 371–385. Springer, Heidelberg (2010)

  3. Golub, T., et al.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999)

  4. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: An update. SIGKDD Explor. 11, 10–18 (2009)

  5. International HapMap Consortium: The International HapMap Project. Nature 426, 789 (2003)

  6. Luque-Baena, R.M., Urda, D., Subirats, J.L., Franco, L., Jerez, J.M.: Application of genetic algorithms and constructive neural networks for the analysis of microarray cancer data. Theor. Biol. Med. Model. 11, 7 (2014)

  7. Pearson, K.: On lines and planes of closest fit to systems of points in space. Philos. Mag. Series 6(2), 559–572 (1901)

  8. Perneger, T.: What's wrong with Bonferroni adjustments. BMJ 316, 1236–1238 (1998)

  9. Quinlan, J.R.: Effective Akka. O’Reilly Media, Inc. ISBN: 1449360076, 9781449360078 (2013)

  10. Sidak, Z.: Rectangular confidence regions for the means of multivariate normal distributions. J. Am. Stat. Assoc. 62, 626–633 (1967)

  11. Storey, J.D.: A direct approach to false discovery rates. J. R. Stat. Soc. Series B (Stat. Methodol.) 64, 479–498 (2002)

  12. The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012)

Acknowledgements

I would like to thank Dr. Draminski for providing the latest version of the dmLab software for evaluation, as well as Najla Al-Harbi, Sara Bin Judia, Dr. Salma Majid and Dr. Ghazi Alsbeih (Faisal Specialist Hospital & Research Centre, Riyadh 11211, Kingdom of Saudi Arabia), and Bozena Rolnik (Data Mining Group) for providing the CNV data. Calculations were carried out using the computer cluster Ziemowit (http://www.ziemowit.hpc.polsl.pl) funded by the Silesian BIO-FARMA project No. POIG.02.01.00-00-166/08 in the Computational Biology and Bioinformatics Laboratory of the Biotechnology Centre in the Silesian University of Technology. The work was financially supported by NCN grant HARMONIA UMO-2013/08/M/ST6/00924 (LK).

Finally, I would like to thank the anonymous reviewers, who helped to improve the quality of the paper.

Author information

Corresponding author

Correspondence to Lukasz Krol.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Krol, L. (2016). Distributed Monte Carlo Feature Selection: Extracting Informative Features Out of Multidimensional Problems with Linear Speedup. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery. BDAS 2015, BDAS 2016. Communications in Computer and Information Science, vol 613. Springer, Cham. https://doi.org/10.1007/978-3-319-34099-9_35

  • DOI: https://doi.org/10.1007/978-3-319-34099-9_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-34098-2

  • Online ISBN: 978-3-319-34099-9

  • eBook Packages: Computer Science, Computer Science (R0)
