Abstract
Selecting informative features from the ever-growing output of high-throughput biological experiments requires specialized feature selection algorithms. One such method is Monte Carlo Feature Selection, which is straightforward yet computationally expensive. In this technical paper we present the architecture and performance of a development version of our distributed implementation of this algorithm, designed to run in multiprocessor as well as multihost computing environments and potentially controllable through a web browser by non-IT staff. As a simple enhancement, our method produces statistically interpretable output by means of permutation testing. Tested on the reference leukemia data of Golub et al., as well as on our own dataset of almost 2 million features, it showed nearly linear speedup as the number of processors increased. Being platform independent and open to extensions, this application could become a valuable tool for researchers facing ill-defined, high-dimensional feature selection problems.
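The permutation testing mentioned above can be illustrated with a minimal sketch. It is not the implementation described in the paper: the importance measure here is a toy stand-in (absolute difference of class means) rather than the tree-based relative importance used by Monte Carlo Feature Selection, and the names `mean_difference_score` and `permutation_p_values` are hypothetical. The idea shown is only that class labels are shuffled repeatedly, the importance is recomputed on each shuffle, and the empirical p-value is the fraction of shuffles scoring at least as high as the original labelling.

```python
import random
from typing import Callable, List, Sequence


def mean_difference_score(column: Sequence[float], labels: Sequence[int]) -> float:
    """Toy importance (an assumption, not the MCFS measure):
    absolute difference between the class means, for 0/1 labels."""
    ones = [x for x, y in zip(column, labels) if y == 1]
    zeros = [x for x, y in zip(column, labels) if y == 0]
    return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))


def permutation_p_values(
    columns: Sequence[Sequence[float]],
    labels: Sequence[int],
    score: Callable[[Sequence[float], Sequence[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Empirical p-value per feature: (1 + #permuted scores >= observed) / (1 + n)."""
    rng = random.Random(seed)
    observed = [score(col, labels) for col in columns]
    exceed = [0] * len(columns)
    shuffled = list(labels)
    for _ in range(n_permutations):
        # Shuffling keeps the class sizes but breaks any feature-label link.
        rng.shuffle(shuffled)
        for j, col in enumerate(columns):
            if score(col, shuffled) >= observed[j]:
                exceed[j] += 1
    return [(1 + e) / (1 + n_permutations) for e in exceed]


if __name__ == "__main__":
    labels = [0] * 5 + [1] * 5
    informative = [0, 0, 0, 0, 0, 10, 10, 10, 10, 10]  # separates the classes
    uninformative = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]     # identical class means
    print(permutation_p_values([informative, uninformative], labels,
                               mean_difference_score))
```

Because each permutation is independent of the others, the loop over permutations is the natural unit of work to distribute across processors or hosts, which is consistent with the near-linear speedup reported above.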
References
Draminski, M., Rada-Iglesias, A., Enroth, S., Wadelius, C., Koronacki, J., Komorowski, J.: Monte carlo feature selection for supervised classification. Bioinformatics 24, 110–117 (2008)
Dramiński, M., Kierczak, M., Koronacki, J., Komorowski, J.: Monte carlo feature selection and interdependency discovery in supervised classification. In: Koronacki, J., Raś, Z.W., Wierzchoń, S.T., Kacprzyk, J. (eds.) Advances in Machine Learning II. SCI, vol. 263, pp. 371–385. Springer, Heidelberg (2010)
Golub, T., et al.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software: An update. SIGKDD Explor. 11, 10–18 (2009)
International HapMap Consortium: The international hapmap project. Nature 426, 789 (2003)
Luque-Baena, R.M., Urda, D., Subirats, J.L., Franco, L., Jerez, J.M.: Application of genetic algorithms and constructive neural networks for the analysis of microarray cancer data. Theor. Biol. Med. Model. 11, 7 (2014)
Pearson, K.: On lines and planes of closest fit to systems of points in space. Philos. Mag. Series 6(2), 559–572 (1901)
Perneger, T.: What's wrong with Bonferroni adjustments. BMJ 316, 1236–1238 (1998)
Quinlan, J.R.: Effective Akka. O'Reilly Media, Inc. ISBN: 1449360076 9781449360078 (2013)
Sidak, Z.: Rectangular confidence regions for the means of multivariate normal distributions. J. Am. Stat. Assoc. 62, 626–633 (1967)
Storey, J.D.: A direct approach to false discovery rates. J. R. Stat. Soc. Series B (Stat. Methodol.) 64, 479–498 (2002)
The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012)
Acknowledgements
I would like to thank Dr. Draminski for providing the latest version of the dmLab software for evaluation, as well as Najla Al-Harbi, Sara Bin Judia, Dr. Salma Majid, Dr. Ghazi Alsbeih (Faisal Specialist Hospital & Research Centre, Riyadh 11211, Kingdom of Saudi Arabia), and Bozena Rolnik (Data Mining Group) for providing the CNV data. Calculations were carried out using the computer cluster Ziemowit (http://www.ziemowit.hpc.polsl.pl) funded by the Silesian BIO-FARMA project No. POIG.02.01.00-00-166/08 in the Computational Biology and Bioinformatics Laboratory of the Biotechnology Centre in the Silesian University of Technology. The work was financially supported by NCN grant HARMONIA UMO-2013/08/M/ST6/00924 (LK).
Finally, I would like to thank the anonymous reviewers, who helped to improve the quality of the paper.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Krol, L. (2016). Distributed Monte Carlo Feature Selection: Extracting Informative Features Out of Multidimensional Problems with Linear Speedup. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery. BDAS 2015, BDAS 2016. Communications in Computer and Information Science, vol 613. Springer, Cham. https://doi.org/10.1007/978-3-319-34099-9_35
DOI: https://doi.org/10.1007/978-3-319-34099-9_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-34098-2
Online ISBN: 978-3-319-34099-9
eBook Packages: Computer Science, Computer Science (R0)