MapReduce based parallel gene selection method

Applied Intelligence

Abstract

Microarray data analysis has been widely used to extract relevant biological information from thousands of genes expressed simultaneously in a specific cell. Although many genes are expressed in a sample tissue, most of them are irrelevant or insignificant for clinical diagnosis or disease classification because of missing values and noise. Finding a small, closely related gene set that accurately classifies disease cells is therefore an important research problem. At the same time, the rapidly increasing volume of microarray data calls for scalable gene selection methods. In this paper, we propose a scalable parallel gene selection method based on the MapReduce programming model. The proposed method uses the kNN classifier to evaluate classification accuracy, and it is evaluated on four real and three synthetic datasets. Experimental results show that the proposed method scales well on large data as the number of nodes increases, and that it achieves higher classification accuracy than classification with the whole gene set.
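The abstract does not spell out the algorithm, so the following is only a minimal sketch of how gene scoring could be split into map and reduce steps, not the authors' method. It assumes each gene is scored on its own by leave-one-out kNN accuracy, that genes are partitioned round-robin across mappers, and that the reducer keeps the top-scoring genes. All names (`loocv_knn_accuracy`, `map_partition`, `reduce_top`, `select_genes`) and the toy data are hypothetical, and the map and reduce steps run in a single process rather than on a cluster.

```python
from collections import Counter
from functools import reduce
from heapq import nlargest


def loocv_knn_accuracy(values, labels, k=3):
    """Leave-one-out kNN accuracy using one gene's expression values."""
    correct = 0
    for i, (v, y) in enumerate(zip(values, labels)):
        # Rank all other samples by distance to sample i on this gene.
        neighbours = sorted(
            (abs(v - w), j) for j, w in enumerate(values) if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbours)
        if votes.most_common(1)[0][0] == y:
            correct += 1
    return correct / len(labels)


def map_partition(partition, labels, k=3):
    """Map step (assumed): score every gene in one partition independently."""
    return [(loocv_knn_accuracy(vals, labels, k), gene) for gene, vals in partition]


def reduce_top(a, b, m):
    """Reduce step (assumed): merge two scored lists and keep the m best."""
    return nlargest(m, a + b)


def select_genes(expression, labels, n_partitions=2, m=2, k=3):
    """Partition genes, score each partition, and merge the top m genes."""
    genes = sorted(expression.items())
    partitions = [genes[i::n_partitions] for i in range(n_partitions)]
    mapped = [map_partition(p, labels, k) for p in partitions]
    top = reduce(lambda a, b: reduce_top(a, b, m), mapped, [])
    return [gene for _, gene in top]


# Toy data (made up for illustration): 6 samples, 4 genes.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "g1": [9, 8, 10, 1, 2, 0],   # separates the two classes
    "g2": [5, 5, 5, 5, 5, 5],    # carries no information
    "g3": [0, 1, 2, 9, 10, 8],   # separates the two classes
    "g4": [3, 7, 4, 6, 3, 7],    # noisy
}
selected = select_genes(expression, labels)
print(sorted(selected))
```

Because each gene is scored independently, the map step needs no communication between partitions, which is the property that would let such a scheme scale with the number of nodes. Scoring genes one at a time ignores interactions between genes; a method that evaluates gene subsets would need a different partitioning.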

Acknowledgments

This work was supported by a grant from Kyung Hee University in 2013 (KHU-20130441).

Author information

Corresponding author

Correspondence to Byeong-Soo Jeong.

About this article

Cite this article

Islam, A.K.M.T., Jeong, BS., Bari, A.T.M.G. et al. MapReduce based parallel gene selection method. Appl Intell 42, 147–156 (2015). https://doi.org/10.1007/s10489-014-0561-x
