Big data analytics in bioinformatics: architectures, techniques, tools and issues

  • Review Article
  • Published:
Network Modeling Analysis in Health Informatics and Bioinformatics

Abstract

Bioinformatics research is characterized by voluminous, incrementally growing datasets and complex data analytics methods. The machine learning methods used in bioinformatics are iterative and parallel, and they can be scaled to handle big data using distributed and parallel computing technologies. However, big data tools usually perform computation in batch mode and are not optimized for iterative processing or for high data dependency among operations. In recent years, parallel, incremental, and multi-view machine learning algorithms have been proposed. Similarly, graph-based architectures and in-memory big data tools have been developed to minimize I/O cost and optimize iterative processing. Even so, standard big data architectures are still lacking, and appropriate tools are not available for many important bioinformatics problems, such as fast construction of co-expression and regulatory networks and identification of salient modules, detection of complexes over growing protein-protein interaction data, fast analysis of massive DNA, RNA, and protein sequence data, and fast querying on incremental and heterogeneous disease networks. This paper addresses the issues and challenges posed by several big data problems in bioinformatics, and gives an overview of the state of the art and future research opportunities.
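To make one of the problems named above concrete, the following is a minimal sketch of co-expression network construction, assuming a simple Pearson-correlation threshold as the edge criterion. It is an illustration only and is not taken from the paper: the function names, the toy data, and the threshold of 0.9 are all hypothetical. The sketch compares every pair of genes, so the work grows quadratically with the number of genes, which is one reason this task becomes expensive on large datasets.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant profile has no defined correlation; treat it as no edge.
        return 0.0
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.9):
    """Return (gene1, gene2, r) for pairs with |r| >= threshold.

    `expression` maps a gene name to its expression levels across samples.
    The threshold value is an assumption chosen for illustration.
    """
    edges = []
    # Every unordered gene pair is examined once: O(n^2) comparisons.
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges


# Hypothetical toy data: gene -> expression level across four samples.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # rises in step with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls as geneA rises
    "geneD": [1.0, 3.0, 1.0, 3.0],  # only weakly related to the others
}

edges = coexpression_edges(toy_expression)
for g1, g2, r in edges:
    print(g1, g2, r)
```

In this toy input, geneA, geneB, and geneC end up pairwise connected, while geneD stays isolated because its correlation with each of the others falls below the assumed threshold. When new samples or genes arrive, this naive version recomputes every pair from scratch, which is the kind of cost that incremental and parallel methods aim to avoid.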

Notes

  1. www.cs.cmu.edu/~coke/.

  2. http://bhuvan.nrsc.gov.in.

  3. http://www.bina.com.

  4. http://www.ebi.ac.uk/arrayexpress.

  5. http://www.ncbi.nlm.nih.gov/geo.

  6. http://smd.princeton.edu.

  7. http://www.ddbj.nig.ac.jp.

  8. http://rdp.cme.msu.edu.

  9. http://www.mirbase.org.

  10. http://dip.doe-mbi.ucla.edu.

  11. http://string.embl.de.

  12. http://thebiogrid.org.

  13. http://www.geneontology.org.

  14. http://giraph.apache.org.

  15. http://www.mongodb.org.

  16. http://couchdb.apache.org.

  17. http://spark.apache.org/mllib.

  18. http://hadoop.apache.org.

  19. http://spark.apache.org.

  20. http://hama.apache.org.

  21. http://spark.apache.org/streaming.

  22. http://illumina.com/applications/microarrays/microarray-software/beeline.html.

  23. http://www.bioinformatics.bbsrc.ac.uk/projects/seqmonk.

References

  • Aggarwal CC, Reddy CK (eds)(2013) Data clustering: algorithms and applications. CRC Press

  • Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: ACM SIGMOD Record, vol 22. ACM, pp 207–216

  • Agrawal R, Shafer JC (1996) Parallel mining of association rules. IEEE Trans Knowl Data Eng 8(6):962–969

    Article  Google Scholar 

  • Agrawal R, Srikant R et al (1994) Fast algorithms for mining association rules. In: Proc. 20th int. conf. very large data bases, VLDB, vol 1215, pp 487–499

  • Ahmed H, Mahanta P, Bhattacharyya D, Kalita J (2014) Shifting-and-scaling correlation based biclustering algorithm. Comput Biol Bioinf IEEE ACM Trans 11(6):1239–1252

    Article  Google Scholar 

  • Ahmed H, Mahanta P, Bhattacharyya D, Kalita J, Ghosh A (2011) Intersected coexpressed subcube miner: an effective triclustering algorithm. In: Information and communication technologies (WICT), 2011 World Congress. IEEE, pp 846–851

  • Angiuoli SV, Matalka M, Gussman A, Galens K, Vangala M, Riley DR, Arze C, White JR, White O, Fricke WF (2011) CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinf 12(1):356

    Article  Google Scholar 

  • Arefin AS, Berretta R, Moscato P (2013) A GPU-based method for computing eigenvector centrality of gene-expression networks. In: Proceedings of the eleventh Australasian symposium on parallel and distributed computing, vol 140. Australian Computer Society, Inc., pp 3–11

  • Aumann Y, Feldman R, Lipshtat O, Manilla H (1999) Borders: an efficient algorithm for association generation in dynamic databases. J Intell Inf Syst 12(1):61–73

    Article  Google Scholar 

  • Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinf 4(1):2

    Article  Google Scholar 

  • Bagyamathi M, Inbarani HH (2015) A novel hybridized rough set and improved harmony search based feature selection for protein sequence classification. In: Hassanien AE, Azar AT, Snasael V, Kacprzyk J, Abawajy JH (eds) Big data in complex systems, vol 9. Springer, pp 173–204

  • Baraldi A, Bruzzone L, Blonda P (2006) A multiscale expectation-maximization semisupervised classifier suitable for badly posed image classification. Image Process IEEE Trans 15(8):2208–2225

    Article  Google Scholar 

  • Barbu A, She Y, Ding L, Gramajo G (2013) Feature selection with annealing for big data learning. arXiv:1310.2880 (preprint)

  • Barker MS, Dlugosch KM, Dinh L, Challa RS, Kane NC, King MG, Rieseberg LH (2010) EvoPipes. net: bioinformatic tools for ecological and evolutionary genomics. Evol Bioinf Online 6:143

    Article  Google Scholar 

  • Ben-Dor A, Chor B, Karp R, Yakhini Z (2003) Discovering local structure in gene expression data: the order-preserving submatrix problem. J Comput Biol 10:373–384

    Article  Google Scholar 

  • Berényi Z, Vajk I (2009) Probabilistic model for a distributed feature selection method. In: Soft computing applications, 2009. SOFA’09. 3rd International Workshop. IEEE, pp 27–32

  • Bergmann S, Ihmels J, Barkai N (2003) Iterative signature algorithm for the analysis of large-scale gene expression data. Phys Rev E 67:031,902–031,919

  • Berkhin P (2006) A survey of clustering data mining techniques. In: Kogan J, Nicholas C, Teboulle M (eds) Grouping multidimensional data. Springer, Berlin, Heidelberg, pp 25–71

  • Bhatia S, Prakash P, Pillai G (2008) Svm based decision support system for heart disease classification with integer-coded genetic algorithm to select critical features. In: Proceedings of the world congress on engineering and computer science, WCECS, pp 22–24

  • Bhattacharyya DK, Kalita JK (2013) Network anomaly detection: a machine learning perspective

  • Bishop CM et al (2006) Pattern recognition and machine learning, vol 4. Springer, New York

    MATH  Google Scholar 

  • Blum A (2015) Semi-supervised learning (2015)

  • Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A (2015) Distributed feature selection: an application to microarray data classification. Appl Soft Comput 30:136–150

    Article  Google Scholar 

  • Bolouri H (2014) Modeling genomic regulatory networks with big data. Trends Genet 30(5):182–191

    Article  Google Scholar 

  • Borthakur D (2007) The hadoop distributed file system: architecture and design. Hadoop Project Website 11(2007):21

    Google Scholar 

  • Bradley PS, Fayyad UM, Reina C et al (1998) Scaling clustering algorithms to large databases. In: KDD, pp 9–15

  • Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: ACM SIGMOD Record, vol 26. ACM, pp 255–264

  • Cai D, He X, Han J (2008) Srda: an efficient algorithm for large-scale discriminant analysis. Knowl Data Eng IEEE Trans 20(1):1–12

    Article  Google Scholar 

  • Calaway R, Edlefsen L, Gong L, Fast S (2016) Big data decision trees with r. Revolution

  • Cerami EG, Gross BE, Demir E, Rodchenkov I, Babur Ö, Anwar N, Schultz N, Bader GD, Sander C (2011) Pathway commons, a web resource for biological pathway data. Nucleic Acids Res 39(suppl 1):D685–D690

    Article  Google Scholar 

  • Chakraborty S, Nagwani N (2011) Analysis and study of incremental k-means clustering algorithm. In: High performance architecture and grid computing. Springer, Berlin, Heidelberg, pp 338–341

  • Chaudhuri K, Kakade SM, Livescu K, Sridharan K (2009) Multi-view clustering via canonical correlation analysis. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp 129–136

  • Chen N, Chen AZ, Zhou LX (2002) An incremental grid density-based clustering algorithm. J Softw 13(1):1–7

    Google Scholar 

  • Cheng Y, Church GM (2000) Biclustering of expression data. Ismb 8:93–103

    Google Scholar 

  • Cheung DW, Han J, Ng VT, Fu AW, Fu Y (1996) A fast distributed algorithm for mining association rules. In: Parallel and distributed information systems, 1996. Fourth International Conference. IEEE, pp 31–42

  • Cheung DW, Xiao Y (1998) Effect of data skewness in parallel mining of association rules. In: Research and development in knowledge discovery and data mining. Springer, Berlin, Heidelberg, pp 48–60

  • Chien BC, Lin ZL, Hong TP (2001) An efficient clustering algorithm for mining fuzzy quantitative association rules. In: IFSA World Congress and 20th NAFIPS International Conference, 2001. Joint 9th, vol 3. IEEE, pp 1306–1311

  • Choudhury A, Nair PB, Keane AJ et al (2002) A data parallel approach for large-scale gaussian process modeling. In: SDM. SIAM, pp 95–111

  • Cisco (2015) Cisco visual networking index: global mobile data traffic forecast update, 2014–2019. Cisco Public Information

  • Croft D, OKelly G, Wu G, Haw R, Gillespie M, Matthews L, Caudy M, Garapati P, Gopinath G, Jassal B et al (2010) Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res, p gkq1018

  • Davidich M, Bornholdt S (2008) Boolean network model predicts cell cycle sequence of fission yeast. PLoS One 3(2):e1672

    Article  Google Scholar 

  • Day A, Carlson MR, Dong J, O’Connor BD, Nelson SF (2007) Celsius: a community resource for Affymetrix microarray data. Genome Biol 8(6):R112

    Article  Google Scholar 

  • Day A, Dong J, Funari VA, Harry B, Strom SP, Cohn DH, Nelson SF (2009) Disease gene characterization through large-scale co-expression analysis. PLoS One 4(12):e8491

    Article  Google Scholar 

  • Dean J, Ghemawat S (2005) Mapreduce: simplified data processing on large clusters. In: OSDI\(\backslash \)’04, pp 137–150

  • Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

    Article  Google Scholar 

  • Divina F, Pontes B, Giráldez R, Aguilar-Ruiz JS (2011) An effective measure for assessing the quality of biclusters. Comput Biol Med 42(2):245–256

    Article  Google Scholar 

  • Jiang D, Pei J, Ramanathan M, Tang C, Zhang A (2004) Mining coherent gene clusters from gene-sample-time microarray data. In: In Proc of the 10 th ACM SIGKDD Conference (KDD’04)

  • Djuric N (2014) Big data algorithms for visualization and supervised learning. Ph.D. thesis, Temple University

  • Duda RO, Hart PE, Stork DG (2012) Pattern classification. Wiley, New York

  • Ecker C, Rocha-Rego V, Johnston P, Mourao-Miranda J, Marquand A, Daly EM, Brammer MJ, Murphy C, Murphy DG, Consortium MA et al (2010) Investigating the predictive value of whole-brain structural mr scans in autism: a pattern classification approach. Neuroimage 49(1):44–56

    Article  Google Scholar 

  • Ekanayake J, Li H, Zhang B, Gunarathne T, Bae SH, Qiu J, Fox G (2010) Twister: a runtime for iterative mapreduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing. ACM, pp 810–818

  • EMBL-European Bioinformatics Institute (2014) EMBL-EBI annual scientific report 2013

  • Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Kdd, vol 96, pp 226–231

  • Faith J, Hayete B, Thaden J, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins J, Gardner T (2007) Large-scale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol 5(1):e8

    Article  Google Scholar 

  • Floridi L (2012) Big data and their epistemological challenge. Philos Technol 25(4):435–437

    Article  Google Scholar 

  • Fogel DB (2006) Evolutionary computation: toward a new philosophy of machine intelligence, vol 1. Wiley, New York

  • Friedman N, Linial M, Nachman I, Pe’er D (2000) Using bayesian networks to analyze expression data. J Comput Biol 7(3–4):601–620

    Article  Google Scholar 

  • Garg A, Mangla A, Gupta N, Bhatnagar V (2006) Pbirch: a scalable parallel clustering algorithm for incremental data. In: Database engineering and applications symposium, 2006. IDEAS’06. 10th International. IEEE, pp 315–316

  • Gershenfeld N, Krikorian R, Cohen D (2004) The internet of things. Sci Am 291(4):76

    Article  Google Scholar 

  • Giveki D, Salimi H, Bahmanyar G, Khademian Y (2012) Automatic detection of diabetes diagnosis using feature weighted support vector machines based on mutual information and modified cuckoo search. arXiv:1201.2173 (preprint)

  • Goecks J, Nekrutenko A, Taylor J et al (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11(8):R86

    Article  Google Scholar 

  • Gropp W, Lusk E, Doss N, Skjellum A (1996) A high-performance, portable implementation of the mpi message passing interface standard. Parallel Comput 22(6):789–828

    Article  MATH  Google Scholar 

  • Grosu P, Townsend JP, Hartl DL, Cavalieri D (2002) Pathway Processor: a tool for integrating whole-genome expression results into metabolic networks. Genome Res 12(7):1121–1126

    Article  Google Scholar 

  • Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: ACM SIGMOD record, vol 27. ACM, pp 73–84

  • Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182

    MATH  Google Scholar 

  • Hall LO, Chawla N, Bowyer KW (1998) Decision tree learning on very large data sets. In: Systems, man, and cybernetics, 1998. 1998 IEEE international conference, vol 3. IEEE, pp 2579–2584

  • Hall MA, Smith LA (1999) Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In: FLAIRS conference, vol 1999, pp 235–239

  • Haller S, Badoud S, Nguyen D, Garibotto V, Lovblad K, Burkhard P (2012) Individual detection of patients with parkinson disease using support vector machine analysis of diffusion tensor imaging data: initial results. Am J Neuroradiol 33(11):2123–2128

    Article  Google Scholar 

  • Han J, Pei J (2000) Mining frequent patterns by pattern-growth: methodology and implications. ACM SIGKDD Explor Newsl 2(2):14–20

    Article  Google Scholar 

  • Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: KDD, vol 98, pp 58–65

  • Hinton G, Deng L, Yu D, Dahl GE, Mohamed AR, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath TN et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. Signal Process Mag IEEE 29(6):82–97

    Article  Google Scholar 

  • Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507

    Article  MathSciNet  MATH  Google Scholar 

  • Hoi SC, Wang J, Zhao P, Jin R (2012) Online feature selection for mining big data. In: Proceedings of the 1st international workshop on big data, streams and heterogeneous source mining: algorithms, systems, programming models and applications. ACM, pp 93–100

  • Höppner F (1999) Fuzzy cluster analysis: methods for classification, data analysis and image recognition

  • Hoque N, Bhattacharyya D, Kalita J (2014) Mifs-nd: a mutual information-based feature selection method. Expert Syst Appl 41(14):6371–6385

    Article  Google Scholar 

  • Houtsma M, Swami A (1995) Set-oriented mining for association rules in relational databases. In: Data engineering, 1995. Proceedings of the Eleventh International Conference. IEEE, pp 25–33

  • Hsieh CJ, Si S, Dhillon IS (2013) A divide-and-conquer solver for kernel support vector machines. arXiv:1311.0914 (preprint)

  • Huang Z (1998) Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining Knowl Discov 2(3):283–304

    Article  Google Scholar 

  • Hubert LJ (1974) Some applications of graph theory to clustering. Psychometrika 39(3):283–309

    Article  MathSciNet  MATH  Google Scholar 

  • Hughes GP (1968) On the mean accuracy of statistical pattern recognizers. Inf Theory IEEE Trans 14(1):55–63

    Article  Google Scholar 

  • Jain A, Zongker D (1997) Feature selection: Evaluation, application, and small sample performance. Pattern Anal Mach Intell IEEE Trans 19(2):153–158

    Article  Google Scholar 

  • Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv (CSUR) 31(3):264–323

    Article  Google Scholar 

  • Janecek A, Gansterer WN, Demel M, Ecker G (2008) On the relationship between feature selection and classification accuracy. In: FSDM, pp 90–105

  • Jiang H, Zhou S, Guan J, Zheng Y (2006) gtricluster: a more general and effective 3d clustering algorithm for gene-sample-time microarray data. In: BioDM’06, pp 48–59

  • Judd D, McKinley PK, Jain, AK (1996) Large-scale parallel data clustering. In: Pattern recognition, 1996. Proceedings of the 13th International Conference, vol 4. IEEE, pp 488–493

  • Kailing K, Kriegel HP, Pryakhin A, Schubert M (2004) Clustering multi-represented objects with noise. In: Advances in knowledge discovery and data mining. Springer, Berlin, Heidelberg, pp 394–403

  • Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27–30

    Article  Google Scholar 

  • Karypis G, Han EH, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. Computer 32(8):68–75

    Article  Google Scholar 

  • Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. North-Holland

  • Kaufman L, Rousseeuw PJ (1990) Finding groups in data. An introduction to cluster analysis. Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics, vol 1. Wiley, New York

  • Kaufman L, Rousseeuw PJ (2009) Finding groups in data: an introduction to cluster analysis. Wiley, Hoboken, NJ

  • Kelley BP, Yuan B, Lewitter F, Sharan R, Stockwell BR, Ideker T (2004) PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res 32(suppl 2):W83–W88

    Article  Google Scholar 

  • Kira K, Rendell LA (1992) A practical approach to feature selection. In: Proceedings of the ninth international workshop on Machine learning, pp 249–256

  • Kluger Y, Basri R, Chang J, Gerstein M (2003) Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res 13(4):703–716

    Article  Google Scholar 

  • Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):1464–1480

    Article  Google Scholar 

  • Kraska T, Talwalkar A, Duchi JC, Griffith R, Franklin MJ, Jordan MI (2013) Mlbase: a distributed machine-learning system. In: CIDR

  • Kriegel HP, Kröger P, Sander J, Zimek A (2011) Density-based clustering. Wiley Interdiscip Rev Data Mining Knowl Discov 1(3):231–240

    Article  Google Scholar 

  • Kumar A, Daumé H (2011) A co-training approach for multi-view spectral clustering. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 393–400

  • Kumar S, Nei M, Dudley J, Tamura K (2008) MEGA: a biologist-centric software for evolutionary analysis of DNA and protein sequences. Brief Bioinf 9(4):299–306

    Article  Google Scholar 

  • Kurtz S (2003) The vmatch large scale sequence analysis software. Ref Type: Computer Program, pp 4–12

  • Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinf 9(1):559

    Article  Google Scholar 

  • Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL (2009) Searching for SNPs with cloud computing. Genome Biol 10(11):R134

    Article  Google Scholar 

  • Langmead B, Trapnell C, Pop M, Salzberg SL et al (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25

    Article  Google Scholar 

  • Lee H, Hsu A, Sajdak J, Qin J, Pavlidis P (2004) Coexpression analysis of human genes across many microarray data sets. Genome Res 14(6):1085–1094

    Article  Google Scholar 

  • Li R, Li Y, Fang X, Yang H, Wang J, Kristiansen K, Wang J (2009) SNP detection for massively parallel whole-genome resequencing. Genome Res 19(6):1124–1132

    Article  Google Scholar 

  • Li X, Fang Z (1989) Parallel clustering algorithms. Parallel Comput 11(3):275–290

    Article  MathSciNet  MATH  Google Scholar 

  • Liang, M., Zhang, F., Jin, G., Zhu, J (2014) FastGCN: a GPU accelerated tool for fast gene co-expression networks. PLoS One 10(1):e0116,776–e0116,776

  • Lin D, Foster DP, Ungar LH (2011) Vif regression: a fast regression algorithm for large data. J Am Stat Assoc 106(493):232–247

    Article  MathSciNet  MATH  Google Scholar 

  • Liu F, Guo W, Fouche JP, Wang Y, Wang W, Ding J, Zeng L, Qiu C, Gong Q, Zhang W et al (2015) Multivariate classification of social anxiety disorder using whole brain functional connectivity. Brain Struct Funct 220(1):101–115

    Article  Google Scholar 

  • Liu F, Guo W, Yu D, Gao Q, Gao K, Xue Z, Du H, Zhang J, Tan C, Liu Z et al (2012) Classification of different therapeutic responses of major depressive disorder with multivariate pattern analysis method based on structural MR scans. PLoS One 7(7):e40968

    Article  Google Scholar 

  • Liu F, Suk HI, Wee CY, Chen H, Shen D (2013) High-order graph matching based feature selection for Alzheimers disease identification. In: Medical image computing and computer-assisted intervention–MICCAI 2013. Springer, Berlin, Heidelberg, pp 311–318

  • Liu F, Wee CY, Chen H, Shen D (2014) Inter-modality relationship constrained multi-modality multi-task feature selection for alzheimer’s disease and mild cognitive impairment identification. NeuroImage 84:466–475

    Article  Google Scholar 

  • Liu F, Xie B, Wang Y, Guo W, Fouche JP, Long Z, Wang W, Chen H, Li M, Duan X et al (2014) Characterization of post-traumatic stress disorder using resting-state fmri with a multi-level parametric classification approach. Brain Topogr 28(2):221–237

    Article  Google Scholar 

  • López M, Still G (2007) Semi-infinite programming. Eur J Oper Res 180(2):491–518

    Article  MathSciNet  MATH  Google Scholar 

  • Low Y, Bickson D, Gonzalez J, Guestrin C, Kyrola A, Hellerstein JM (2012) Distributed graphlab: a framework for machine learning and data mining in the cloud. Proc VLDB Endow 5(8):716–727

    Article  Google Scholar 

  • Low Y, Gonzalez JE, Kyrola A, Bickson D, Guestrin CE, Hellerstein J (2014) Graphlab: a new framework for parallel machine learning. arXiv:1408.2041 (preprint)

  • Luo W, Brouwer C (2013) Pathview: an R/Bioconductor package for pathway-based data integration and visualization. Bioinformatics 29(14):1830–1831

    Article  Google Scholar 

  • Madhamshettiwar PB, Maetschke SR, Davis MJ, Reverter A, Ragan MA (2012) Gene regulatory network inference: evaluation and application to ovarian cancer allows the prioritization of drug targets. Genome Med 4(5):1–16

    Article  Google Scholar 

  • Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

  • Mannila H, Toivonen H (1997) Levelwise search and borders of theories in knowledge discovery. Data Mining Knowl Discov 1(3):241–258

    Article  Google Scholar 

  • Margolin A, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera R, Califano A (2006) Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinf 7(Suppl 1):S7

    Article  Google Scholar 

  • Marx V (2013) Biology: the big challenges of big data. Nature 498(7453):255–260

    Article  Google Scholar 

  • Matsunaga A, Tsugawa M, Fortes J (2008) Cloudblast: combining mapreduce and virtualization on distributed resources for bioinformatics applications. In: eScience, 2008. eScience’08. IEEE fourth international conference. IEEE, pp 222–229

  • McArt DG, Bankhead P, Dunne PD, Salto-Tellez M, Hamilton P, Zhang SD (2013) cudaMap: a GPU accelerated program for gene expression connectivity mapping. BMC Bioinf 14(1):305

    Article  Google Scholar 

  • Meyer P, Kontos K, Lafitte F, Bontempi G (2007) Information-theoretic inference of large transcriptional regulatory networks. EURASIP J Bioinf Syst Biol 2007(1):1–9

    Article  Google Scholar 

  • Mitchell TM (1997) Machine learning, vol 45. McGraw Hill, Burr Ridge

    MATH  Google Scholar 

  • Moens S, Aksehirli E, Goethals B (2013) Frequent itemset mining for big data. In: Big data, 2013 IEEE international conference. IEEE, pp 111–118

  • Mohri M, Rostamizadeh A, Talwalkar A (2012) Foundations of machine learning. MIT Press

  • Mosquera J, Sánchez-Pla A (2008) Serbgo: searching for the best go tool. Nucleic Acids Res 36(suppl 2):W368–W371

    Article  Google Scholar 

  • Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E (2015) Deep learning applications and challenges in big data analytics. J Big Data 2(1):1–21

    Article  Google Scholar 

  • Nei F, Huang Y, Wang X, Huang H (2014) New primal svm solver with linear computational cost for big data classifications. In: Proceedings of the 31st international conference on machine learning, JMLR, pp 1–9

  • Nekrutenko A, Taylor J (2012) Next-generation sequencing data interpretation: enhancing reproducibility and accessibility. Nat Rev Genet 13(9):667–672

    Article  Google Scholar 

  • Nepusz T, Yu H, Paccanaro A (2012) Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods 9(5):471–472

    Article  Google Scholar 

  • Ng RT, Han J (2002) Clarans: a method for clustering objects for spatial data mining. Knowl Data Eng IEEE Trans 14(5):1003–1016

    Article  Google Scholar 

  • Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 689–696

  • Nordberg H, Bhatia K, Wang K, Wang Z (2013) BioPig: a Hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics 29(23):3014–3019

    Article  Google Scholar 

  • O’Leary DE (2013) Artificial intelligence and big data. IEEE Intell Syst 28(2):0096–99

    Article  MathSciNet  Google Scholar 

  • Ordonez C, Omiecinski E (2004) Efficient disk-based k-means clustering for relational databases. Knowl Data Eng IEEE Trans 16(8):909–921

    Article  Google Scholar 

  • Ovsiannikov M, Rus S, Reeves D, Sutter P, Rao S, Kelly J (2013) The quantcast file system. Proc VLDB Endow 6(11):1092–1101

    Article  Google Scholar 

  • Owen S, Anil R, Dunning T, Friedman E (2011) Mahout in action. Manning, Shelter Island, NY

  • Page M, Molina M, Gordon J (2013) The mobile economy 2013. ATKearney [Online]. http://www.atkearney.com/documents/10192/760890/The_Mobile_Economy_2013. pdf. Accessed 09 Feb 2015

  • Pareto V (1964) Cours d’économie politique. Droz, Genève

  • Park BH, Kargupta H (2002) Distributed data mining: algorithms, systems, and applications. In: Data mining handbook, pp 341–358

  • Park JS, Chen MS, Yu PS (1995) An effective hash-based algorithm for mining association rules

  • Park JS, Chen MS, Yu PS (1995) Efficient parallel data mining for association rules. In: Proceedings of the fourth international conference on Information and knowledge management. ACM, pp 31–36

  • Park YS, Schmidt M, Martin ER, Pericak-Vance MA, Chung RH (2013) Pathway-PDT: a flexible pathway analysis tool for nuclear families. BMC Bioinf 14(1):267

    Article  Google Scholar 

  • Phan JH, Young AN, Wang MD (2013) omniBiomarker: a web-based application for knowledge-driven biomarker identification. Biomed Eng IEEE Trans 60(12):3364–3367

    Article  Google Scholar 

  • Pontes B, Giráldez R, Aguilar-Ruiz J (2010) Measuring the quality of shifting and scaling patterns in biclusters. Pattern Recognit Bioinf 6282:242–252

    Article  Google Scholar 

  • Prelić A, Bleuler S, Zimmermann P, Wille A, Bühlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E (2006) A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22(9):1122–1129

    Article  Google Scholar 

  • Raftery AE, Gneiting T, Balabdaoui F, Polakowski M (2005) Using bayesian model averaging to calibrate forecast ensembles. Mon Weather Rev 133(5):1155–1174

    Article  Google Scholar 

  • Rana O, Walker D, Li M, Lynden S, Ward M (2000) Paddmas: parallel and distributed data mining application suite. In: Parallel and distributed processing symposium, 2000. IPDPS 2000. Proceedings. 14th International. IEEE, pp 387–392

  • Reed M, Huang J, Brand R, Graetz I, Neugebauer R, Fireman B, Jaffe M, Ballard DW, Hsu J (2013) Implementation of an outpatient electronic health record and emergency department visits, hospitalizations, and office visits among patients with diabetes. JAMA 310(10):1060–1065

    Article  Google Scholar 

  • Rivera CG, Vakil R, Bader JS (2010) NeMo: network module identification in Cytoscape. BMC Bioinf 11(Suppl 1):S61

    Article  Google Scholar 

  • Robison RJ (2014) How big is the human genome? Precis Med

  • Rojahn SY (2012) Breaking the genome bottleneck. MIT Technol Rev

  • Roy S, Bhattacharyya DK (2008) Opam: an efficient one pass association mining technique without candidate generation. J Convergence Inf Technol 3(3):32–38

    Google Scholar 

  • Roy S, Bhattacharyya DK, Kalita JK (2014) Reconstruction of gene co-expression network from microarray data using local expression patterns. BMC Bioinf 15(Suppl 7):S10

    Article  Google Scholar 

  • Roy S, Bhattacharyya DK, Kalita JK (2015) Analysis of gene expression patterns using biclustering. Methods Mol Biol 1375:91–103. doi:10.1007/7651_2015_280

    Article  Google Scholar 

  • Savasere A, Omiecinski ER, Navathe SB (1995) An efficient algorithm for mining association rules in large databases

  • Schumacher A, Pireddu L, Niemenmaa M, Kallio A, Korpelainen E, Zanetti G, Heljanko K (2014) SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics 30(1):119–120

    Article  Google Scholar 

  • Sheikholeslami G, Chatterjee S, Zhang A (2000) Wavecluster: a wavelet-based clustering approach for spatial data in very large databases. VLDB J 8(3–4):289–304

    Article  Google Scholar 

  • Shi W, Guo YF, Jin C, Xue X (2008) An improved generalized discriminant analysis for large-scale data set. In: Machine learning and applications, 2008. ICMLA’08. Seventh International Conference. IEEE, pp 769–772

  • Shvachko K, Kuang H, Radia S, Chansler R (2010) The hadoop distributed file system. In: Mass storage systems and technologies (MSST), 2010 IEEE 26th Symposium. IEEE, pp 1–10

  • Son YJ, Kim HG, Kim EH, Choi S, Lee SK (2010) Application of support vector machine for prediction of medication adherence in heart failure patients. Healthc Inf Res 16(4):253–259

    Article  Google Scholar 

  • Srikant R, Agrawal R (1996) Mining quantitative association rules in large relational tables. In: ACM SIGMOD record, vol 25. ACM, pp 1–12

  • Stokes TH, Moffitt RA, Phan JH, Wang MD (2007) chip artifact CORRECTion (caCORRECT): a bioinformatics system for quality assurance of genomics and proteomics array data. Ann Biomed Eng 35(6):1068–1080

  • Tan M, Tsang IW, Wang L (2014) Towards ultrahigh dimensional feature selection for big data. J Mach Learn Res 15(1):1371–1429

  • Tan PN, Steinbach M, Kumar V (2006) Data mining cluster analysis: basic concepts and algorithms

  • Tanay A, Sharan R, Kupiec M, Shamir R (2004) Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genome wide data. Proc Natl Acad Sci 101(9):2981–2986

  • Thomas S, Bodagala S, Alsabti K, Ranka S (1997) An efficient algorithm for the incremental updation of association rules in large databases. In: KDD, pp 263–266

  • Thomas SA, Jin Y (2014) Reconstructing biological gene regulatory networks: where optimization meets big data. Evol Intell 7(1):29–47

  • Toivonen H et al (1996) Sampling large databases for association rules. VLDB 96:134–145

  • Tseng GC, Ghosh D, Feingold E (2012) Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic Acids Res 40(9):3785–3799. doi:10.1093/nar/gkr1265

  • Tsiliki G, Vlachakis D, Kossida S (2014) On integrating multi-experiment microarray data. Philos Trans R Soc Lond A Math Phys Eng Sci 372(2016):20130136

  • Turner V, Gantz J, Reinsel D, Minton S (2014) The digital universe of opportunities: rich data and the increasing value of the internet of things. International Data Corporation, White Paper, IDC_1672

  • van Iersel MP, Kelder T, Pico AR, Hanspers K, Coort S, Conklin BR, Evelo C (2008) Presenting and exploring biological pathways with PathVisio. BMC Bioinf 9(1):399

  • Widyantoro DH, Ioerger TR, Yen J (2002) An incremental approach to building a cluster hierarchy. In: Data mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference. IEEE, pp 705–708

  • Wright R, Yang Z (2004) Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 713–718

  • Xu X, Jäger J, Kriegel HP (2002) A fast parallel clustering algorithm for large spatial databases. In: High performance data mining. Springer, US, pp 263–290

  • Yang J, Wang H, Wang W, Yu P (2003) Enhanced biclustering on expression data. In: Proceedings of Third IEEE Symposium on Bioinformatics and Bioengineering, pp 321–327

  • Yang P, Patrick E, Tan SX, Fazakerley DJ, Burchfield J, Gribben C, Prior MJ, James DE, Yang YH (2014) Direction pathway analysis of large-scale proteomics data reveals novel features of the insulin action pathway. Bioinformatics 30(6):808–814

  • Yang WH, Dai DQ, Yan H (2011) Finding correlated biclusters from gene expression data. Knowl Data Eng IEEE Trans 23(4):568–584

  • Ye J, Chow JH, Chen J, Zheng Z (2009) Stochastic gradient boosted distributed decision trees. In: Proceedings of the 18th ACM conference on Information and knowledge management. ACM, pp 2061–2064

  • Yoo C, Ramirez L, Liuzzi J (2014) Big data analysis using modern statistical and machine learning methods in medicine. Int Neurourol J 18(2):50–57

  • Yuasa T, Urakami S, Yamamoto S, Yonese J, Nakano K, Kodaira M, Takahashi S, Hatake K, Inamura K, Ishikawa Y et al (2011) Tumor size is a potential predictor of response to tyrosine kinase inhibitors in renal cell cancer. Urology 77(4):831–835

  • Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, p 2

  • Zambon AC, Gaj S, Ho I, Hanspers K, Vranizan K, Evelo CT, Conklin BR, Pico AR, Salomonis N (2012) GO-Elite: a flexible solution for pathway and ontology over-representation. Bioinformatics 28(16):2209–2210

  • Zeng A, Li T, Liu D, Zhang J, Chen H (2015) A fuzzy rough set approach for incremental feature selection on hybrid information systems. Fuzzy Sets Syst 258:39–60

  • Zeng HJ, Chen Z, Ma WY (2002) A unified framework for clustering heterogeneous web objects. In: Web information systems engineering, 2002. WISE 2002. Proceedings of the third international conference. IEEE, pp 161–170

  • Zhang S, Wu X, Zhang J, Zhang C (2005) A decremental algorithm for maintaining frequent itemsets in dynamic databases. In: Data warehousing and knowledge discovery. Springer, Berlin, Heidelberg, pp 305–314

  • Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: ACM SIGMOD record, vol 25. ACM, pp 103–114

  • Zhao L, Zaki MJ (2005) Tricluster: an effective algorithm for mining coherent clusters in 3D microarray data. ACM, pp 694–705. doi:10.1145/1066157.1066236

  • Zhao S, Prenger K, Smith L (2013) Stormbow: a cloud-based tool for reads mapping and expression quantification in large-scale RNA-Seq studies. ISRN Bioinform 2013:481545

  • Zhao S, Prenger K, Smith L, Messina T, Fan H, Jaeger E, Stephens S (2013) Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing. BMC Genomics 14(1):425

  • Zhao W, Ma H, He Q (2009) Parallel k-means clustering based on MapReduce. In: Cloud computing. Springer, Berlin, Heidelberg, pp 674–679

  • Zhou Z, Chawla N, Jin Y, Williams G (2014) Big data opportunities and challenges: discussions from data analytics perspectives [discussion forum]. Comput Intell Mag IEEE 9(4):62–74

Author information

Corresponding author

Correspondence to Dhruba Kumar Bhattacharyya.

Cite this article

Kashyap, H., Ahmed, H.A., Hoque, N. et al. Big data analytics in bioinformatics: architectures, techniques, tools and issues. Netw Model Anal Health Inform Bioinforma 5, 28 (2016). https://doi.org/10.1007/s13721-016-0135-4

  • DOI: https://doi.org/10.1007/s13721-016-0135-4
