Abstract
Bioinformatics is a rapidly developing research area used predominantly for the analysis and processing of genetic data. Its data sets are large and continually growing, which complicates analysis, and in most cases their complexity calls for big data analytics. Previous research handled this analysis with traditional tools and conventional big data analytical methods. Machine learning algorithms, however, can be deployed to perform parallel, distributed, and incremental processing of complex big data, and in particular to improve the efficiency of processing large volumes of gene data. This paper presents a Machine Learning algorithm-based Convolutional Neural Network (ML-CNN) approach for identifying potential target genes, predicting miRNAs, visualizing unique miRNA patterns, and validating genomes. The proposed approach was evaluated in MATLAB, using the Deep Learning Toolbox, on the pre-miRNA dataset. Experimental results indicate that machine learning algorithms increase the efficiency of Bioinformatics-based gene data processing in terms of both prediction accuracy and processing time. The mean performance of ML-CNN is 7% higher than that of the existing system.
Acknowledgements
This research is supported by the National Natural Science Foundation of China (Grant: 91746104). The authors would like to thank all the students and teachers for their efforts. We also thank the reviewers and editors for their valuable suggestions and comments, which helped improve this work.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix-1
1.1 List of Abbreviations
- A: Adrenal Gland
- Ap: Adipose
- B: Brain
- Bl: Bladder
- BM: Bone Marrow
- Br: Breast
- Ce: Cervix
- Co: Colon
- DCo: Distal Colon
- Du: Duodenum
- ES, EBD3, EBD28, E11, E15, E17: Embryonic Stages
- E: Esophagus
- F: Fallopian Tube
- FC: Frontal Cortex
- H: Heart
- HLS3: Hela3s
- I: Intestine
- Ile: Ileum
- Je: Jejunum
- K: Kidney
- LAt: Left Atrium
- Li: Liver
- Lu: Lung
- LVe: Left Ventricle
- Ly: Lymph Node
- O: Ovary
- Pa: Pancreas
- PBMC: Peripheral Blood Mononuclear Cells
- PCo: Proximal Colon
- Pe: Pericardium
- Pl: Placenta
- Pr: Prostate
- RAt: Right Atrium
- RVe: Right Ventricle
- SI: Small Intestine
- SM: Skeletal Muscle
- Sp: Spleen
- St: Stomach
- Te: Testicle
- Th: Thymus
- Tr: Trachea
- Ty: Thyroid
- U: Uterus
- VC: Vena Cava
About this article
Cite this article
Wang, G., Pu, P. & Shen, T. An efficient gene bigdata analysis using machine learning algorithms. Multimed Tools Appl 79, 9847–9870 (2020). https://doi.org/10.1007/s11042-019-08358-7
DOI: https://doi.org/10.1007/s11042-019-08358-7