An efficient gene bigdata analysis using machine learning algorithms

Multimedia Tools and Applications

Abstract

Bioinformatics is an emerging and rapidly developing research area that is predominantly used for the analysis and processing of genetic data. It is characterized by large and continually growing volumes of data, which complicate analysis. In most cases, bioinformatics data analysis and processing therefore involve big data analytics. Previous research handled such analytics with traditional tools and conventional big data analytical methods. However, machine learning algorithms can be effectively deployed to perform parallel, distributed, and incremental processing of complex big data, especially gene big data, and thereby improve the efficiency of processing it. This paper presents a machine learning-based Convolutional Neural Network (ML-CNN) approach for identifying potential target genes, predicting miRNAs, visualizing unique miRNA patterns, and validating genomes. The proposed approach was evaluated in MATLAB using the Deep Learning Toolbox on a pre-miRNA dataset. Experimental results indicate that machine learning algorithms increase the efficiency of bioinformatics-based methods for processing gene data in terms of prediction accuracy and reduced processing time. The mean performance of ML-CNN is 7% higher than that of the existing system.
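As a rough illustration of how a convolutional network can consume sequence data such as pre-miRNAs, the sketch below one-hot encodes an RNA string and slides a single hand-written filter over it. It is not the authors' MATLAB implementation: the function names, the toy sequence, and the `GU_KERNEL` filter are all hypothetical, and a trained CNN would learn many such filters rather than use a fixed one.

```python
from typing import List

NUCLEOTIDES = "ACGU"  # channel order assumed for this sketch


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Map each nucleotide to a 4-element indicator vector (A, C, G, U)."""
    encoded = []
    for base in sequence.upper().replace("T", "U"):
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        encoded.append([1 if base == n else 0 for n in NUCLEOTIDES])
    return encoded


def conv1d_valid(encoded: List[List[int]], kernel: List[List[float]]) -> List[float]:
    """Slide a (width x 4) kernel along the sequence with stride 1 and no padding."""
    width = len(kernel)
    outputs = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(len(NUCLEOTIDES)):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        outputs.append(total)
    return outputs


def relu(values: List[float]) -> List[float]:
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


# Hypothetical filter that responds most strongly to the dinucleotide "GU".
GU_KERNEL = [
    [0.0, 0.0, 1.0, 0.0],  # first position: G
    [0.0, 0.0, 0.0, 1.0],  # second position: U
]

toy_sequence = "UGAGGUAGU"  # made-up fragment, not taken from the paper's dataset
features = relu(conv1d_valid(one_hot_encode(toy_sequence), GU_KERNEL))
pooled = max(features)  # global max pooling over the sequence
print(features, pooled)
```

The feature map peaks wherever the window matches the filter exactly, and global max pooling reduces it to a single value per filter, which is the kind of summary a later classification layer would use.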

Acknowledgements

This research is supported by the National Natural Science Foundation of China (Grant: 91746104). The authors would like to thank all the students and teachers for their efforts. We also thank the reviewers and editors for their valuable suggestions and comments, which helped improve this work.

Author information

Corresponding author

Correspondence to Ge Wang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix-1

1.1 List of Abbreviations

A: Adrenal Gland
Ap: Adipose
B: Brain
Bl: Bladder
BM: Bone Marrow
Br: Breast
Ce: Cervix
Co: Colon
DCo: Distal Colon
Du: Duodenum
ES, EBD3, EBD28, E11, E15, E17: Embryonic Stages
E: Esophagus
F: Fallopian Tube
FC: Frontal Cortex
H: Heart
HLS3: Hela3s
I: Intestine
Ile: Ileum
Je: Jejunum
K: Kidney
LAt: Left Atrium
Li: Liver
Lu: Lung
LVe: Left Ventricle
Ly: Lymph Node
O: Ovary
Pa: Pancreas
PBMC: Peripheral Blood Mononuclear Cells
PCo: Proximal Colon
Pe: Pericardium
Pl: Placenta
Pr: Prostate
RAt: Right Atrium
RVe: Right Ventricle
SI: Small Intestine
SM: Skeletal Muscle
Sp: Spleen
St: Stomach
Te: Testicle
Th: Thymus
Tr: Trachea
Ty: Thyroid
U: Uterus
VC: Vena Cava

About this article

Cite this article

Wang, G., Pu, P. & Shen, T. An efficient gene bigdata analysis using machine learning algorithms. Multimed Tools Appl 79, 9847–9870 (2020). https://doi.org/10.1007/s11042-019-08358-7

  • DOI: https://doi.org/10.1007/s11042-019-08358-7
