Splicing sites prediction of human genome using machine learning techniques

Ullah, Waseem; Muhammad, Khan; Ul Haq, Ijaz; Ullah, Amin; Ullah Khattak, Saeed; Sajjad, Muhammad

doi:10.1007/s11042-021-10619-3

1155T: Advanced machine learning algorithms for biomedical data and imaging
Published: 20 May 2021

Volume 80, pages 30439–30460, (2021)

Multimedia Tools and Applications

Waseem Ullah¹,
Khan Muhammad² (ORCID: orcid.org/0000-0003-4055-7412),
Ijaz Ul Haq¹,
Amin Ullah¹,
Saeed Ullah Khattak³ &
Muhammad Sajjad⁴

Abstract

Accurate splice site prediction has several applications in medical science and biochemistry. For instance, mutations affecting a splice site can lead to genetic diseases and cancers such as Lynch syndrome and breast cancer. Collecting ribonucleic acid (RNA) samples is an efficient and convenient way to detect the involvement of splicing defects in disease formation. The present study therefore aims to develop an accurate and robust Computer-Aided Diagnosis (CAD) method for swift and precise targeting of splice site sequences. DNA sequences are first converted into numerical descriptors using three sample representation methods: Dinucleotide Composition (DNC), Trinucleotide Composition (TNC) and Tetranucleotide Composition (TetraNC). These descriptors are then integrated into a composite feature-based model for splice site prediction. The precision and accuracy of the features are analyzed with different machine learning algorithms such as Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and Naïve Bayes (NB). Results show that the proposed composite feature vector with an SVM classifier achieved an accuracy of 95.20% on the donor site dataset and 97.50% on the acceptor site dataset.
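To make the feature construction concrete, the following is a minimal sketch of how DNC, TNC and TetraNC descriptors might be computed and concatenated. The abstract does not give the exact formulation, so several choices here are assumptions: overlapping sliding windows, frequencies normalized by the number of valid windows, lexicographic k-mer ordering, and plain concatenation of the three blocks. The function names `kmer_composition` and `composite_features` and the example sequence are hypothetical and do not come from the paper.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_composition(seq, k):
    """Return normalized frequencies of all 4**k k-mers in lexicographic order.

    Assumption: overlapping windows, normalized by the number of valid windows.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def composite_features(seq):
    """Concatenate DNC (k=2), TNC (k=3) and TetraNC (k=4) blocks.

    The result has 16 + 64 + 256 = 336 values.
    """
    features = []
    for k in (2, 3, 4):
        features.extend(kmer_composition(seq, k))
    return features


# Hypothetical example sequence, not taken from the paper's datasets.
example = "CAGGTAAGTACGT"
vector = composite_features(example)
print(len(vector))
```

A vector of this kind, computed per sequence, could then be passed to any standard classifier such as an SVM, KNN or Naïve Bayes implementation. The classifier settings used in the study are not stated in the abstract.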

Author information

Authors and Affiliations

Intelligent Media Laboratory, Sejong University, Seoul, 143-747, Republic of Korea
Waseem Ullah, Ijaz Ul Haq & Amin Ullah
Visual Analytics for Knowledge Laboratory, Department of Software, Sejong University, Seoul, 143-747, Republic of Korea
Khan Muhammad
Centre of Biotechnology and Microbiology, University of Peshawar, Peshawar, Pakistan
Saeed Ullah Khattak
Department of Computer Science, Islamia College Peshawar, Peshawar, Pakistan
Muhammad Sajjad

Corresponding authors

Correspondence to Khan Muhammad or Muhammad Sajjad.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ullah, W., Muhammad, K., Ul Haq, I. et al. Splicing sites prediction of human genome using machine learning techniques. Multimed Tools Appl 80, 30439–30460 (2021). https://doi.org/10.1007/s11042-021-10619-3

Received: 21 January 2020
Revised: 29 November 2020
Accepted: 29 January 2021
Published: 20 May 2021
Issue Date: August 2021
DOI: https://doi.org/10.1007/s11042-021-10619-3
