Abstract
Studies on the amino acid sequences, protein structure, and the relationships of amino acids are still a large and challenging problem in biology. Although bioinformatics studies have progressed in solving these problems, the relationship between amino acids and determining the type of protein formed by amino acids are still a problem that has not been fully solved. This problem is why the use of some of the available protein sequences is also limited. This study proposes a hybrid deep learning model to classify amino acid sequences of unknown species using the amino acid sequences in the plant transcription factor database. The model achieved 98.23% success rate in the tests performed. With the hybrid model created, transcription factor proteins in the plant kingdom can be easily classified. The fact that the model is hybrid has made its layers lighter. The training period has decreased, and the success has increased. When tested with a bidirectional LSTM produced with a similar dataset to our dataset and a ResNet-based ProtCNN model, a CNN model, the proposed model was more successful. In addition, we found that the hybrid model we designed by creating vectors with Word2Vec is more successful than other LSTM or CNN-based models. With the model we have prepared, other proteins, especially transcription factor proteins, will be classified, thus enabling species identification to be carried out efficiently and successfully. The use of such a triplet hybrid structure in classifying plant transcription factors stands out as an innovation brought to the literature.
Similar content being viewed by others
Data availability
After the article is published, the data set can be accessed by e-mailing the corresponding author.
References
Acar, N., Gündeğer, E., Selçuki, C.: Protein yapı analizleri. In: Baloğlu, M.C. (ed.) Biyoinformatik Temelleri Ve Uygulamaları, pp. 85–128. Pegem Akademi Yayıncılık, Kastamonu (2018)
Petrey, D., Honig, B.: Is protein classification necessary? towards alternative approaches to function annotation. Curr. Opin. Struct. Biol. 19(3), 363–368 (2009)
Baldi, P., Brunak, S.: Bioinformatics: the machine learning approach. The MIT Press, London (2001)
Eddy, S.R.: Hidden markov models. Curr. Opin. Struct. Biol. 6(3), 361–365 (1996)
Gromiha, M.M.: Chapter 2 - protein sequence analysis. In: Protein Bioinformatics. pp. 29–62. Academic Press, Tokyo (2010)
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local aligment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
Shen, H.-B., Chou, K.-C.: Ezypred: A top-down approach for predicting enzyme functional classes and subclasses. Biochem. Biophys. Res. Commun. 364(1), 53–59 (2007)
Cozzetto, D., Minneci, F., Currant, H., Jones, D.T.: Ffpred 3: feature-based function prediction for all gene ontology domains. Sci. Rep. 6, 1–11 (2016)
Dalkıran, A., Rifaioğlu, A.S., Martin, M.J., Çetin, A.R., Atalay, V., Doğan, T.: Ecpred: a tool for the prediction of the enzymatic functions of protein sequences based on the ec nomenclature. BMC Bioinf. 19, 1–13 (2018)
Gong, Q., Ning, W., Tian, W.: Gofdr: A sequence alignment based method for predicting protein functions. Methods 93(2), 3–14 (2016)
Asgari, E., Mofrad, M.R.K.: Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10(11), 1–15 (2015)
Naveenkumar, K.S., R., M.H.B., Vinayakumar, R., Soman, K.P.: Protein family classification using deep learning. Preprint at https://www.biorxiv.org/content/10.1101/414128v2 (2018)
Strodthoff, N., Wagner, P., Wenzel, M., Samek, W.: Udsmprot: universal deep sequence models for protein classification. Bioinformatics 36(8), 2401–2409 (2020)
Le, N.Q.K., Yapp, E.K.Y., Nagasundaram, N., Chua, M.C.H., Yeh, H.-Y.: Computational identification of vesicular transport proteins from sequences using deep gated recurrent units architecture. Comput. Struct. Biotechnol. J. 17, 1245–1254 (2009)
Li, S., Chen, J., Liu, B.: Protein remote homology detection based on bidirectional long short-term memory. BMC Bioinf. 18, 1–8 (2017)
Bileschi, M.L., Belanger, D., Bryant, D., Sanderson, T., Carter, D.B., Sculley DePristo, M.A., Colwell, L.J.: Using deep learning to annotate the protein universe. Nat. Biotechnol. 40(6), 932–937 (2022)
Rao, R., Bhattacharya, N., Thomas, N., Duan, Y., Chen, X., Canny, J., Abbeel, P., Song, Y.S.: Evaluating protein transfer learning with tape. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019)
Belzen, J.U.Z., Bürgel, T., Holderbach, S., Bubeck, F., Adam, L., Gandor, C., Klein, M., Mathony, J., Pfuderer, P., Platz, L., Przybilla, M., Schwendemann, M., Heid, D., Hoffmann, M.D., Jendrusch, M., Schmelas, C., Waldhauer, M., Lehmann, I., D., N., Eils, R.: The index of general nonlinear DAES. Nat. Mach. Intell. 1, 225–235 (2019)
Torrisi, M., Pollastri, G., Le, Q.: Deep learning methods in protein structure prediction. Comput. Struct. Biotechnol. Jo. 18, 1301–1310 (2020)
Gustafsson, C., Minshull, J., Govindarajan, S., Ness, J., Villalobos, A., Welch, M.: Engineering genes for predictable protein expression. Protein Expr. Purif. 83(1), 37–46 (2012)
Latchman, D.S.: Transcription factors: An overview. Int. J. Biochem. Cell Biol. 29(12), 1305–1312 (1997)
Jin, J., Zhang, H., Kong, L., Gao, G., Luo, J.: Planttfdb 3.0: a portal for the functional and evolutionary study of plant transcription factors. Nucl. Acids Res. 42(D1), 1182–1187 (2014)
Jin, J., Tian, F., Yang, D.-C., Meng, Y.-Q., Kong, L., Luo, J., Gao, G.: Planttfdb 4.0: toward a central hub for transcription factors and regulatory interactions in plants. Nucl. Acids Res. 45(D1), 1040–1045 (2017)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
on Biochemical Nomenclature (CBN), I.-I.C.: A one-letter notation for amino acid sequences tentative rules. European J. Biochem. 7(8), 151–153 (1968)
Ofer, D., Brandes, N., Linial, M.: The language of proteins: Nlp, machine learning & protein sequences. Comput. Struct. Biotechnol. J. 19, 1750–1758 (2021)
Pfam: Family: HLH (PF00010). Available at http://pfam.xfam.org/family/PF00010 (Access date: February 2019)
Schuster-Böckler, B., Schultz, J., Rahmann, S.: Hmm logos for visualization of protein families. BMC Bioinf. 5, 1–8 (2004)
Vries, J.K., Liu, X., Bahar, I.: The relationship between n-gram patterns and protein secondary structure. Proteins 68(4), 830–9838 (2007)
Vries, J.K., Liu, X.: Subfamily specific conservation profiles for proteins based on n-gram patterns. BMC Bioinf. 9, 1–13 (2008)
Greff, K., Srivastava, R.K., Koutnik, J., Steunebrink, B.R., Schmidhuber, J.: Lstm: a search space odyssey. Trans. Neural Netw. Learn. Syst. 28(10), 2222–2232 (2017)
Gao, Y., Glowacka, D.: Deep gate recurrent neural network. In: JMLR: Workshop and Conference Proceedings 63, 350–365 (2016)
Kingma, D.P., Ba, J.L.: ADAM: A Method for Stochastic Optimization. In: Paper presented at International Conference on Learning Representations (ICLR), pp. 7–9 May 2015 (2014)
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
All authors have an equal contribution to the article.
Corresponding author
Ethics declarations
Conflict of interest
The author declares that there is no competing interests related to this paper.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Öncül, A.B., Çelik, Y. A hybrid deep learning model for classification of plant transcription factor proteins. SIViP 17, 2055–2061 (2023). https://doi.org/10.1007/s11760-022-02419-5
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11760-022-02419-5