Abstract
We tackle the problem of sequence classification using relevant subsequences found in a dataset of labelled protein sequences. A subsequence is relevant if it is frequent and meets a minimum length. For each query sequence, a feature vector is obtained. The features consist of the number and average length of the relevant subsequences shared with each of the protein families. Classification is performed by combining these features in a Bayes classifier. The combination of these characteristics results in a multi-class and multi-domain method that requires neither data transformation nor background knowledge. We illustrate the performance of our method using three collections of protein datasets. The tests showed that the method performs on par with state-of-the-art methods in protein classification.
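The feature construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the mining algorithm or the form of the Bayes classifier, so the sketch assumes contiguous substrings, exhaustive enumeration, and hypothetical names (`mine_relevant`, `family_features`, `feature_vector`, `max_len`) and toy sequences chosen only for illustration. The classifier step is omitted.

```python
from collections import Counter


def mine_relevant(sequences, min_support, min_len, max_len):
    """Return contiguous substrings with length in [min_len, max_len]
    that occur in at least min_support of the given sequences.

    Assumption: "relevant" is read as frequent plus a minimum length;
    max_len is an extra bound added here to keep enumeration small.
    """
    support = Counter()
    for seq in sequences:
        seen = set()  # count each substring once per sequence
        for n in range(min_len, max_len + 1):
            for i in range(len(seq) - n + 1):
                seen.add(seq[i:i + n])
        support.update(seen)
    return {s for s, count in support.items() if count >= min_support}


def family_features(query, patterns):
    """Return (count, average length) of the patterns found in query."""
    shared = [p for p in patterns if p in query]
    if not shared:
        return (0, 0.0)
    return (len(shared), sum(len(p) for p in shared) / len(shared))


def feature_vector(query, patterns_by_family):
    """Map each family name to its (count, average length) pair."""
    return {
        family: family_features(query, patterns)
        for family, patterns in patterns_by_family.items()
    }


# Toy labelled dataset (invented sequences, not real protein families).
families = {
    "A": ["MKVLAT", "GMKVLA", "MKVLGG"],
    "B": ["TTPQRS", "APQRSW", "PQRSTT"],
}
patterns_by_family = {
    name: mine_relevant(seqs, min_support=2, min_len=3, max_len=4)
    for name, seqs in families.items()
}
features = feature_vector("AAMKVLPQ", patterns_by_family)
print(features)
```

In this toy run the query shares three relevant subsequences with family A and none with family B. A classifier would then combine the per-family pairs, and the abstract states only that a Bayes classifier is used for that step.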
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ferreira, P.G., Azevedo, P.J. (2005). Protein Sequence Classification Through Relevant Sequence Mining and Bayes Classifiers. In: Bento, C., Cardoso, A., Dias, G. (eds) Progress in Artificial Intelligence. EPIA 2005. Lecture Notes in Computer Science, vol 3808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11595014_24
DOI: https://doi.org/10.1007/11595014_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30737-2
Online ISBN: 978-3-540-31646-6
eBook Packages: Computer Science (R0)