Abstract
Although large-scale classification studies of genetic sequence data are in progress around the world, very few studies compare different classification approaches, e.g., unsupervised and supervised, in terms of objective criteria such as classification accuracy and computational complexity. In this paper, we study such criteria for both unsupervised and supervised classification of a relatively large sequence data set. The unsupervised approach involves the use of different sequence alignment algorithms (e.g., Smith-Waterman, FASTA and BLAST) followed by clustering using the Maximin algorithm. The supervised approach uses a suitable numeric encoding (relative frequencies of tuples of nucleotides followed by principal component analysis), which is fed to a multi-layer backpropagation neural network. Classification experiments conducted on IBM-SP parallel computers show that FASTA with unsupervised Maximin clustering gives the best trade-off between accuracy and speed among all methods, with supervised neural networks as the second-best approach. Finally, the different classifiers are applied to the problem of cross-species homology detection.
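The numeric encoding mentioned for the supervised approach (relative frequencies of nucleotide tuples) can be illustrated with a minimal sketch. The function name `tuple_frequencies`, the tuple length `k=2`, the use of overlapping tuples, and the handling of non-ACGT symbols are all assumptions made for illustration; the abstract does not specify them. The principal component analysis step and the neural network are not shown.

```python
from collections import Counter
from itertools import product


def tuple_frequencies(seq, k=2, alphabet="ACGT"):
    """Return relative frequencies of overlapping k-tuples in a fixed order.

    Hypothetical helper: the tuple length, overlap, and ordering are
    illustrative choices, not taken from the paper.
    """
    seq = seq.upper()
    # Fixed feature order: AA, AC, AG, AT, CA, ... (4**k entries for DNA).
    tuples = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only tuples made purely of alphabet symbols contribute (skips e.g. 'N').
    total = sum(counts[t] for t in tuples)
    if total == 0:
        return [0.0] * len(tuples)
    return [counts[t] / total for t in tuples]


# Example: encode a short sequence as a 16-dimensional dinucleotide vector.
vec = tuple_frequencies("ACGTACGT", k=2)
```

Each sequence becomes a fixed-length vector regardless of its length, which is what makes it usable as input to a dimensionality reduction step and a fixed-size network input layer.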
Cite this article
Mukhopadhyay, S., Tang, C., Huang, J. et al. Genetic Sequence Classification and its Application to Cross-Species Homology Detection. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 35, 273–285 (2003). https://doi.org/10.1023/B:VLSI.0000003025.42408.40