
Abstract

Although large-scale classification studies of genetic sequence data are in progress around the world, very few studies compare different classification approaches (e.g., unsupervised and supervised) in terms of objective criteria such as classification accuracy and computational complexity. In this paper, we study these criteria for both unsupervised and supervised classification of a relatively large sequence data set. The unsupervised approach uses different sequence alignment algorithms (e.g., Smith-Waterman, FASTA, and BLAST) followed by clustering with the Maximin algorithm. The supervised approach uses a numeric encoding (relative frequencies of nucleotide tuples, followed by principal component analysis) that is fed to a multi-layer backpropagation neural network. Classification experiments conducted on IBM SP parallel computers show that FASTA with unsupervised Maximin clustering gives the best trade-off between accuracy and speed among all methods, with supervised neural networks as the second-best approach. Finally, the different classifiers are applied to the problem of cross-species homology detection.
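
The numeric encoding named in the abstract, relative frequencies of nucleotide tuples, can be illustrated with a minimal sketch. The abstract does not give the tuple length, the treatment of ambiguous bases, or the ordering of features, so the function name `tuple_frequencies`, the default `k=2`, the decision to skip windows containing symbols outside A/C/G/T, and the lexicographic feature order are all assumptions made for illustration. The principal component analysis step and the neural network are left out.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide alphabet; ambiguous symbols are ignored


def tuple_frequencies(sequence, k=2):
    """Return the relative frequency of every nucleotide k-tuple.

    The result has 4**k entries, ordered lexicographically (AA, AC, ...).
    Both k=2 and the ordering are illustrative assumptions.
    """
    tuples = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(tuples, 0)
    seq = sequence.upper()
    # Slide a window of width k over the sequence and count each tuple.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing symbols such as N
            counts[window] += 1
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k, or no valid windows at all.
        return [0.0] * len(tuples)
    return [counts[t] / total for t in tuples]
```

Each sequence, whatever its length, maps to a fixed-length vector whose entries sum to 1, which is the kind of input a dimensionality-reduction step and a fixed-size network input layer require.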




About this article

Cite this article

Mukhopadhyay, S., Tang, C., Huang, J. et al. Genetic Sequence Classification and its Application to Cross-Species Homology Detection. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 35, 273–285 (2003). https://doi.org/10.1023/B:VLSI.0000003025.42408.40


  • DOI: https://doi.org/10.1023/B:VLSI.0000003025.42408.40
