Genetic subtyping using cluster analysis

Published: 01 July 2001

Abstract

In this paper we (1) describe state-of-the-art methods for identifying clusters in DNA sequence data for taxonomic analysis; (2) describe a new method, based on model-based clustering, that has better scaling properties; and (3) present examples using the nucleoprotein and hemagglutinin regions of influenza and the env and gag regions of human immunodeficiency virus (HIV).
