Abstract
In this paper we (1) describe state-of-the-art methods to identify clusters in DNA sequence data for taxonomic analysis; (2) describe a new method, based on model-based clustering, with better scaling properties; and (3) present examples using the nucleoprotein and hemagglutinin regions of influenza and the env and gag regions of human immunodeficiency virus (HIV).
Index Terms
- Genetic subtyping using cluster analysis