Genetic subtyping using cluster analysis

Authors:
Tom Burr, Los Alamos National Laboratory, Los Alamos, NM
James R. Gattiker, Los Alamos National Laboratory, Los Alamos, NM
Greggory S. LaBerge, Denver Police Dept. Crime Lab, Denver, CO and University of Colorado

ACM SIGKDD Explorations Newsletter, Volume 3, Issue 1, July 2001, pp. 33–42. https://doi.org/10.1145/507533.507539

Published: 01 July 2001

Abstract

In this paper we (1) describe state-of-the-art methods to identify clusters in DNA sequence data for taxonomic analysis; (2) describe a new method, based on model-based clustering, that has better scaling properties; and (3) present examples using the nucleoprotein and hemagglutinin regions of influenza and the env and gag regions of human immunodeficiency virus (HIV).

Index Terms

Genetic subtyping using cluster analysis
1. Applied computing
  1. Life and medical sciences
2. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis

Index terms have been assigned to the content through auto-classification.

Published in

ACM SIGKDD Explorations Newsletter, Volume 3, Issue 1
July 2001
50 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/507533

Copyright © 2001 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
DNA sequence analysis
HIV
influenza
model-based clustering
phylogenetic trees
Qualifiers
- article