Abstract
We report recent results supporting the usefulness of the normalized compression distance for classifying virus genome sequences. Specifically, we study two problems: clustering the hemagglutinin (HA) gene sequences of influenza viruses according to the host and subtype of the virus, and classifying dengue virus genome data with respect to the four dengue serotypes. Hierarchical clustering is compared with spectral clustering via the kLine algorithm by Fischer and Poland (2004), and the standard compressors bzlib, ppmd, and zlib are compared with one another. The results are very promising and show that an (almost) perfect clustering can be obtained for all the problems studied.
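For orientation, the normalized compression distance of Cilibrasi and Vitányi (2005) is commonly written as NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}, where C(z) is the length of z after compression and xy is the concatenation of x and y. The following Python sketch illustrates the formula only; it is not the experimental setup of the paper. The function names are hypothetical, Python's standard zlib and bz2 modules stand in for the compressors named in the abstract (ppmd is not part of the standard library), and the sequences are synthetic stand-ins for real genome data.

```python
import bz2
import random
import zlib


def compressed_size(data: bytes, compressor: str = "zlib") -> int:
    """Return the length in bytes of `data` after compression."""
    if compressor == "zlib":
        return len(zlib.compress(data, 9))
    if compressor == "bz2":
        return len(bz2.compress(data, 9))
    raise ValueError(f"unknown compressor: {compressor}")


def ncd(x: bytes, y: bytes, compressor: str = "zlib") -> float:
    """Approximate the normalized compression distance between x and y."""
    cx = compressed_size(x, compressor)
    cy = compressed_size(y, compressor)
    cxy = compressed_size(x + y, compressor)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs, compressor: str = "zlib"):
    """Pairwise NCD values, usable as input to a clustering method."""
    n = len(seqs)
    return [[ncd(seqs[i], seqs[j], compressor) for j in range(n)]
            for i in range(n)]


# Synthetic data (an assumption for illustration, not real virus sequences):
# `base` is a random nucleotide string, `variant` differs from it in a few
# positions, and `unrelated` is an independent random string.
rng = random.Random(0)
base = bytes(rng.choice(b"ACGT") for _ in range(2000))
mutated = bytearray(base)
for pos in range(0, len(mutated), 100):
    mutated[pos] = ord("C") if mutated[pos] == ord("A") else ord("A")
variant = bytes(mutated)
unrelated = bytes(rng.choice(b"ACGT") for _ in range(2000))

for row in distance_matrix([base, variant, unrelated]):
    print(["%.3f" % d for d in row])
```

In this toy setting the distance between `base` and `variant` should come out clearly smaller than the distance between `base` and `unrelated`. A matrix of this kind is the sort of input that hierarchical or spectral clustering then operates on.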
Supported by MEXT Grant-in-Aid for Scientific Research on Priority Areas under Grant No. 21013001.
References
D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. Phys. Rev. Lett., 88(4):048702-1-048702-4, 2002.
C. H. Bennett, P. Gács, M. Li, P. M. B. Vitányi, and W. H. Zurek. Information distance. IEEE Transactions on Information Theory, 44(4):1407–1423, 1998.
D. S. Burke, G. Kuno, and T. P. Monath. Flaviviruses. In D. M. Knipe and P. M. Howley et al., editors, Fields’ Virology, pages 1153–1252. Lippincott Williams & Wilkins, Philadelphia, fifth edition, 2007.
R. Cilibrasi. The CompLearn Toolkit, 2003-. http://www.complearn.org/.
R. Cilibrasi and P. Vitányi. Automatic meaning discovery using Google. Manuscript, CWI, Amsterdam, 2006.
R. Cilibrasi and P. Vitányi. Similarity of objects and the meaning of words. In Theory and Applications of Models of Computation, Third International Conference, TAMC 2006, Beijing, China, May 2006, Proceedings, volume 3959 of Lecture Notes in Computer Science, pages 21–45, Berlin, 2006. Springer.
R. Cilibrasi and P. M. Vitányi. A new quartet tree heuristic for hierarchical clustering. In D. V. Arnold, T. Jansen, M. D. Vose, and J. E. Rowe, editors, Theory of Evolutionary Algorithms, number 06061 in Dagstuhl Seminar Proceedings. Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany, 2006.
R. Cilibrasi and P. M. B. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005.
I. Fischer and J. Poland. New methods for spectral clustering. Technical Report IDSIA-12-04, IDSIA/USI-SUPSI, Manno, Switzerland, 2004.
S. B. Halstead. Pathogenesis of dengue: Challenges to molecular biology. Science, 239(4839):476–481, 1988.
K. Ito, T. Zeugmann, and Y. Zhu. Clustering the normalized compression distance for influenza virus data. In T. Elomaa, H. Mannila, and P. Orponen, editors, Algorithms and Applications, Essays Dedicated to Esko Ukkonen on the Occasion of His 60th Birthday, volume 6060 of Lecture Notes in Computer Science, pages 130–146. Springer, Heidelberg, 2010.
E. Keogh, S. Lonardi, and C. A. Ratanamahatana. Towards parameter-free data mining. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 206–215. ACM Press, 2004.
M. Li, X. Chen, X. Li, B. Ma, and P. M. Vitányi. The similarity metric. IEEE Transactions on Information Theory, 50(12):3250–3264, 2004.
M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer, 3rd edition, 2008.
National Center for Biotechnology Information. Influenza Virus Resource, information, search and analysis. http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html.
P. Palese and M. L. Shaw. Orthomyxoviridae: The viruses and their replication. In D. M. Knipe and P. M. Howley et al., editors, Fields’ Virology, pages 1647–1689. Lippincott Williams & Wilkins, Philadelphia, fifth edition, 2007.
P. M. B. Vitányi, F. J. Balbach, R. L. Cilibrasi, and M. Li. Normalized information distance. In Information Theory and Statistical Learning, pages 45–82. Springer, New York, 2008.
P. F. Wright, G. Neumann, and Y. Kawaoka. Orthomyxoviruses. In D. M. Knipe and P. M. Howley et al., editors, Fields’ Virology, pages 1691–1740. Lippincott Williams & Wilkins, Philadelphia, fifth edition, 2007.
Copyright information
© 2011 Springer Science+Business Media B.V.
About this paper
Cite this paper
Ito, K., Zeugmann, T., Zhu, Y. (2011). Recent Experiences in Parameter-Free Data Mining. In: Gelenbe, E., Lent, R., Sakellari, G., Sacan, A., Toroslu, H., Yazici, A. (eds) Computer and Information Sciences. Lecture Notes in Electrical Engineering, vol 62. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-9794-1_68
DOI: https://doi.org/10.1007/978-90-481-9794-1_68
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-9793-4
Online ISBN: 978-90-481-9794-1
eBook Packages: Engineering, Engineering (R0)