Abstract
Consider, in the most general sense, the space of all information carrying objects: a book, an article, a name, a definition, a genome, a letter, an image, an email, a webpage, a Google query, an answer, a movie, a music score, a Facebook blog, a short message, or even an abstract concept. Over the past 20 years, we have been developing a general theory of information distance in this space and applications of this theory. The theory is object-independent and application-independent. The theory is also unique, in the sense that no other theory is “better”. During the past 10 years, such a theory has found many applications. Recently we have introduced two extensions to this theory concerning multiple objects and irrelevant information. This expository article will focus on explaining the main ideas behind this theory, especially these recent extensions, and their applications. We will also discuss some very preliminary applications.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, M. (2011). Information Distance and Its Extensions. In: Elomaa, T., Hollmén, J., Mannila, H. (eds) Discovery Science. DS 2011. Lecture Notes in Computer Science, vol 6926. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24477-3_3
DOI: https://doi.org/10.1007/978-3-642-24477-3_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24476-6
Online ISBN: 978-3-642-24477-3
eBook Packages: Computer Science, Computer Science (R0)