Abstract
In a question answering (QA) system, the fundamental problem is how to measure the distance between a question and an answer, and hence how to rank different answers. We demonstrate that such a distance can be defined precisely and mathematically. Not only is such a definition possible, it is provably better than any other feasible definition. Moreover, this ultimate definition can be applied conveniently and fruitfully to construct a QA system. We have built such a system, QUANTA. Extensive experiments are conducted to justify the new theory.
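The abstract does not state the definition itself, so the following is only a rough illustration of the general idea of ranking answers by a distance to the question. It uses normalized compression distance (NCD), a well-known computable stand-in for information distance, with zlib as the compressor. This is an assumption made for illustration and is not the measure defined in this paper or the method used by QUANTA; the function names `compressed_size`, `ncd`, and `rank_answers` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a crude proxy for complexity)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size. Smaller values mean "closer".
    This is an illustrative proxy, not the paper's distance.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def rank_answers(question: str, candidates: list[str]) -> list[str]:
    """Return the candidates sorted from smallest to largest distance to the question."""
    return sorted(candidates, key=lambda answer: ncd(question, answer))


if __name__ == "__main__":
    question = ("The capital of France is Paris, a city on the Seine river "
                "known for the Eiffel Tower.")
    candidates = [
        "Quantum chromodynamics describes gluon exchange between quarks "
        "inside hadrons at high energy.",
        question,  # an exact copy should rank closest
    ]
    for answer in rank_answers(question, candidates):
        print(f"{ncd(question, answer):.3f}  {answer}")
```

On very short strings a general-purpose compressor is noisy, so a sketch like this only separates texts reliably when they are reasonably long.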
Additional information
This work is supported by the National Natural Science Foundation of China under Grant Nos. 60572084 and 60621062.
Cite this article
Zhang, X., Hao, Y., Zhu, XY. et al. New Information Distance Measure and Its Application in Question Answering System. J. Comput. Sci. Technol. 23, 557–572 (2008). https://doi.org/10.1007/s11390-008-9152-9
DOI: https://doi.org/10.1007/s11390-008-9152-9