New Information Distance Measure and Its Application in Question Answering System

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

In a question answering (QA) system, the fundamental problem is how to measure the distance between a question and an answer, and hence how to rank different answers. We demonstrate that such a distance can be defined precisely and mathematically. Not only is such a definition possible, it is provably better than any other feasible definition. Moreover, this definition can be conveniently and fruitfully applied to construct a QA system. We have built such a system, QUANTA. Extensive experiments are conducted to justify the new theory.
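To make the idea of "rank answers by a distance to the question" concrete, the following is a minimal sketch, not the measure defined in this paper and not QUANTA's implementation. It assumes the well-known normalized compression distance, which approximates information distance by replacing Kolmogorov complexity with the output size of a real compressor (here `zlib`). The function names `compressed_size`, `ncd`, and `rank_answers` are hypothetical, chosen for illustration only.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text.

    Assumption: compressed size stands in for Kolmogorov complexity,
    which is itself uncomputable.
    """
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size. Smaller values mean "closer".
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def rank_answers(question: str, candidates: list[str]) -> list[str]:
    """Return the candidates ordered from smallest to largest distance."""
    return sorted(candidates, key=lambda c: ncd(question, c))


if __name__ == "__main__":
    question = "What is the capital of France?"
    candidates = [
        "Zebras graze quietly on dusty plains.",
        "The capital of France is Paris.",
    ]
    for answer in rank_answers(question, candidates):
        print(f"{ncd(question, answer):.3f}  {answer}")
```

This toy version only captures surface overlap between short strings, so it should be read as an illustration of the ranking structure rather than as a usable QA scorer. A real question and its correct answer often share little text, which is part of why a dedicated distance between questions and answers is needed.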

Author information

Correspondence to Ming Li.

Additional information

This work is supported by the National Natural Science Foundation of China under Grant Nos. 60572084 and 60621062.

About this article

Cite this article

Zhang, X., Hao, Y., Zhu, XY. et al. New Information Distance Measure and Its Application in Question Answering System. J. Comput. Sci. Technol. 23, 557–572 (2008). https://doi.org/10.1007/s11390-008-9152-9

  • DOI: https://doi.org/10.1007/s11390-008-9152-9