Measuring structural similarity of semistructured data based on information-theoretic approaches

  • Regular Paper
  • The VLDB Journal

Abstract

We propose and experimentally evaluate several approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. All approaches share a two-step procedure: first, we extract and linearize the structural information of the documents; then, we determine the distance between documents using similarity measures based on Kolmogorov complexity and on Shannon entropy, respectively. In contrast to other approaches, ours achieve linear run-time complexity, and an experimental evaluation demonstrates that their clustering quality is on a par with, or even better than, that of other, slower approaches.
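The two-step procedure can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the function names (`linearize`, `ncd`, `tag_entropy`) are hypothetical, the linearization is a plain pre-order walk over element tags, Kolmogorov complexity is approximated by the output size of a general-purpose compressor (`zlib`) in the form of the normalized compression distance, and the entropy function only computes the empirical Shannon entropy of the tag distribution as a building block, not a complete distance measure.

```python
import math
import zlib
import xml.etree.ElementTree as ET
from collections import Counter


def linearize(xml_text: str) -> list[str]:
    """Step 1: turn the element structure into a flat token sequence.

    Pre-order walk that emits each opening tag and a matching closing
    token; text content and attributes are ignored (an assumption of
    this sketch).
    """
    def walk(node):
        yield node.tag
        for child in node:
            yield from walk(child)
        yield "/" + node.tag

    return list(walk(ET.fromstring(xml_text)))


def compressed_size(tokens: list[str]) -> int:
    """Compressed length in bytes, used as a stand-in for Kolmogorov complexity."""
    return len(zlib.compress(" ".join(tokens).encode("utf-8"), 9))


def ncd(a: list[str], b: list[str]) -> float:
    """Step 2 (compression variant): normalized compression distance."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def tag_entropy(tokens: list[str]) -> float:
    """Empirical Shannon entropy (bits per token) of the token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    doc_a = "<lib><book><title/><author/></book><book><title/></book></lib>"
    doc_b = "<html><body><div><p/><p/></div></body></html>"
    seq_a, seq_b = linearize(doc_a), linearize(doc_b)
    print(seq_a)
    print("NCD(a, a):", ncd(seq_a, seq_a))
    print("NCD(a, b):", ncd(seq_a, seq_b))
    print("entropy(a):", tag_entropy(seq_a))
```

Each step touches every token a bounded number of times, which is in keeping with the linear run-time behavior claimed above; on very short documents the compressor's fixed overhead dominates, so the distances become meaningful only for sequences of reasonable length.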

Author information

Corresponding author

Correspondence to Sven Helmer.

About this article

Cite this article

Helmer, S., Augsten, N. & Böhlen, M. Measuring structural similarity of semistructured data based on information-theoretic approaches. The VLDB Journal 21, 677–702 (2012). https://doi.org/10.1007/s00778-012-0263-0

  • DOI: https://doi.org/10.1007/s00778-012-0263-0
