Measuring structural similarity of semistructured data based on information-theoretic approaches

Helmer, Sven; Augsten, Nikolaus; Böhlen, Michael

doi:10.1007/s00778-012-0263-0

Regular Paper
Published: 08 February 2012

Volume 21, pages 677–702, (2012)
The VLDB Journal

Sven Helmer¹,
Nikolaus Augsten² &
Michael Böhlen³

Abstract

We propose and experimentally evaluate different approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. Common to all approaches is a two-step procedure: first, we extract and linearize the structural information from documents, and then, we use similarity measures that are based on, respectively, Kolmogorov complexity and Shannon entropy to determine the distance between the documents. Compared to other approaches, we are able to achieve a linear run-time complexity and demonstrate in an experimental evaluation that the results of our technique in terms of clustering quality are on a par with or even better than those of other, slower approaches.
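
The two-step procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linearization here simply lists element names in document order, and zlib's compressed length stands in as a computable proxy for Kolmogorov complexity, combined via the normalized compression distance. The function names and the toy documents are assumptions made for illustration only.

```python
import zlib
import xml.etree.ElementTree as ET


def linearize(xml_text: str) -> bytes:
    """Step 1 (illustrative): keep only element names, in document order."""
    root = ET.fromstring(xml_text)
    return " ".join(elem.tag for elem in root.iter()).encode("utf-8")


def compressed_size(data: bytes) -> int:
    """Compressed length in bytes, a computable proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Step 2 (illustrative): normalized compression distance of two sequences."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical toy documents: two library-like structures, one page-like structure.
doc_a = "<lib>" + "<book><title/><author/></book>" * 50 + "</lib>"
doc_b = "<lib>" + "<book><author/><title/><year/></book>" * 50 + "</lib>"
doc_c = "<html><body>" + "<div><p/><span/></div>" * 50 + "</body></html>"

seq_a, seq_b, seq_c = (linearize(d) for d in (doc_a, doc_b, doc_c))

print(f"d(a, a) = {ncd(seq_a, seq_a):.3f}")
print(f"d(a, b) = {ncd(seq_a, seq_b):.3f}")
print(f"d(a, c) = {ncd(seq_a, seq_c):.3f}")
```

Smaller values indicate more shared structure. The resulting pairwise distances could then be passed to any standard clustering routine, which is the kind of evaluation the abstract refers to.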

Author information

Authors and Affiliations

Birkbeck, University of London, Malet Street, London, WC1E 7HX, UK
Sven Helmer
Free University of Bozen-Bolzano, Dominikanerplatz 3, 39100, Bozen-Bolzano, Italy
Nikolaus Augsten
University of Zurich, Binzmühlestrasse 14, 8050, Zurich, Switzerland
Michael Böhlen

Corresponding author

Correspondence to Sven Helmer.

About this article

Cite this article

Helmer, S., Augsten, N. & Böhlen, M. Measuring structural similarity of semistructured data based on information-theoretic approaches. The VLDB Journal 21, 677–702 (2012). https://doi.org/10.1007/s00778-012-0263-0

Received: 13 June 2011
Revised: 13 December 2011
Accepted: 16 January 2012
Published: 08 February 2012
Issue Date: October 2012
DOI: https://doi.org/10.1007/s00778-012-0263-0
