Abstract
We propose and experimentally evaluate several approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. All approaches share a two-step procedure: first, we extract and linearize the structural information of the documents; second, we determine the distance between the documents with similarity measures based on Kolmogorov complexity and Shannon entropy, respectively. In contrast to other approaches, our technique achieves linear run-time complexity, and an experimental evaluation demonstrates that its clustering quality is on a par with, or even better than, that of other, slower approaches.
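The two-step procedure can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the structure is linearized as the preorder sequence of element tags, and it stands in for Kolmogorov complexity (which is not computable) with the size of the zlib-compressed data, combined through the normalized compression distance. The function names `linearize`, `compressed_size`, and `ncd` are hypothetical and chosen only for this example.

```python
import zlib
import xml.etree.ElementTree as ET


def linearize(xml_text: str) -> bytes:
    """Step 1 (assumed form): element tags in preorder, one per line."""
    root = ET.fromstring(xml_text)
    # Element.iter() visits the root and all descendants in document order.
    return "\n".join(elem.tag for elem in root.iter()).encode("utf-8")


def compressed_size(data: bytes) -> int:
    """Compressed length as a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Step 2 (assumed measure): normalized compression distance."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Two documents with the same structure and one with a different structure.
doc_a = "<lib>" + "<book><title/><author/></book>" * 50 + "</lib>"
doc_b = "<lib>" + "<book><title/><author/></book>" * 60 + "</lib>"
doc_c = "<feed>" + "<entry><id/><link/><summary/></entry>" * 50 + "</feed>"

a, b, c = linearize(doc_a), linearize(doc_b), linearize(doc_c)
print("d(a, b) =", ncd(a, b))
print("d(a, c) =", ncd(a, c))
```

In this sketch the structurally similar pair should receive the smaller distance, because the concatenation of two similar tag sequences compresses almost as well as either sequence alone. The choice of compressor and of the linearization are assumptions of the example and would need to be matched to the measures evaluated in the article.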
Cite this article
Helmer, S., Augsten, N. & Böhlen, M. Measuring structural similarity of semistructured data based on information-theoretic approaches. The VLDB Journal 21, 677–702 (2012). https://doi.org/10.1007/s00778-012-0263-0