On the use of hierarchical information in sequential mining-based XML document similarity computation

Leung, Ho-pong; Chung, Fu-lai; Chan, Stephen Chi-fai

doi:10.1007/s10115-004-0156-7

Published: 01 May 2005

Knowledge and Information Systems, volume 7, pages 476–498 (2005)

Abstract

Measuring the structural similarity among XML documents is the task of finding their semantic correspondence and is fundamental to many web-based applications. Several methods address the problem; among them, the data mining approach is a novel and promising one. It extracts paths from XML documents, encodes them as sequences and finds the maximal frequent sequences using sequential pattern mining algorithms. Because ignoring the hierarchical information when encoding the paths for mining leads to deficiencies, this paper proposes a new sequential pattern mining scheme for XML document similarity computation. The scheme uses a preorder tree representation (PTR) to encode the XML tree’s paths so that both the semantics of the elements and the hierarchical structure of the document are taken into account when computing the structural similarity among documents. In addition, it proposes a postprocessing step that reuses the mined patterns to estimate the similarity of unmatched elements, so that another metric to quantify the similarity between XML documents can be introduced. Encouraging experimental results were obtained and are reported.
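To make the role of hierarchical information concrete, the following is a minimal sketch and not the paper's PTR encoding, whose details the abstract does not give. It assumes a simple stand-in: each element on a root-to-leaf path is optionally tagged with its depth. The function names `root_to_leaf_paths` and `is_subsequence`, the `with_level` flag and the two sample documents are all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def root_to_leaf_paths(xml_text, with_level=False):
    """Return every root-to-leaf path of an XML document as a tuple of symbols.

    With with_level=False a symbol is just the element tag.
    With with_level=True a symbol is (tag, depth); this depth tag is an
    assumed stand-in for hierarchical information, not the paper's PTR.
    """
    root = ET.fromstring(xml_text)
    paths = []

    def walk(node, prefix, level):
        symbol = (node.tag, level) if with_level else node.tag
        path = prefix + (symbol,)
        children = list(node)
        if not children:
            paths.append(path)  # a leaf closes one path
        for child in children:
            walk(child, path, level + 1)

    walk(root, (), 0)
    return paths


def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order, gaps allowed."""
    remaining = iter(sequence)
    # 'symbol in remaining' advances the iterator past the match,
    # so later symbols must appear after earlier ones.
    return all(symbol in remaining for symbol in pattern)


# Two hypothetical documents: 'title' sits at different depths.
doc_a = "<book><title/><author/></book>"
doc_b = "<book><chapter><title/></chapter><author/></book>"

flat_a = root_to_leaf_paths(doc_a)
flat_b = root_to_leaf_paths(doc_b)
level_a = root_to_leaf_paths(doc_a, with_level=True)
level_b = root_to_leaf_paths(doc_b, with_level=True)

# Flat encoding: book/title looks like a match for book/chapter/title.
flat_match = is_subsequence(flat_a[0], flat_b[0])
# Depth-tagged encoding: the two 'title' elements no longer match.
level_match = is_subsequence(level_a[0], level_b[0])
```

In this sketch the flat encoding reports a match between `book/title` and `book/chapter/title` because sequential patterns allow gaps, whereas the depth-tagged encoding keeps the two `title` elements apart. This is one plausible reading of the deficiency the abstract attributes to ignoring hierarchy; the paper's own encoding may differ.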

Author information

Authors and Affiliations

Department of Computing, Hong Kong Polytechnic University, Hunghom, Kowloon, Hong Kong
Ho-pong Leung, Fu-lai Chung & Stephen Chi-fai Chan

Corresponding author

Correspondence to Fu-lai Chung.

About this article

Cite this article

Leung, Hp., Chung, Fl. & Chan, Sf. On the use of hierarchical information in sequential mining-based XML document similarity computation. Knowl Inf Syst 7, 476–498 (2005). https://doi.org/10.1007/s10115-004-0156-7

Received: 08 February 2003
Revised: 07 October 2003
Accepted: 18 December 2003
Published: 01 May 2005
Issue Date: May 2005
DOI: https://doi.org/10.1007/s10115-004-0156-7
