Abstract
XML has become the standard format for web publishing and data exchange on the Internet, and much research has been done to provide efficient access to the relevant information that is ubiquitous on the Web. In this paper, we present an algorithm that finds a minimum-cost sequence of top-down edit operations transforming an XML document so that it conforms to a schema. The minimum cost is based on the tree edit distance with top-down edit operations. The algorithm is shown to run in O(p × log p × n) time, where p is the size of the schema (grammar) and n is the size of the XML document (tree).
Experimental studies also show that the running time of our algorithm is linear in the size of the XML document when a normalized regular hedge grammar is used to specify the schema.
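To make the cost measure concrete, the following is a minimal sketch of a top-down tree edit distance between two ordered trees, in the style of Selkow's tree-to-tree formulation. It is not the document-to-schema algorithm of this paper: it compares a document tree against one fixed target tree rather than against a grammar. It also assumes unit costs for relabeling a node and for inserting or deleting each node of a subtree. The tuple encoding of trees and the names size and top_down_distance are illustrative assumptions.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, (child, child, ...)), built from
# nested tuples so that trees are hashable and can be memoized.


def size(tree):
    """Number of nodes in the tree (unit cost to insert or delete it whole)."""
    _, children = tree
    return 1 + sum(size(child) for child in children)


@lru_cache(maxsize=None)
def top_down_distance(a, b):
    """Top-down edit distance between ordered trees a and b.

    Roots are always matched (relabel cost 0 or 1). The child sequences
    are then aligned with a string-edit-style recurrence in which deleting
    or inserting a child removes or adds its entire subtree.
    """
    (label_a, kids_a), (label_b, kids_b) = a, b
    relabel = 0 if label_a == label_b else 1

    m, n = len(kids_a), len(kids_b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])   # delete subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])   # insert subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(kids_a[i - 1]),
                d[i][j - 1] + size(kids_b[j - 1]),
                d[i - 1][j - 1] + top_down_distance(kids_a[i - 1], kids_b[j - 1]),
            )
    return relabel + d[m][n]


# Hypothetical example: the document lacks a "year" element the target has.
doc = ("book", (("title", ()), ("author", ())))
target = ("book", (("title", ()), ("author", ()), ("year", ())))
print(top_down_distance(doc, target))
```

In this sketch the edit script for the example consists of a single leaf insertion. Matching against a schema additionally requires minimizing over all trees the grammar can derive, which is the problem the paper addresses.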
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xing, G. (2006). Fast Approximate Matching Between XML Documents and Schemata. In: Zhou, X., Li, J., Shen, H.T., Kitsuregawa, M., Zhang, Y. (eds) Frontiers of WWW Research and Development - APWeb 2006. APWeb 2006. Lecture Notes in Computer Science, vol 3841. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11610113_38
DOI: https://doi.org/10.1007/11610113_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31142-3
Online ISBN: 978-3-540-32437-9
eBook Packages: Computer Science (R0)