Abstract
A corrector takes an invalid XML file F as input and, when F is ε-close to its DTD, produces a valid file F′ that is not far from F. Closeness is measured with the classical tree edit distance between a tree T and a language L defined by a DTD or a tree automaton. We show how testers and correctors for regular trees can be used to estimate the distance between a document and each DTD in a set, a useful operation for ranking XML documents.
We describe the implementation of a linear-time corrector using the Xerces parser and present test data for various DTDs, comparing parsing time with correction time. We propose a generalization to homomorphic DTDs.
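The notion of distance between a document and a language, and its use for ranking, can be illustrated with a deliberately simplified sketch. It is not the corrector described in the paper. It makes three assumptions: documents are flattened to sequences of element labels (strings rather than trees), each schema is represented by a small finite list of valid label sequences rather than a DTD, and every insertion, deletion, or relabeling costs one. The function names `edit_distance`, `distance_to_language`, and `rank_schemas` are hypothetical.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost edit distance between two label sequences
    (insert, delete, relabel), computed by dynamic programming."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty sequence
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # relabel x as y, or keep it
            ))
        prev = curr
    return prev[-1]


def distance_to_language(word: Sequence[str],
                         language: List[Sequence[str]]) -> int:
    """Distance from a word to a language: the minimum distance to any
    member. Assumes a finite, non-empty language given explicitly."""
    return min(edit_distance(word, valid) for valid in language)


def rank_schemas(word: Sequence[str],
                 schemas: Dict[str, List[Sequence[str]]]) -> List[str]:
    """Order schema names from closest to farthest from the word."""
    return sorted(schemas,
                  key=lambda name: distance_to_language(word, schemas[name]))


# Hypothetical toy schemas, each listed as its valid label sequences.
schemas = {
    "book": [["title", "author"]],
    "article": [["title", "abstract", "body"]],
}
document = ["title", "author", "author"]
ranking = rank_schemas(document, schemas)
```

In this toy setting the closest valid sequence also plays the role of a correction: it is a valid word at minimum distance from the input. Real DTDs define infinite regular languages over trees, so enumerating members as done here does not scale and serves only to make the definitions concrete.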
Work supported by ACI Sécurité Informatique: VERA of the French Ministry of Research.
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Boobna, U., de Rougemont, M. (2004). Correctors for XML Data. In: Bellahsène, Z., Milo, T., Rys, M., Suciu, D., Unland, R. (eds) Database and XML Technologies. XSym 2004. Lecture Notes in Computer Science, vol 3186. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30081-6_8
DOI: https://doi.org/10.1007/978-3-540-30081-6_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22969-8
Online ISBN: 978-3-540-30081-6
eBook Packages: Springer Book Archive