Abstract
An XML schema specifies the structural properties of the XML documents generated from it and is therefore useful for managing XML data efficiently. In practice, however, XML documents often lack a valid schema or come with an incorrect one. This leads us to study the problem of inferring a Relax NG schema from a set of XML documents that are presumably generated from a specific XML schema. Relax NG is an XML schema language developed as a next-generation alternative to schema languages such as document type definitions (DTDs) and XML Schema Definitions (XSDs). Regular hedge grammars accept regular tree languages, and the design of Relax NG is closely related to regular hedge grammars. We develop an XML schema inference system using hedge grammars. We employ a genetic algorithm and state elimination heuristics in the process of retrieving a concise Relax NG schema. We present experimental results on real-world benchmark data.
Notes
1. The Oracle Multi-Schema XML Validator (MSV). https://msv.java.net/
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Kim, GH., Ko, SK., Han, YS. (2016). Inferring a Relax NG Schema from XML Documents. In: Dediu, AH., Janoušek, J., Martín-Vide, C., Truthe, B. (eds) Language and Automata Theory and Applications. LATA 2016. Lecture Notes in Computer Science, vol 9618. Springer, Cham. https://doi.org/10.1007/978-3-319-30000-9_31
DOI: https://doi.org/10.1007/978-3-319-30000-9_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29999-0
Online ISBN: 978-3-319-30000-9
eBook Packages: Computer Science, Computer Science (R0)