Inferring a Relax NG Schema from XML Documents

  • Conference paper
Language and Automata Theory and Applications (LATA 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9618)

Abstract

An XML schema specifies the structural properties of the XML documents generated from it and is therefore useful for managing XML data efficiently. In practice, however, XML documents often have no valid schema or come with an incorrect one. This leads us to study the problem of inferring a Relax NG schema from a set of XML documents that are presumably generated from a specific XML schema. Relax NG is an XML schema language developed as a next-generation alternative to schema languages such as document type definitions (DTDs) and XML Schema Definitions (XSDs). Regular hedge grammars define regular tree languages, and the design of Relax NG is closely related to regular hedge grammars. We develop an XML schema inference system using hedge grammars. We employ a genetic algorithm and state elimination heuristics to retrieve a concise Relax NG schema. We present experimental results on a real-world benchmark.
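
To make the inference problem concrete, the following minimal Python sketch shows one plausible kind of input such a system could start from: for every element name, the set of child-name sequences observed across the sample documents. It is an illustration only, not the authors' system. The function name `collect_child_sequences` and the sample documents are hypothetical, and the sketch stops before any generalization step (it uses neither a genetic algorithm nor state elimination).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_child_sequences(documents):
    """Map each element name to the set of child-name sequences seen under it.

    Hypothetical helper for illustration; not taken from the paper.
    """
    sequences = defaultdict(set)
    for text in documents:
        root = ET.fromstring(text)
        # iter() walks the root and all of its descendants.
        for element in root.iter():
            sequences[element.tag].add(tuple(child.tag for child in element))
    return dict(sequences)


# Hypothetical sample documents, assumed to come from one unknown schema.
SAMPLE_DOCUMENTS = [
    "<book><title/><author/><author/></book>",
    "<book><title/><author/></book>",
]

observed = collect_child_sequences(SAMPLE_DOCUMENTS)
for name in sorted(observed):
    for sequence in sorted(observed[name]):
        print(name, "->", " ".join(sequence) or "(empty)")
```

An inference procedure would then have to generalize each set of sequences into a content model, for example one title followed by one or more authors for `book`. Producing a concise expression for that model is the part the abstract attributes to the genetic algorithm and the state elimination heuristics.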

Notes

  1. The Oracle Multi-Schema XML Validator (MSV). https://msv.java.net/.

Author information

Corresponding author

Correspondence to Yo-Sub Han.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Kim, GH., Ko, SK., Han, YS. (2016). Inferring a Relax NG Schema from XML Documents. In: Dediu, AH., Janoušek, J., Martín-Vide, C., Truthe, B. (eds) Language and Automata Theory and Applications. LATA 2016. Lecture Notes in Computer Science, vol 9618. Springer, Cham. https://doi.org/10.1007/978-3-319-30000-9_31

  • DOI: https://doi.org/10.1007/978-3-319-30000-9_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-29999-0

  • Online ISBN: 978-3-319-30000-9

  • eBook Packages: Computer Science (R0)
