
Learning k-Occurrence Regular Expressions with Interleaving

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11447)

Abstract

The lack of valid schemas is a critical problem for XML, and existing research on interleaving for XML is insufficient; this paper therefore focuses on the inference of XML schemas with interleaving. Previous research has shown that the essential task in schema learning is inferring regular expressions from a set of given samples. At present, the most powerful model for learning XML schemas is the class of k-occurrence regular expressions (k-OREs for short). However, no existing algorithm can learn k-OREs with interleaving. We therefore propose a complete framework that supports both k-OREs and interleaving. To the best of our knowledge, our work is the first to address these two inference problems at the same time. We first define a new subclass of regular expressions named k-OIREs, and develop an inference algorithm, iKOIRE, that learns k-OIREs based on a genetic algorithm and maximum independent set (MIS). We then conduct a series of experiments on large-scale real datasets and evaluate the effectiveness of our work against both current learning algorithms from academia and industrial tools used in practice. The results demonstrate the high practicability and strong performance of our work, and indicate its promise for applications.
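
As an informal illustration of the two notions named in the abstract, the sketch below uses the standard textbook definitions rather than anything taken from the paper's algorithm: the interleaving of two words is the set of all words that merge their symbols while preserving the order within each word, and an expression is k-occurrence when every alphabet symbol appears in it at most k times. The function names `interleave` and `max_symbol_occurrences`, and the representation of an expression as a plain list of its symbol tokens, are assumptions made only for this example.

```python
from collections import Counter
from functools import lru_cache


@lru_cache(maxsize=None)
def interleave(u: str, v: str) -> frozenset:
    """Return all interleavings (shuffles) of the words u and v.

    Standard recursive definition: u & "" = {u}, "" & v = {v},
    and (a u') & (b v') = a(u' & b v') | b(a u' & v').
    """
    if not u:
        return frozenset({v})
    if not v:
        return frozenset({u})
    take_from_u = {u[0] + w for w in interleave(u[1:], v)}
    take_from_v = {v[0] + w for w in interleave(u, v[1:])}
    return frozenset(take_from_u | take_from_v)


def max_symbol_occurrences(symbols) -> int:
    """Smallest k for which an expression is k-occurrence.

    `symbols` is a hypothetical flat list of the alphabet symbols
    appearing in the expression (operators already stripped).
    """
    counts = Counter(symbols)
    return max(counts.values(), default=0)


if __name__ == "__main__":
    # Interleaving "ab" with "c" keeps a before b; c may go anywhere.
    print(sorted(interleave("ab", "c")))
    # The symbols of an expression such as a b a? c: 'a' occurs twice.
    print(max_symbol_occurrences(["a", "b", "a", "c"]))
```

The number of interleavings grows combinatorially with word length, so this enumeration is only meant for reading small examples, not for schema inference.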

Work supported by the National Natural Science Foundation of China under Grant Nos. 61872339 and 61472405.

Notes

  1. http://dblp.org/xml/release/dblp-2015-03-02.xml.gz

Author information

Corresponding author

Correspondence to Haiming Chen.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Y., Zhang, X., Cao, J., Chen, H., Gao, C. (2019). Learning k-Occurrence Regular Expressions with Interleaving. In: Li, G., Yang, J., Gama, J., Natwichai, J., Tong, Y. (eds) Database Systems for Advanced Applications. DASFAA 2019. Lecture Notes in Computer Science, vol 11447. Springer, Cham. https://doi.org/10.1007/978-3-030-18579-4_5

  • DOI: https://doi.org/10.1007/978-3-030-18579-4_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-18578-7

  • Online ISBN: 978-3-030-18579-4

  • eBook Packages: Computer Science, Computer Science (R0)
