Skip to main content

Inference of a Concise Regular Expression Considering Interleaving from XML Documents

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 10938))

Abstract

XML schemas are useful in various applications. However, many XML documents in practice are not accompanied by a schema or by a valid schema. Therefore, it is essential to design efficient algorithms for schema learning. Each element in XML schema has its content model defined by a regular expression. Schema learning can be reduced to the inference of restricted regular expressions. In this paper, we focus on learning restricted regular expressions with interleaving from a set of XML documents. The new subclass is named as CHAin Regular Expression with Interleaving (ICHARE). Then based on single occurrence automaton (SOA) and maximum independent set (MIS), we introduce an inference algorithm GenICHARE. The algorithm is proved to infer a descriptive ICHARE from a set of given sample. At last, based on the data set crawled from the Web, we compare the coverage proportion of ICHARE compared with other existing subclasses. Besides, we analyze the conciseness of regular expressions inferred by GenICHARE based on DBLP. Experimental results show that ICHARE is more concise and useful in practice, and the inference algorithm is promising and effective.

H. Chen—Work supported by the National Natural Science Foundation of China under Grant Nos. 61472405.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

  1. 1.

    https://www.oasis-open.org/standards.

  2. 2.

    http://dblp.uni-trier.de/xml/.

  3. 3.

    http://repo1.maven.org/maven2/.

  4. 4.

    http://www.thaiopensource.com/relaxng/trang.html.

  5. 5.

    http://www.xmloperator.net/i2s/.

References

  1. Abiteboul, S., Buneman, P., Suciu, D.: Data on the Web: from Relations to Semistructured Data and XML. Morgan Kaufmann, San Francisco (2000)

    Google Scholar 

  2. Bex, G.J., Gelade, W., Neven, F., Vansummeren, S.: Learning deterministic regular expressions for the inference of schemas from XML data. ACM Trans. Web 4(4), 1–32 (2010)

    Article  Google Scholar 

  3. Bex, G.J., Neven, F., Bussche, J.V.D.: DTDs versus XML schema: a practical study. In: International Workshop on the Web and Databases, pp. 79–84 (2004)

    Google Scholar 

  4. Bex, G.J., Neven, F., Schwentick, T., Tuyls, K.: Inference of concise DTDs from XML data. In: International Conference on Very Large Data Bases, Seoul, Korea, September, pp. 115–126 (2006)

    Google Scholar 

  5. Bex, G.J., Neven, F., Schwentick, T., Vansummeren, S.: Inference of concise regular expressions and DTDs. ACM Trans. Database Syst. 35(2), 1–47 (2010)

    Article  Google Scholar 

  6. Bex, G.J., Neven, F., Vansummeren, S.: Inferring XML schema definitions from XML data. In: International Conference on Very Large Data Bases, University of Vienna, Austria, September, pp. 998–1009 (2007)

    Google Scholar 

  7. Boneva, I., Ciucanu, R., Staworko, S.: Simple schemas for unordered XML. In: International Workshop on the Web and Databases (2015)

    Google Scholar 

  8. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn., pp. 1297–1305 (2001)

    Google Scholar 

  9. Feng, X.Q., Zheng, L.X., Chen, H.M.: Inference Algorithm for a Restricted Class of Regular Expressions, vol. 41. Computer Science (2014)

    Google Scholar 

  10. Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. Theory Comput. Syst. 57(4), 1114–1158 (2015)

    Article  MathSciNet  Google Scholar 

  11. Garcia, P., Vidal, E.: Inference of K-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 12(9), 920–925 (2002)

    Article  Google Scholar 

  12. Garofalakis, M., Gionis, A., Shim, K.: XTRACT: learning document type descriptors from XML document collections. Data Min. Knowl. Discov. 7(1), 23–56 (2003)

    Article  MathSciNet  Google Scholar 

  13. Ghelli, G., Colazzo, D., Sartiani, C.: Efficient inclusion for a class of XML types with interleaving and counting. Inf. Syst. 34(7), 643–656 (2009)

    Article  Google Scholar 

  14. Gold, E.M.: Language Identification in the limit. Inf. Control 10(5), 447–474 (1967)

    Article  MathSciNet  Google Scholar 

  15. Grijzenhout, S., Marx, M.: The quality of the XML web. Web Semant. Sci. Serv. Agents World Wide Web 19, 59–68 (2013)

    Article  Google Scholar 

  16. Koch, C., Scherzinger, S., Schweikardt, N., Stegmaier, B.: Schema-based scheduling of event processors and buffer minimization for queries on structured data streams. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases, vol. 30, pp. 228–239. VLDB Endowment (2004)

    Chapter  Google Scholar 

  17. Li, Y., Zhang, X., Peng, F., Chen, H.: Practical study of subclasses of regular expressions in DTD and XML schema. In: Li, F., Shim, K., Zheng, K., Liu, G. (eds.) APWeb 2016. LNCS, vol. 9932, pp. 368–382. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45817-5_29

    Chapter  Google Scholar 

  18. Manolescu, I., Florescu, D., Kossmann, D.: Answering XML queries on heterogeneous data sources. VLDB 1, 241–250 (2001)

    Google Scholar 

  19. Martens, W., Neven, F.: Typechecking top-down uniform unranked tree transducers. In: Calvanese, D., Lenzerini, M., Motwani, R. (eds.) ICDT 2003. LNCS, vol. 2572, pp. 64–78. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36285-1_5

    Chapter  Google Scholar 

  20. Martens, W., Neven, F., Schwentick, T.: Complexity of decision problems for XML schemas and chain regular expressions. Siam J. Comput. 39(4), 1486–1530 (2009)

    Article  MathSciNet  Google Scholar 

  21. Peng, F., Chen, H.: Discovering restricted regular expressions with interleaving. In: Cheng, R., Cui, B., Zhang, Z., Cai, R., Xu, J. (eds.) APWeb 2015. LNCS, vol. 9313, pp. 104–115. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25255-1_9

    Chapter  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Haiming Chen .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Zhang, X., Li, Y., Cui, F., Dong, C., Chen, H. (2018). Inference of a Concise Regular Expression Considering Interleaving from XML Documents. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science(), vol 10938. Springer, Cham. https://doi.org/10.1007/978-3-319-93037-4_31

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-93037-4_31

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93036-7

  • Online ISBN: 978-3-319-93037-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics