Learning a Subclass of Deterministic Regular Expression with Counting

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11775)

Abstract

In this paper, we propose a subclass of single-occurrence regular expressions with counting (cSOREs) and give a learning algorithm for cSOREs. First, we learn a SORE. Then, we construct a countable finite automaton (CFA) by traversing the syntax tree of the obtained SORE. Next, the CFA runs on the given finite sample to obtain the minimum and maximum number of repetitions of the subexpressions under the iteration operators. Finally, we obtain a cSORE by traversing the syntax tree and introducing the counting operators. Our algorithm can not only learn a cSORE, which is expressive enough to cover more XML data, but also generalizes better from smaller samples.
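The counting step can be illustrated with a deliberately simplified sketch. It does not build the CFA described in the abstract. It only handles a single symbol under iteration, measuring the shortest and longest maximal run of that symbol in the sample to suggest bounds for a counting operator. The function name `run_length_bounds` is hypothetical, and the sample is the one used in the paper's note.

```python
from itertools import groupby


def run_length_bounds(sample, symbol):
    """Return (min, max) length of maximal runs of `symbol` over all words.

    Simplifying assumption: only one symbol is iterated, so no automaton
    is needed; this is not the CFA construction from the paper.
    """
    lengths = [
        sum(1 for _ in group)
        for word in sample
        for key, group in groupby(word)
        if key == symbol
    ]
    if not lengths:
        raise ValueError(f"symbol {symbol!r} does not occur in the sample")
    return min(lengths), max(lengths)


sample = ["ba", "aa", "baabaa"]
low, high = run_length_bounds(sample, "a")
print(f"a^[{low},{high}]")  # prints a^[1,2]
```

For nested iterations such as \((b?a)^{+}\), run lengths of a single symbol are not enough, which is where an automaton with counters becomes necessary.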

This work was supported by the National Natural Science Foundation of China under Grant No. 61472405.

Notes

  1. For instance, suppose the original schema in XSD is denoted by \(r_0=(a|b)^+\) and the given sample is \(S=\{ba,aa,baabaa\}\). The ECsore learnt by InfECsore is \(r_1=(b?a^{[1,2]})^{[1,2]}\). However, a learnt cSORE can be \(r_2=(b?a)^{[1,4]}\), and \(|\mathcal {L}(r_1)|=16<|\mathcal {L}(r_2)|=30\). Note that \(\mathcal {L}(r_0)\supseteq \mathcal {L}(r_2)\supseteq S\) and \(\mathcal {L}(r_0)\supseteq \mathcal {L}(r_1)\supseteq S\).
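The language sizes in this note can be checked by brute-force enumeration. The following sketch expands \(r_1\) and \(r_2\) by hand into finite sets of words. The helper `repeat_language` is a hypothetical name introduced here, not something from the paper.

```python
from itertools import product


def repeat_language(base, low, high):
    """Distinct concatenations of between `low` and `high` words from `base`."""
    return {
        "".join(parts)
        for n in range(low, high + 1)
        for parts in product(base, repeat=n)
    }


# r1 = (b? a^[1,2])^[1,2]
inner_r1 = repeat_language({"a"}, 1, 2)                        # a^[1,2]
block_r1 = {opt + w for opt in ("", "b") for w in inner_r1}    # b? a^[1,2]
lang_r1 = repeat_language(block_r1, 1, 2)

# r2 = (b? a)^[1,4]
lang_r2 = repeat_language({"a", "ba"}, 1, 4)

sample = {"ba", "aa", "baabaa"}
print(len(lang_r1), len(lang_r2))  # prints 16 30

# Both languages contain the sample, and every word is a non-empty
# string over {a, b}, so both are contained in L(r0) = L((a|b)+).
assert sample <= lang_r1 and sample <= lang_r2
assert all(w and set(w) <= {"a", "b"} for w in lang_r1 | lang_r2)
```

Using sets removes duplicate words that arise from different decompositions in \(r_1\) (for example, \(aa\) as one block or as two), which is why \(|\mathcal{L}(r_1)|\) is 16 rather than 20.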

Corresponding author

Correspondence to Xiaofan Wang.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, X., Chen, H. (2019). Learning a Subclass of Deterministic Regular Expression with Counting. In: Douligeris, C., Karagiannis, D., Apostolou, D. (eds) Knowledge Science, Engineering and Management. KSEM 2019. Lecture Notes in Computer Science, vol 11775. Springer, Cham. https://doi.org/10.1007/978-3-030-29551-6_29

  • DOI: https://doi.org/10.1007/978-3-030-29551-6_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-29550-9

  • Online ISBN: 978-3-030-29551-6

  • eBook Packages: Computer Science, Computer Science (R0)
