Learning Restricted Deterministic Regular Expressions with Counting

  • Conference paper
Web Information Systems Engineering – WISE 2019 (WISE 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11881)

Abstract

Regular expressions are widely used in various fields, and learning regular expressions from sequence data remains an active research topic. Since many XML documents are not accompanied by a schema, or by a valid schema, learning regular expressions from XML documents is an essential task. In this paper, we propose a restricted subclass of single-occurrence regular expressions with counting (RCsores) and give a learning algorithm for RCsores. First, we learn a single-occurrence regular expression (SORE). Then, we construct an equivalent countable finite automaton (CFA). Next, the CFA runs on the given finite sample to obtain an updated CFA, which contains the counting operators occurring in an RCsore. Finally, we transform the updated CFA into an RCsore. Moreover, our algorithm ensures that the result is a minimal generalization (such a generalization is called descriptive) of the given finite sample.
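As a rough illustration of what "running on the sample to obtain counting bounds" can mean, the sketch below records the minimum and maximum length of maximal runs of each symbol in a sample. It is a toy stand-in under strong assumptions, not the CFA construction of the paper: it only tracks consecutive repetitions of single symbols and ignores nested subexpressions. The function name `run_length_bounds` is hypothetical; the sample is the one used in Note 2.

```python
from itertools import groupby


def run_length_bounds(sample):
    """Return {symbol: (min_run, max_run)} over all maximal runs in the sample.

    Simplified illustration only: the paper's CFA also counts repetitions
    of whole subexpressions, which this sketch does not attempt.
    """
    bounds = {}
    for word in sample:
        # groupby yields one group per maximal run of equal symbols.
        for symbol, run in groupby(word):
            n = len(list(run))
            lo, hi = bounds.get(symbol, (n, n))
            bounds[symbol] = (min(lo, n), max(hi, n))
    return bounds


sample = ["ba", "aa", "abaa", "aabaa"]
bounds = run_length_bounds(sample)
print(bounds)
```

On this sample the symbol a occurs in runs of length 1 to 2 and b only in runs of length 1, which is the kind of lower and upper bound a counting operator such as \(a^{[1,2]}\) records.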

Work supported by the National Natural Science Foundation of China under Grant Nos. 61872339 and 61472405.

Notes

  1. http://schemas.opengis.net/.

  2. For instance, let the original expression in the XSD be \(r_0=(a|b)^{[1,6]}\). Given the sample \(\{ba,aa,abaa,aabaa\}\), the ECsore learnt by InfECsore is \(r_1=(b?a^{[1,2]})^{[1,2]}\), whereas the learnt RCsore can be \(r_2=(b?a)^{[1,4]}\). Let \(S_1=\{s \mid s\in \mathcal {L}(r_0), s\in \mathcal {L}(r_1)\}\) and \(S_2=\{s \mid s\in \mathcal {L}(r_0), s\in \mathcal {L}(r_2)\}\). Then \(|S_1|=14\) and \(|S_2|=25\). Thus, \(\frac{|S_1|}{|\mathcal {L}(r_0)|}<\frac{|S_2|}{|\mathcal {L}(r_0)|}\).

  3. Let \(S=\{b,abd,ad,cddcdd\}\). The cSORE learnt by InfcSORE is \(r_3=((a?b?|c)d?)^{[1,4]}\); however, there is a cSORE \(r_4=(a?b?|c?(d^{[1,2]})?)^{[1,2]}\) such that \(\mathcal {L}(r_3)\supset \mathcal {L}(r_4)\supseteq S\).

  4. Note that when the CFA \(\mathcal {A}\) runs on \(S\), the direct counting result for \(b\) is \((l(b),u(b))=(1,2)\). However, \((l(b),u(b))\) is subsequently updated by Counting, because \(b\) can also be repeated by using the counting operator \([l(+_2),u(+_2)]=[1,3]\).

  5. http://dblp.org/xml/release/.

  6. http://www.dbis.informatik.uni-goettingen.de/Mondial/#XML.

  7. http://www.cs.toronto.edu/tox/toxgene/.

  8. http://schemas.opengis.net/.
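The containment stated in Note 3 can be checked by brute force, since both languages are finite. The sketch below represents each language as a Python set of strings and builds \(\mathcal {L}(r_3)\) and \(\mathcal {L}(r_4)\) from union, concatenation, option, and bounded repetition. The helper names `concat`, `opt`, and `repeat` are hypothetical and are not taken from the paper.

```python
def concat(A, B):
    """Concatenation of two finite languages."""
    return {x + y for x in A for y in B}


def opt(A):
    """The language of A? (A or the empty word)."""
    return A | {""}


def repeat(A, lo, hi):
    """Union of A^k for lo <= k <= hi (assumes 1 <= lo <= hi)."""
    result, power = set(), {""}
    for k in range(1, hi + 1):
        power = concat(power, A)
        if k >= lo:
            result |= power
    return result


a, b, c, d = {"a"}, {"b"}, {"c"}, {"d"}
S = {"b", "abd", "ad", "cddcdd"}

# r_3 = ((a?b?|c)d?)^{[1,4]}
L3 = repeat(concat(concat(opt(a), opt(b)) | c, opt(d)), 1, 4)
# r_4 = (a?b?|c?(d^{[1,2]})?)^{[1,2]}
L4 = repeat(concat(opt(a), opt(b)) | concat(opt(c), opt(repeat(d, 1, 2))), 1, 2)

# S is covered by r_4, and L(r_4) is a proper subset of L(r_3).
print(S <= L4, L4 < L3)  # prints: True True
```

A word such as ababab witnesses the strict inclusion: it needs three iterations of the form a?b?, which \(r_3\) allows and \(r_4\) does not.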

Author information

Corresponding author

Correspondence to Haiming Chen.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, X., Chen, H. (2019). Learning Restricted Deterministic Regular Expressions with Counting. In: Cheng, R., Mamoulis, N., Sun, Y., Huang, X. (eds) Web Information Systems Engineering – WISE 2019. WISE 2020. Lecture Notes in Computer Science, vol 11881. Springer, Cham. https://doi.org/10.1007/978-3-030-34223-4_7

  • DOI: https://doi.org/10.1007/978-3-030-34223-4_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-34222-7

  • Online ISBN: 978-3-030-34223-4

  • eBook Packages: Computer Science, Computer Science (R0)
