Abstract
Deterministic regular expressions (DREs) are used in many areas of data management. However, to the best of our knowledge, the literature offers no large-scale repository of DREs. In this paper, starting from a large corpus of data harvested from the Web, we build such a repository in two steps: we first collect the expressions that pass a determinism analysis of the real data, and then normalize these DREs to construct a compact repository, called the DRE pattern set. Finally, we use our DRE patterns as benchmark datasets for several algorithms that previously lacked experiments on real DRE data. Experimental results demonstrate the usefulness of the repository.
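The abstract does not state how determinism is analyzed. One standard test, due to Brüggemann-Klein and Wood, declares an expression deterministic (one-unambiguous) exactly when its Glushkov position automaton is deterministic, that is, when no two distinct positions carrying the same symbol compete in the first set or in any follow set. The sketch below illustrates that test under simplifying assumptions: the tuple-based syntax tree and the helper names (`sym`, `cat`, `alt`, `star`, `opt`, `is_deterministic`) are hypothetical, and counting and interleaving operators, which occur in XSD and RELAX NG content models, are not handled.

```python
from itertools import count

# Hypothetical constructors for a tiny regular-expression syntax tree.
def sym(a): return ("sym", a)
def cat(left, right): return ("cat", left, right)
def alt(left, right): return ("alt", left, right)
def star(e): return ("star", e)
def opt(e): return ("opt", e)


def is_deterministic(expr):
    """Glushkov-based determinism test for a standard regular expression."""
    ids = count()
    symbol_of = {}   # position -> alphabet symbol
    follow = {}      # position -> set of positions that may come next

    def walk(e):
        """Return (nullable, first, last) and fill in the follow sets."""
        kind = e[0]
        if kind == "sym":
            p = next(ids)
            symbol_of[p] = e[1]
            follow[p] = set()
            return False, {p}, {p}
        if kind == "cat":
            n1, f1, l1 = walk(e[1])
            n2, f2, l2 = walk(e[2])
            for p in l1:                 # last of left feeds first of right
                follow[p] |= f2
            first = (f1 | f2) if n1 else f1
            last = (l1 | l2) if n2 else l2
            return n1 and n2, first, last
        if kind == "alt":
            n1, f1, l1 = walk(e[1])
            n2, f2, l2 = walk(e[2])
            return n1 or n2, f1 | f2, l1 | l2
        if kind == "star":
            _, f, l = walk(e[1])
            for p in l:                  # the loop goes back to the start
                follow[p] |= f
            return True, f, l
        if kind == "opt":
            _, f, l = walk(e[1])
            return True, f, l
        raise ValueError(f"unknown node kind: {kind!r}")

    _, first, _ = walk(expr)

    def clash(positions):
        """True if two distinct positions carry the same symbol."""
        seen = set()
        for p in positions:
            s = symbol_of[p]
            if s in seen:
                return True
            seen.add(s)
        return False

    return not clash(first) and not any(clash(f) for f in follow.values())


a, b = sym("a"), sym("b")
print(is_deterministic(cat(star(alt(a, b)), a)))                           # (a|b)*a
print(is_deterministic(cat(cat(star(b), a), star(cat(star(b), a)))))      # b*a(b*a)*
```

The two expressions printed at the end denote the same language. The first is the classic non-deterministic form, because reading `a` does not tell a matcher which occurrence of `a` was matched, while the second is a deterministic rewriting of it.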
Work supported by the National Natural Science Foundation of China under Grant Nos. 61872339, 61472405, 61762061 and the Natural Science Foundation of Jiangxi Province, China under Grant 20161ACB20004.
Notes
1. The total is not equal to the sum of the DTD, XSD, RNG and RegExLib counts, because duplicate DREs exist across the different file types.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, H., Li, Y., Dong, C., Chu, X., Mou, X., Min, W. (2019). A Large-Scale Repository of Deterministic Regular Expression Patterns and Its Applications. In: Yang, Q., Zhou, ZH., Gong, Z., Zhang, ML., Huang, SJ. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2019. Lecture Notes in Computer Science, vol 11441. Springer, Cham. https://doi.org/10.1007/978-3-030-16142-2_20
DOI: https://doi.org/10.1007/978-3-030-16142-2_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16141-5
Online ISBN: 978-3-030-16142-2
eBook Packages: Computer Science, Computer Science (R0)