DOI: 10.1145/3452021.3458325

Spanner Evaluation over SLP-Compressed Documents

Published: 20 June 2021

Abstract

We consider the problem of evaluating regular spanners over compressed documents, i.e., we wish to solve evaluation tasks directly on the compressed data, without decompression. As compressed forms of the documents we use straight-line programs (SLPs), a lossless compression scheme for textual data widely used in different areas of theoretical computer science and particularly well-suited for algorithmics on compressed data. In data complexity, our results are as follows. For a regular spanner $M$ and an SLP $\mathcal{S}$ of size $\mathbf{s}$ that represents a document $D$, we can solve the tasks of model checking and of checking non-emptiness in time $O(\mathbf{s})$. Computing the set $\llbracket M \rrbracket(D)$ of all span-tuples extracted from $D$ can be done in time $O(\mathbf{s} \cdot |\llbracket M \rrbracket(D)|)$, and enumeration of $\llbracket M \rrbracket(D)$ can be done with linear preprocessing $O(\mathbf{s})$ and a delay of $O(\mathrm{depth}(\mathcal{S}))$, where $\mathrm{depth}(\mathcal{S})$ is the depth of $\mathcal{S}$'s derivation tree. Note that $\mathbf{s}$ can be exponentially smaller than the document's size $|D|$; and, due to known balancing results for SLPs, we can always assume that $\mathrm{depth}(\mathcal{S}) = O(\log(|D|))$ independent of $D$'s compressibility. Hence, our enumeration algorithm has a delay logarithmic in the size of the non-compressed data and a preprocessing time that is at best (i.e., in the case of highly compressible documents) also logarithmic, but at worst still linear. Therefore, in a big-data perspective, our enumeration algorithm for SLP-compressed documents may nevertheless beat the known linear preprocessing and constant delay algorithms for non-compressed documents.
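To make the size and depth claims concrete, the following is a minimal Python sketch, not the paper's algorithm. It builds an SLP whose size is linear in n while the represented document has length 2^(n+1), and then decides acceptance by a DFA (a Boolean regular query, i.e., a spanner without variables) with one pass over the SLP rules by composing per-nonterminal state maps, a classical technique for SLP-compressed words. The names `power_slp`, `length_and_depth`, and `dfa_accepts_slp`, the rule encoding, and the convention that terminal rules have depth 0 are all assumptions made for illustration.

```python
# Illustrative sketch only: an SLP is a dict mapping each nonterminal either to a
# single terminal character or to a pair (left, right) of nonterminals.

def power_slp(n):
    """Hypothetical helper: SLP with n + 3 rules deriving "ab" repeated 2**n times."""
    rules = {"A": "a", "B": "b", "X0": ("A", "B")}
    for i in range(1, n + 1):
        rules[f"X{i}"] = (f"X{i-1}", f"X{i-1}")  # each rule doubles the derived word
    return rules, f"X{n}"

def length_and_depth(rules, start):
    """Length of the derived document and depth of the derivation tree,
    computed once per nonterminal (terminal rules count as depth 0 here)."""
    memo = {}
    def visit(nt):
        if nt not in memo:
            rhs = rules[nt]
            if isinstance(rhs, str):
                memo[nt] = (1, 0)
            else:
                (l_len, l_dep), (r_len, r_dep) = visit(rhs[0]), visit(rhs[1])
                memo[nt] = (l_len + r_len, 1 + max(l_dep, r_dep))
        return memo[nt]
    return visit(start)

def dfa_accepts_slp(rules, start, states, delta, q0, accepting):
    """Decide whether a total DFA accepts the document, without decompressing it.
    For each nonterminal we store the map state -> state induced by reading its
    derived word; a binary rule composes the maps of its two children."""
    memo = {}
    def effect(nt):
        if nt not in memo:
            rhs = rules[nt]
            if isinstance(rhs, str):
                memo[nt] = {q: delta[(q, rhs)] for q in states}
            else:
                left, right = effect(rhs[0]), effect(rhs[1])
                memo[nt] = {q: right[left[q]] for q in states}
        return memo[nt]
    return effect(start)[q0] in accepting

# Example DFA over {a, b}: accepts exactly the words containing the factor "ba".
STATES = (0, 1, 2)
DELTA = {(0, "a"): 0, (0, "b"): 1,
         (1, "a"): 2, (1, "b"): 1,
         (2, "a"): 2, (2, "b"): 2}

if __name__ == "__main__":
    rules, start = power_slp(40)
    print(len(rules), length_and_depth(rules, start))
    print(dfa_accepts_slp(rules, start, STATES, DELTA, 0, {2}))
```

For a fixed automaton the work per rule is constant, so the check runs in time linear in the number of rules, which matches the flavour of the $O(\mathbf{s})$ data-complexity bound for non-emptiness stated in the abstract. Enumerating span-tuples with delay bounded by the depth requires the additional machinery developed in the paper and is not attempted here.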

Supplementary Material

MP4 File (PODS 2021 ACM DL video.mp4)
This is a presentation video for the paper "Spanner Evaluation over SLP-Compressed Documents" published at ACM SIGMOD/PODS 2021.

Published In

PODS'21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
June 2021
440 pages
ISBN:9781450383813
DOI:10.1145/3452021

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. algorithmics on compressed data
  2. document spanners
  3. information extraction
  4. straight-line programs

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '21

Acceptance Rates

Overall Acceptance Rate 642 of 2,707 submissions, 24%

Cited By

  • (2024) Enumeration for MSO-Queries on Compressed Trees. Proceedings of the ACM on Management of Data 2:2 (1-17). DOI: 10.1145/3651141. Online publication date: 14-May-2024
  • (2024) Structural and Combinatorial Properties of 2-Swap Word Permutation Graphs. LATIN 2024: Theoretical Informatics (61-76). DOI: 10.1007/978-3-031-55601-2_5. Online publication date: 6-Mar-2024
  • (2023) Enumerating grammar-based extractions. Discrete Applied Mathematics 341:C (372-392). DOI: 10.1016/j.dam.2023.08.014. Online publication date: 31-Dec-2023
  • (2022) Conjunctive Regular Path Queries with Capture Groups. ACM Transactions on Database Systems 47:2 (1-52). DOI: 10.1145/3514230. Online publication date: 23-May-2022
