research-article

Streaming Enumeration on Nested Documents

Published: 12 December 2024

Abstract

Some of the most relevant document schemas used online, such as XML and JSON, have a nested format. In the past decade, the task of extracting data from nested documents over streams has become especially relevant. We focus on the streaming evaluation of queries with outputs of varied sizes over nested documents. We model queries of this kind as Visibly Pushdown Annotators (VPAnn), a computational model that extends visibly pushdown automata with outputs and has the same expressive power as monadic second-order logic over nested documents. Since processing a document through a VPAnn can generate a massive number of results, we are interested in reading the input in a streaming fashion and enumerating the outputs one after another as efficiently as possible, namely, with constant delay. This article presents an algorithm that enumerates these elements with constant delay after processing the document stream in a single pass. Furthermore, we show that this algorithm is worst-case optimal in terms of update-time per symbol and memory usage.
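To make the single-pass, stack-based reading of a nested document concrete, here is a minimal Python sketch. It is not the article's algorithm and does not provide constant-delay enumeration of variable-size outputs. It only illustrates the visibly pushdown discipline the abstract refers to: push on an opening symbol, pop on the matching closing symbol, and emit an annotation when a match is found. The event encoding, the function name `matching_spans`, and the toy query (report the span of every element with a given label) are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stream encoding: ("open", label), ("close", label), or ("text", value).
Event = Tuple[str, str]


def matching_spans(events: Iterable[Event], target: str) -> Iterator[Tuple[int, int]]:
    """Yield (open_pos, close_pos) for every well-nested element labelled `target`."""
    stack: List[Tuple[str, int]] = []  # one entry per currently open element
    for pos, (kind, label) in enumerate(events):
        if kind == "open":
            stack.append((label, pos))  # push on an opening symbol
        elif kind == "close":
            if not stack or stack[-1][0] != label:
                raise ValueError(f"document is not well nested at position {pos}")
            _, open_pos = stack.pop()  # pop on the matching closing symbol
            if label == target:
                yield (open_pos, pos)
        # "text" events leave the stack untouched
    if stack:
        raise ValueError("document ended with unclosed elements")


doc = [
    ("open", "a"), ("open", "b"), ("text", "x"), ("close", "b"),
    ("open", "b"), ("close", "b"), ("close", "a"),
]
spans = list(matching_spans(doc, "b"))
print(spans)
```

In this sketch the stack never holds more entries than the current nesting depth, and each input symbol is handled in constant time. The article's contribution concerns the harder setting in which a query can produce a very large number of outputs that must be stored compactly and enumerated after the pass.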

    Published In

    ACM Transactions on Database Systems  Volume 49, Issue 4
    December 2024
    198 pages
    EISSN:1557-4644
    DOI:10.1145/3613725

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 December 2024
    Online AM: 25 October 2024
    Accepted: 20 August 2024
    Revised: 10 July 2024
    Received: 09 August 2023

    Author Tags

    1. Streaming
    2. nested documents
    3. query evaluation
    4. enumeration algorithms

    Qualifiers

    • Research-article

    Funding Sources

    • ANID Fondecyt Regular
    • ANID—Millennium Science Initiative Program—Code
