Optimal and efficient generalized twig pattern processing: a combination of preorder and postorder filterings

Bača, Radim; Krátký, Michal; Ling, Tok Wang; Lu, Jiaheng

doi:10.1007/s00778-012-0295-5

Regular Paper
Published: 10 October 2012

Volume 22, pages 369–393, (2013)
The VLDB Journal

Radim Bača¹,
Michal Krátký¹,
Tok Wang Ling² &
…
Jiaheng Lu³

Abstract

Searching for occurrences of a twig pattern query (TPQ) in an XML document is a core task of all XML database query languages. The generalized twig pattern (GTP) extends the TPQ model to include semantics related to output nodes, optional nodes, and boolean expressions, which are part of the XQuery language. Preorder filtering holistic algorithms such as TwigStack represent a significant class of TPQ processing approaches with a linear worst-case I/O complexity with respect to the sum of the input and output sizes for some query classes. Another important class of holistic approaches is represented by postorder filtering holistic algorithms such as Twig\(^2\)Stack, which introduced a linear output enumeration time with respect to the result size. In this article, we introduce a holistic algorithm called GTPStack, which is the first approach capable of processing a GTP with a linear worst-case I/O complexity with respect to the GTP result size. This is achieved by combining preorder and postorder filtering before nodes are stored in an intermediate storage. A further contribution of this article is a new perspective on holistic algorithm optimality. We show that optimality depends not only on the query class but also on XML document characteristics. This view extends the general knowledge about the types of queries for which holistic algorithms are optimal. Moreover, it allows us to determine that GTPStack is optimal for any GTP when a specific XML document is considered. We present a comprehensive experimental study of the state-of-the-art holistic algorithms showing under which conditions GTPStack outperforms the other holistic approaches.

References

Al-Khalifa, S., Jagadish, H.V., Koudas, N., Patel, J.M., Srivastava, D., Wu, Y.: Structural joins: a primitive for efficient XML query pattern matching. In: Proceedings of ICDE 2002, pp. 141–152. IEEE CS (2002)
Bača, R., Krátký, M.: On the Efficiency of a prefix path holistic algorithm. In: Proceedings of Database and XML Technologies, XSym 2009, vol. LNCS 5679, pp. 25–32. Springer (2009)
Bača, R., Krátký, M., Snášel, V.: On the efficient search of an XML twig query in large dataGuide trees. In: Proceedings of the Twelfth International Database Engineering & Applications Symposium, IDEAS 2008, pp. 149–158. ACM Press (2008)
Bača, R., Walder, J., Pawlas, M., Krátký, M.: Benchmarking the compression of XML node streams. In: Database Systems for Advanced Applications: 15th International Conference, DASFAA 2010, International Workshops, vol. 6193, pp. 179–190. Springer (2010)
Brantner, M., Helmer, S., Kanne, C.-C., Moerkotte, G.: Full-fledged algebraic XPath processing in Natix. In: Proceedings of Data Engineering, 2005. ICDE 2005, pp. 705–716. IEEE (2005)
Bruno, N., Srivastava, D., Koudas, N.: Holistic twig joins: optimal XML pattern matching. In: Proceedings of ACM SIGMOD 2002, pp. 310–321. ACM Press (2002)
Che, D., Ling, T.W., Hou, W.-C.: Holistic boolean-twig pattern matching for efficient XML query processing. IEEE Trans. Knowl. Data Eng. 99, 2008–2024
Chen, S., Li, H.-G., Tatemura, J., Hsiung, W.-P., Agrawal, D., Candan, K.S.: Twig2Stack: bottom-up processing of generalized-tree-pattern queries over XML documents. In: Proceedings of VLDB 2006, pp. 283–294 (2006)
Chen, T., Lu, J., Ling, T.W.: On boosting holism in XML twig pattern matching using structural indexing techniques. In: Proceedings of ACM SIGMOD 2005, pp. 455–466. ACM Press (2005)
Chen, Z., Jagadish, H.V., Lakshmanan, L.V.S., Paparizos, S.: From tree patterns to generalized tree patterns: on efficient evaluation of XQuery. In: Proceedings of the 29th International Conference on Very Large Data Bases, VLDB 2003, pp. 237–248 (2003)
Cooper, B., Sample, N., Franklin, M.J., Hjaltason, G.R., Shadmon, M.: A fast index for semistructured data. In: Proceedings of VLDB 2001, pp. 341–350 (2001)
Dietz, P.F.: Maintaining order in a linked list. In: Proceedings of 14th annual ACM symposium on theory of computing (STOC 1982), pp. 122–127 (1982)
Fuhr, N., Gövert, N., Malik, S., Lalmas, M., Kazai, G.: INEX (2007) https://inex.mmci.uni-saarland.de/
Goldman, R., Widom, J.: DataGuides: enabling query formulation and optimization in semistructured databases. In: Proceedings of the 23rd International Conference on Very Large Data Bases, VLDB 1997, pp. 436–445 (1997)
Grimsmo, N., Bjørklund, T.A., Hetland, M.L.: Fast optimal twig joins. In: Proceedings of the 36th International Conference on Very Large Data Bases, VLDB 2010, pp. 894–905. VLDB Endowment (2010)
Grust, T., van Keulen, M., Teubner, J.: Staircase join: teach a relational DBMS to watch its (Axis) steps. In: Proceedings of VLDB 2003, pp. 524–535 (2003)
Härder, T., Haustein, M., Mathis, C., Wagner, M.: Node labeling schemes for dynamic XML documents reconsidered. Data Knowl. Eng. 60, 126–149 (2007)
Jiang, H., Lu, H., Wang, W.: Efficient processing of XML twig queries with OR-predicates. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of data, pp. 59–70. ACM New York (2004)
Kaushik, R., Bohannon, P., Naughton, J., Korth, H.: Covering indexes for branching path queries. In: Proceedings of ACM SIGMOD 2002, pp. 133–144. ACM Press (2002)
Krátký, M., Bača, R., Snášel, V.: On the efficient processing regular path expressions of an enormous volume of XML data. In: Proceedings of DEXA 2007, vol. 4653 of LNCS, pp. 1–12. Springer (2007)
Krátký, M., Pokorný, J., Snášel, V.: Implementation of XPath axes in the multi-dimensional approach to indexing XML data. In: Current Trends in Database Technology, EDBT 2004, vol. 3268 of LNCS. Springer (2004)
Li, G., Feng, J., Zhang, Y., Zhou, L.: Efficient holistic twig joins in leaf-to-root combining with root-to-leaf way. In: Proceedings of the 12th International Conference on Database systems for Advanced Applications, DASFAA ’07, pp. 834–849. Springer (2007)
Li, J., Wang, J.: TwigBuffer: avoiding useless intermediate solutions completely in twig joins. In: The 13th International Conference on Database Systems for Advanced Applications, DASFAA 2008, vol. 4947, pp. 554–561. Springer (2008)
Lu, J., Chen, T., Ling, T.W.: Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In: Proceedings of ACM CIKM 2004, pp. 533–542. ACM Press (2004)
Lu, J., Ling, T.W., Bao, Z., Wang, C.: Extended XML tree pattern matching: theories and algorithms. IEEE Trans. Knowl. Data Eng. 23, 402–416 (2011)
Lu, J., Ling, T.W., Chan, C.-Y., Chen, T.: From region encoding to extended Dewey: on efficient processing of XML twig pattern matching. In: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB 2005, pp. 193–204 (2005)
Lu, J., Ling, T.W., Yu, T., Li, C., Ni, W.: Efficient processing of ordered XML twig pattern. In: Proceedings of DEXA 2005, vol. 3588 of LNCS, pp. 300–309. Springer (2005)
Lu, J., Meng, X., Ling, T.W.: Indexing and querying XML using extended Dewey labeling scheme. Data Knowl. Eng. 70(1), 35–59 (2011)
Ley, M.: The DBLP computer science bibliography, http://www.informatik.uni-trier.de/~ley/db/
Michiels, P., Mihaila, G., Siméon, J.: Put a tree pattern in your algebra. In: Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, pp. 246–255 (2007)
Moro, M.M., Vagena, Z., Tsotras, V.J.: Tree-pattern queries on a lightweight XML processor. In: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB 2005, pp. 205–216 (2005)
Paparizos, S., Wu, Y., Lakshmanan, L.V.S., Jagadish, H.V.: Tree logical classes for efficient evaluation of XQuery. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of data, pp. 71–82. ACM (2004)
Qin, L., Yu, J.X., Ding, B.: TwigList: make twig pattern matching fast. In: The 12th International Conference on Database Systems for Advanced Applications, DASFAA 2007, vol. 4443 of LNCS, pp. 850–862. Springer (2007)
Schmidt, A.R. et al.: The XML benchmark. Technical Report INS-R0103, CWI, The Netherlands (April 2001), http://monetdb.cwi.nl/xml/
Tatarinov, I., Viglas, S.D., Beyer, K., Shanmugasundaram, J., Shekita, E., Zhang, C.: Storing and querying ordered XML using a relational database system. In: Proceedings of ACM SIGMOD 2002, pp. 204–215. New York, USA (2002)
University of Washington Database Group: The XML Data Repository http://www.cs.washington.edu/research/xmldatasets/ 2002
W3 Consortium: XQuery 1.0: An XML Query Language, W3C Working Draft, 12 November 2003, http://www.w3.org/TR/xquery/
Wang, H., Park, S., Fan, W., Yu, P.S.: ViST: a dynamic index method for querying XML data by tree structures. In: Proceedings of the ACM SIGMOD 2003, pp. 110–121. ACM Press (2003)
Weiner, A.M., Härder, T.: Using structural joins and holistic twig joins for native XML query optimization. In: Advances in Databases and Information Systems, vol. 5739 of LNCS, pp. 149–163. Springer, Berlin Heidelberg (2009)
Weiner, A.M., Härder, T.: An integrative approach to query optimization in native XML database management systems. In: Proceedings of the Fourteenth International Database Engineering & Applications Symposium, IDEAS ’10, pp. 64–74. ACM, New York, NY, USA (2010)
Wu, H., Ling, T.W., Chen, B., Xu, L.: TwigTable: using semantics in XML twig pattern query processing. In: Journal on Data Semantics XV, vol. 6720 of LNCS, pp. 102–129. Springer, Berlin Heidelberg (2011)
Wu, H., Ling, T.W., Dobbie, G.: TP+Output: modeling complex output information in XML twig pattern query. In: Database and XML Technologies, pp. 128–143. Springer (2010)
Yang, B., Fontoura, M., Shekita, E., Rajagopalan, S., Beyer, K.: Virtual cursors for XML joins. In: Proceedings of the thirteenth ACM International Conference on Information and Knowledge Management, CIKM 2004, pp. 523–532. ACM (2004)
Yoshikawa, M., Amagasa, T., Shimura, T., Uemura, S.: XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Trans. Internet Technol 1(1), 110–141 (2001)
Yu, T., Ling, T.W., Lu, J.: TwigStackList\(\lnot \): a holistic twig join algorithm for twig query with not-predicates on XML data. In: The 11th International Conference on Database Systems for Advanced Applications, DASFAA 2006, vol. 3882, pp. 249–263. Springer (2006)
Zhang, C., Naughton, J., DeWitt, D., Luo, Q., Lohman, G.: On supporting containment queries in relational database management systems. In: Proceedings of ACM SIGMOD 2001, pp. 425–436 (2001)

Acknowledgments

This work is supported by the Grant of GACR No. GAP202/10/0573. Jiaheng Lu is partially supported by the National Science Foundation in China (No. 61170011).

Author information

Authors and Affiliations

Department of Computer Science, VŠB—Technical University of Ostrava, Ostrava, Czech Republic
Radim Bača & Michal Krátký
Department of Computer Science, National University of Singapore, Singapore, Singapore
Tok Wang Ling
DEKE, MOE and School of Information, Renmin University of China, Beijing, China
Jiaheng Lu

Corresponding author

Correspondence to Radim Bača.

Appendices

Appendix 1: Logical expression in getMatch

In Algorithm 5, we show an extended version of the fwdToAncOf function for logical expressions with NOT operators. The function is based on the preorder filtering functions for Boolean expressions introduced in [7, 18]. The algorithm uses a logical tree, which is a rooted tree whose leaf nodes are query nodes and whose non-leaf nodes are bool nodes. A logical tree represents the logical expression corresponding to a query node \(\#q\), where the logical variables of the expression are the child query nodes of \(\#q\). The function \(\mathtt{ltree(\#q)}\) returns the root node of \(\#q\)’s logical tree. Our algorithm assumes that every NOT bool node has exactly one child and that this child is a query node. Every logical expression can be rewritten into this form using De Morgan’s laws. The isUseless function, which represents the core functionality of fwdToAncOf, evaluates the logical tree and returns true if the head node of the current query node is useless.
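The rewriting step mentioned above can be illustrated with a minimal Python sketch. It is not the paper's Algorithm 5: the `Node` class, the `push_not_down` function, and the `evaluate` function are hypothetical names introduced here, and the truth assignment passed to `evaluate` merely stands in for whatever test decides that a child query node is satisfied. The sketch only shows how De Morgan's laws push every NOT down until it has a single query-node child, and that the rewritten tree evaluates identically.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List


@dataclass
class Node:
    """Logical-tree node (hypothetical representation).

    kind is "AND", "OR", "NOT" (bool nodes) or "Q" (query-node leaf).
    """
    kind: str
    name: str = ""
    children: List["Node"] = field(default_factory=list)


def push_not_down(node: Node, negate: bool = False) -> Node:
    """Rewrite the tree so that every NOT has exactly one Q child."""
    if node.kind == "Q":
        return Node("NOT", children=[node]) if negate else node
    if node.kind == "NOT":
        # A NOT flips the pending negation (double negation cancels).
        return push_not_down(node.children[0], not negate)
    kind = node.kind
    if negate:
        # De Morgan: NOT(a AND b) = NOT a OR NOT b, and dually for OR.
        kind = "OR" if kind == "AND" else "AND"
    return Node(kind, children=[push_not_down(c, negate) for c in node.children])


def evaluate(node: Node, satisfied: Dict[str, bool]) -> bool:
    """Evaluate the logical tree under a truth assignment of query nodes."""
    if node.kind == "Q":
        return satisfied[node.name]
    if node.kind == "NOT":
        return not evaluate(node.children[0], satisfied)
    results = [evaluate(c, satisfied) for c in node.children]
    return all(results) if node.kind == "AND" else any(results)


def q(name: str) -> Node:
    return Node("Q", name)


# NOT(a AND (b OR NOT c)) becomes (NOT a) OR ((NOT b) AND c).
expr = Node("NOT", children=[
    Node("AND", children=[q("a"), Node("OR", children=[
        q("b"), Node("NOT", children=[q("c")])])])])
normalized = push_not_down(expr)
```

A function such as isUseless would then only ever meet NOT nodes directly above query nodes, which is the precondition stated in the appendix.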

Appendix 2: Node concatenation

For the purpose of concatenation, we store a (prev:next) pair as a part of each stack item, where prev and next are pointers to nodes. Let \(R\) be the set of nodes stored in the intermediate storage that correspond to \(\#q\) and are descendants of \(n_{\#q}\). When we pop \(n_{\#q}\), its prev item points to the node \(n'_{\#q} \in R\) with the lowest document order among all nodes in \(R\), and its next item points to the node \(n''_{\#q} \in R\) with the highest document order among all nodes in \(R\).

Each pair is set to (empty:empty) when a node is pushed onto a stack. When a node \(n_{\#q}\) is popped from a stack, we set the (prev:next) pair of \(n^{anc}_{\#q}\), where \(n^{anc}_{\#q}\) is the top node of \(S_{\#q}\). If \(S_{\#q}\) is empty, we set the (prev:next) pair that belongs directly to \(S_{\#q}\). The procedure that sets the (prev:next) pair values runs in constant time since it accesses only the top item of \(S_{\#q}\). The (prev:next) pairs provide enough information to set up the list pointers in the intermediate storage; therefore, the nodes are sorted in preorder even though they are added in postorder.
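One plausible reading of this bookkeeping is sketched below in Python. The class names `Item` and `ConcatStack`, the `succ` list pointer, and the assumption that every popped node is kept in the intermediate storage are all illustrative choices made here, not details taken from the paper. Each pop touches only the popped item and the new top of the stack (or the stack itself), so it runs in constant time, and the resulting linked list is in preorder although nodes are linked in postorder.

```python
class Item:
    """Stack item carrying a (prev:next) pair (hypothetical layout)."""

    def __init__(self, label):
        self.label = label
        self.prev = None  # stored descendant with the lowest document order
        self.next = None  # stored descendant with the highest document order
        self.succ = None  # successor pointer in the intermediate storage list


class ConcatStack:
    """Stack S_q that links popped nodes into a preorder-sorted list."""

    def __init__(self):
        self.items = []
        self.prev = None  # pair kept directly on the stack, used when empty
        self.next = None

    def push(self, label):
        self.items.append(Item(label))  # pair starts as (empty:empty)

    def pop(self):
        n = self.items.pop()
        # In preorder, n precedes all of its already stored descendants.
        n.succ = n.prev
        chain_last = n.next if n.next is not None else n
        # The pair to update belongs to the new top, or to the stack itself.
        holder = self.items[-1] if self.items else self
        if holder.prev is None:
            holder.prev = n
        else:
            holder.next.succ = n  # append n's chain after the previous chain
        holder.next = chain_last
        return n


def walk(stack):
    """Return labels in list order, starting from the stack-level pair."""
    labels, node = [], stack.prev
    while node is not None:
        labels.append(node.label)
        node = node.succ
    return labels


# Nodes labelled by preorder number; nesting: 1 contains 2 and 3, 3 contains 4.
s = ConcatStack()
s.push(1); s.push(2); s.pop()
s.push(3); s.push(4); s.pop(); s.pop()
s.pop()
s.push(5); s.pop()
```

In this run the nodes are popped in the order 2, 4, 3, 1, 5, while walking the list from the stack-level pair visits them in document order.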

Appendix 3: GTP enumeration

In Algorithm 6, we depict the GTP enumeration used by GTPStack. It is a slight modification of the TwigList enumeration that works with enumeration query nodes. Enumeration query nodes are those main branch query nodes that are part of the intermediate storage, as described in Sect. 6.2.2. The enumeration parent relationship of an enumeration query node \(\#q\) is AD if any relationship between \(\#q\) and its enumeration parent query node is AD.

There are three pointers to the intermediate storage list corresponding to every \(\#q\): start[\(\#q\)], end[\(\#q\)], and move[\(\#q\)]. The start[\(\#q\)] and end[\(\#q\)] pointers specify an interval in the list and the move[\(\#q\)] pointer is always within this interval.
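The role of the three pointers can be illustrated with a small Python sketch. It is not Algorithm 6: it handles only a single parent and child pair of query nodes, and the `Cursor` class, the `enumerate_pairs` function, the exclusive `end` bound, and the precomputed interval attached to each parent node are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cursor:
    """start/end delimit an interval of a storage list; move scans it."""
    start: int
    end: int   # exclusive bound (an assumption of this sketch)
    move: int


def enumerate_pairs(parents: List[Tuple[str, int, int]],
                    children: List[str]) -> List[Tuple[str, str]]:
    """Pair each parent node with the child nodes inside its interval.

    parents: (label, start, end) triples, where [start, end) is the
             interval of the child list that belongs to that parent.
    children: child-list labels in document order.
    """
    result = []
    for label, start, end in parents:
        cur = Cursor(start=start, end=end, move=start)
        while cur.move < cur.end:
            # Invariant: start <= move < end while a pair is produced.
            assert cur.start <= cur.move < cur.end
            result.append((label, children[cur.move]))
            cur.move += 1
    return result


pairs = enumerate_pairs([("a1", 0, 2), ("a2", 2, 3)], ["b1", "b2", "b3"])
```

Because the move pointer never leaves the interval, the cost of the scan is proportional to the number of pairs produced.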

Appendix 4: Early enumeration

The Twig\(^2\)Stack algorithm introduces an early enumeration algorithm, which starts when we pop the last node from the stack \(S_{\#fbr}\) corresponding to the first branching node. In this way, we can significantly reduce the intermediate storage size. However, this approach has a limitation: if any top stack (i.e., any stack corresponding to a query node between the root and the first branching query node) contains more than one node, then the early enumeration outputs an unordered result. To detect this problem, we keep a switch E for each top stack, which is set to true when the stack is empty and to false when the stack contains more than one node. Note that if the top stack contains exactly one node, then E keeps its current value. The early enumeration can start only if all E switches corresponding to the output nodes are true.
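The behaviour of the switch E can be sketched in a few lines of Python. The `TopStack` class and the `can_start_early_enumeration` function are hypothetical names; the sketch assumes that E is re-evaluated after every push and pop and that it starts as true because the stack is initially empty.

```python
class TopStack:
    """A top stack with the switch E described above (illustrative)."""

    def __init__(self):
        self.nodes = []
        self.e = True  # the stack starts empty, so E starts as true

    def _update_switch(self):
        size = len(self.nodes)
        if size == 0:
            self.e = True
        elif size > 1:
            self.e = False
        # size == 1: E keeps its current value

    def push(self, node):
        self.nodes.append(node)
        self._update_switch()

    def pop(self):
        node = self.nodes.pop()
        self._update_switch()
        return node


def can_start_early_enumeration(output_stacks):
    """Early enumeration may start only if every relevant E is true."""
    return all(stack.e for stack in output_stacks)


s = TopStack()
trace = [s.e]
s.push("n1"); trace.append(s.e)   # one node: E unchanged
s.push("n2"); trace.append(s.e)   # two nodes: E becomes false
s.pop();      trace.append(s.e)   # one node again: E stays false
s.pop();      trace.append(s.e)   # empty: E becomes true
```

The trace shows why E is a switch and not a simple size test: after the stack has held two nodes, it stays false until the stack empties, even while only one node remains.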

Appendix 5: Real-world queries and processing time

A list of the selected real-world collections’ queries together with some important statistics can be found in Tables 11, 12 and 13. The raw data of query processing times for the real-world collections can be seen in Tables 14, 15 and 16.

Table 11 The XMARK queries

Table 12 The TreeBank queries

Table 13 The INEX queries

Table 14 Processing times for the XMARK queries

Table 15 Processing times for the TreeBank queries

Table 16 Processing times for the INEX queries

About this article

Bača, R., Krátký, M., Ling, T.W. et al. Optimal and efficient generalized twig pattern processing: a combination of preorder and postorder filterings. The VLDB Journal 22, 369–393 (2013). https://doi.org/10.1007/s00778-012-0295-5

Received: 26 October 2011
Revised: 06 July 2012
Accepted: 10 September 2012
Published: 10 October 2012
Issue Date: June 2013
DOI: https://doi.org/10.1007/s00778-012-0295-5
