XHQE: A hybrid system for scalable selectivity estimation of XML queries

Information Systems Frontiers

Abstract

With the increasing popularity of XML applications in enterprise and big data systems, efficient query optimizers are becoming essential. The performance of an XML query optimizer depends heavily on the query selectivity estimators it uses to find the best possible query execution plan. In this work, we propose a novel selectivity estimator, called XHQE, which is a hybrid of a structural synopsis and structural statistics. The structural synopsis enhances the accuracy of estimation, and the structural statistics make the estimator scalable to the allocated memory space. The structural synopsis is generated by labeling the nodes of the source XML dataset using a fingerprint function and merging subtrees with similar fingerprints (i.e., having similar structures). The generated structural synopsis and structural statistics are then used to estimate the selectivity of given queries. We studied the performance of the proposed approach using different types of queries and four benchmark datasets with different structural characteristics. We compared XHQE with existing algorithms, namely Sampling, TreeSketch and one histogram-based algorithm. The experimental results showed that XHQE is significantly better than the other algorithms in terms of estimation accuracy and scalability for semi-uniform datasets. For non-uniform datasets, the proposed algorithm has estimation accuracy comparable to that of TreeSketch as the allocated memory size is sharply reduced, yet the estimation data generation time of the proposed approach is much lower (e.g., TreeSketch took more than 50 times longer than the proposed approach on the XMark dataset). Compared with the histogram-based algorithm, our approach supports regular twig queries in addition to having higher accuracy when both run under similar memory constraints.
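The synopsis construction described in the abstract (label nodes with a fingerprint, then merge subtrees whose fingerprints match) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the fingerprint function, so the sketch assumes a SHA-1 hash over the element tag and the set of distinct child fingerprints, and it treats "similar" as "equal fingerprint". The names `fingerprint` and `build_synopsis` are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def fingerprint(elem, synopsis):
    """Label the subtree rooted at elem and record it in the synopsis.

    Assumption for illustration: the fingerprint is a truncated SHA-1 hash of
    the tag plus the sorted set of distinct child fingerprints. Sibling order
    and repetition are ignored, so subtrees with the same shape share a label.
    """
    child_fps = sorted({fingerprint(child, synopsis) for child in elem})
    raw = elem.tag + "(" + ",".join(child_fps) + ")"
    fp = hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]
    # Merging step: every subtree with this fingerprint maps to one synopsis
    # entry, which keeps only the tag and an occurrence count.
    entry = synopsis.setdefault(fp, {"tag": elem.tag, "count": 0})
    entry["count"] += 1
    return fp


def build_synopsis(xml_text):
    """Parse an XML string and return {fingerprint: {"tag", "count"}}."""
    synopsis = {}
    fingerprint(ET.fromstring(xml_text), synopsis)
    return synopsis
```

Under these assumptions, two `book` elements that contain the same kinds of children collapse into a single synopsis entry with a count of 2, which is where the memory saving comes from. How XHQE combines such a synopsis with structural statistics to answer queries is described in the full article, not in this sketch.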

Acknowledgments

The first author would like to acknowledge the support provided by King Abdulaziz City for Science and Technology (KACST) through the Science & Technology Unit at King Fahd University of Petroleum & Minerals (KFUPM) for funding this work under Project no. 11-INF1658-04 as part of the National Science, Technology, and Innovation Plan.

Author information

Corresponding author

Correspondence to E.-S. M. El-Alfy.

About this article

Cite this article

El-Alfy, ES.M., Mohammed, S. & Barradah, A.F. XHQE: A hybrid system for scalable selectivity estimation of XML queries. Inf Syst Front 18, 1233–1249 (2016). https://doi.org/10.1007/s10796-015-9561-6

  • DOI: https://doi.org/10.1007/s10796-015-9561-6
