Abstract
Keyword search over XML data has attracted considerable research effort over the last decade. A fundamental problem is how to efficiently answer a given keyword query w.r.t. a given query semantics. We find that the key source of inefficiency in existing methods is that they all suffer heavily from the common-ancestor-repetition problem. In this paper, we propose a novel form of inverted list, the IDList; the IDList for keyword \(k\) consists of the ordered nodes that directly or indirectly contain \(k\). We then show that finding keyword query results under the smallest lowest common ancestor (SLCA) and exclusive lowest common ancestor (ELCA) semantics can be reduced to the ordered set intersection problem, which has been heavily optimized because of its applications in areas such as information retrieval and database systems. We propose several algorithms that exploit set intersection in different directions, with or without additional indexes. We further propose several hash-search-based algorithms that simplify the operation of finding the nodes common to all involved IDLists. We have conducted an extensive set of experiments with many state-of-the-art algorithms and several large-scale datasets. The results show that our proposed methods outperform existing methods by up to two orders of magnitude in many cases.
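To make the reduction concrete, the following is a minimal sketch of how SLCA results might be obtained by intersecting IDLists. It is an illustration under stated assumptions, not the algorithms proposed in the paper: the toy document (`PARENT`), the keyword occurrences (`DIRECT`), and the function names are all hypothetical, node IDs are assumed to follow document order, and the final filter relies on the fact that the set of nodes common to all IDLists is closed under ancestors.

```python
from bisect import bisect_left

# Hypothetical toy document: node ID -> parent ID (IDs assigned in document order).
PARENT = {1: None, 2: 1, 3: 2, 4: 2, 5: 1, 6: 5, 7: 5}
# Hypothetical keyword occurrences: keyword -> nodes that directly contain it.
DIRECT = {"xml": [3, 6], "search": [4, 7]}


def build_idlist(keyword):
    """Sorted IDs of nodes that directly or indirectly contain the keyword."""
    ids = set()
    for node in DIRECT.get(keyword, []):
        # Walk up to the root; stop early once an ancestor is already recorded,
        # because all of its ancestors were recorded with it.
        while node is not None and node not in ids:
            ids.add(node)
            node = PARENT[node]
    return sorted(ids)


def intersect_sorted(a, b):
    """Intersect two ascending lists by probing the longer one with binary search."""
    if len(a) > len(b):
        a, b = b, a
    out, lo = [], 0
    for x in a:
        lo = bisect_left(b, x, lo)  # first position in b with b[pos] >= x
        if lo == len(b):
            break
        if b[lo] == x:
            out.append(x)
    return out


def slca(keywords):
    """Common nodes of all IDLists that have no child among the common nodes."""
    lists = sorted((build_idlist(k) for k in keywords), key=len)
    common = lists[0]
    for other in lists[1:]:
        common = intersect_sorted(common, other)
    parents_of_common = {PARENT[n] for n in common}
    return [n for n in common if n not in parents_of_common]


print(build_idlist("xml"))      # [1, 2, 3, 5, 6]
print(slca(["xml", "search"]))  # [2, 5]
```

In this toy document, nodes 1, 2, and 5 contain both keywords, and node 1 is discarded because its children 2 and 5 already do. Starting from the shortest list keeps the intermediate result small, which is a common heuristic for multi-list intersection rather than something stated in the abstract.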
Notes
Assuming all lists are sorted in ascending order, the matched element in \(L_{i}\) for an eliminator \(e\) is the smallest element of \(L_{i}\) that is greater than or equal to \(e\).
In Fig. 18, we take the result selectivity of existing methods as that of our method to make a fair comparison.
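The definition of the matched element in the first note corresponds to a lower-bound lookup in a sorted list. A minimal sketch, assuming each list is a Python list of integer node IDs in ascending order (the function name `matched_element` is hypothetical):

```python
from bisect import bisect_left


def matched_element(sorted_list, eliminator):
    """Return the smallest element >= eliminator, or None if no such element exists."""
    pos = bisect_left(sorted_list, eliminator)
    return sorted_list[pos] if pos < len(sorted_list) else None


print(matched_element([2, 5, 9], 5))   # 5
print(matched_element([2, 5, 9], 6))   # 9
print(matched_element([2, 5, 9], 10))  # None
```

When the matched element equals the eliminator, the eliminator is present in that list; otherwise the matched element is the next candidate from which a search could continue.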
Acknowledgments
This research was partially supported by the grants from the Natural Science Foundation of China (No. 61073060, 60833005, 61070055, 91024032, 91124001), the National Science and Technology Major Project (No. 2010-ZX01042-002-003), the Fundamental Research Funds for the Central Univ., the Research Funds of Renmin Univ. (No. 11XNL010, 10XNI018), and the Research Funds from Education Department of Hebei Province (No. Y2012014). Zhifeng Bao’s research is carried out at the SeSaMe Centre. It is supported by the Singapore NRF under its IRC@SG Funding Initiative and administered by the IDMPO. Wei Wang was partially supported by ARC DP130103401 and DP130103405.
About this article
Cite this article
Zhou, J., Bao, Z., Wang, W. et al. Efficient query processing for XML keyword queries based on the IDList index. The VLDB Journal 23, 25–50 (2014). https://doi.org/10.1007/s00778-013-0313-2
DOI: https://doi.org/10.1007/s00778-013-0313-2