A general framework to resolve the MisMatch problem in XML keyword search

Bao, Zhifeng; Zeng, Yong; Ling, Tok Wang; Zhang, Dongxiang; Li, Guoliang; Jagadish, H. V.

doi:10.1007/s00778-015-0386-1

A general framework to resolve the MisMatch problem in XML keyword search

Regular Paper
Published: 18 April 2015

Volume 24, pages 493–518, (2015)
Cite this article

The VLDB Journal Aims and scope Submit manuscript

Zhifeng Bao¹,
Yong Zeng²,
Tok Wang Ling²,
Dongxiang Zhang²,
Guoliang Li³ &
…
H. V. Jagadish⁴

1811 Accesses
7 Citations
Explore all metrics

Abstract

When users issue a query to a database, they have expectations about the results. If what they search for is unavailable in the database, the system will return an empty result or, worse, erroneous mismatch results. We call this problem the MisMatch problem. In this paper, we solve the MisMatch problem in the context of XML keyword search. Our solution is based on two novel concepts that we introduce: target node type and Distinguishability. Target Node Type represents the type of node a query result intends to match, and Distinguishability is used to measure the importance of the query keywords. Using these concepts, we develop a low-cost post-processing algorithm on the results of query evaluation to detect the MisMatch problem and generate helpful suggestions to users. Our approach has three noteworthy features: (1) for queries with the MisMatch problem, it generates the explanation, suggested queries and their sample results as the output to users, helping users judge whether the MisMatch problem is solved without reading all query results; (2) it is portable as it can work with any lowest common ancestor-based matching semantics (for XML data without ID references) or minimal Steiner tree-based matching semantics (for XML data with ID references) which return tree structures as results. It is orthogonal to the choice of result retrieval method adopted; (3) it is lightweight in the way that it occupies a very small proportion of the whole query evaluation time. Extensive experiments on three real datasets verify the effectiveness, efficiency and scalability of our approach. A search engine called XClear has been built and is available at http://xclear.comp.nus.edu.sg.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Notes

Specifically, we use strong DataGuide proposed in [11]. Our purpose of using DataGuide is only to provide a data structure to store the occurrence constraints summarized from the XML data. Many other data structures are possible.
\(t_i\) may not necessarily be a one-to-one mapping to \(m_i\) because two keyword match nodes, say \(m_i\) and \(m_j\), could be of the same node type.
The choice of an appropriate \(\tau \) will be discussed in the experimental study.
http://www.imdb.com/interfaces.
https://inex.mmci.uni-saarland.de/.
Thanks to Craig Rodkin at ACM Headquarters for providing the ACM Digital Library dataset.
Here by default we adopt \(\tau \) = 0.9. Experiment on effects of threshold setting is discussed in Sect. 7.3.4.

References

Berkeley, D.B.: http://www.sleepycat.com
Bao, Z., Ling, T.W., Chen, B., Lu, J.: Effective xml keyword search with relevance oriented ranking. In: ICDE (2009)
Bao, Z., Lu, J., Ling, T.W., Chen, B.: Towards an effective xml keyword search. IEEE Trans. Knowl. Data Eng. 22(8), 1077–1092 (2010)
Article Google Scholar
Bao, Z., Lu, J., Ling, T.W., Xu, L., Wu, H.: An effective object-level xml keyword search. In: DASFAA (2010)
Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., Sudarshan, S.: Keyword searching and browsing in databases using banks. In: ICDE (2002)
Chapman, A., Jagadish, H.V.: Why not? In: SIGMOD (2009)
Coffman, J., Weaver, A.C.: An empirical performance evaluation of relational keyword search techniques. IEEE Trans. Knowl. Data Eng. 26(1), 30–42 (2014)
Ding, B., Yu, J.X., Wang, S., Qin, L., Zhang, X., Lin, X.: Finding top-k min-cost connected trees in databases. In: ICDE (2007)
Dreyfus, S.E., Wagner, R.A.: The Steiner problem in graphs. In: Networks (1971)
Drosou, M., Pitoura, E.: Ymaldb: exploring relational databases via result-driven recommendations. VLDB J. 22(6), 849–874 (2013)
Article Google Scholar
Goldman, R., Widom, J.: Dataguides: enabling query formulation and optimization in semistructured databases. In: VLDB (1997)
Guo, L., Shao, F., Botev, C., Shanmugasundaram, J.: Xrank: ranked keyword search over xml documents. In: SIGMOD (2003)
Hadjieleftheriou, M., Chandel, A., Koudas, N., Srivastava, D.: Fast indexes and algorithms for set similarity selection queries. In: ICDE (2008)
He, H., Wang, H., Yang, J., Yu, P.S.: Blinks: ranked keyword searches on graphs. In: SIGMOD (2007)
Hristidis, V., Koudas, N., Papakonstantinou, Y., Srivastava, D.: Keyword proximity search in xml trees. IEEE Trans. Knowl. Data Eng. 18(4), 525–539 (2006)
Hristidis, V., Papakonstantinou, Y., Balmin, A.: Keyword proximity search on xml graphs. In: ICDE (2003)
Huang, J., Chen, T., Doan, A., Naughton, J.F.: On the provenance of non-answers to queries over extracted data. PVLDB 1(1), 736–747 (2008)
Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of ir techniques. ACM Trans. Inf. Syst. 20(4), 422–446 (2002)
Jones, R., Rey, B., Madani, O., Greiner, W.: Generating query substitutions. In: WWW (2006)
Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., Karambelkar, H.: Bidirectional expansion for keyword search on graph databases. In: VLDB (2005)
Kasneci, G., Ramanath, M., Sozio, M., Suchanek, F.M., Weikum, G.: Star: Steiner-tree approximation in relationship graphs. In: ICDE (2009)
Lee, K.H., Whang, K.Y., Han, W.S., Kim, M.S.: Structural consistency: enabling xml keyword search to eliminate spurious results consistently. VLDB J. 19(4), 503–529 (2010)
Lemire, D., Kaser, O., Aouiche, K.: Sorting improves word-aligned bitmap indexes. Data Knowl. Eng. 69(1), 3–28 (2010)
Li, G., Feng, J., Wang, J., Zhou, L.: Effective keyword search for valuable lcas over xml documents. In: CIKM (2007)
Li, G., Li, C., Feng, J., Zhou, L.: Sail: structure-aware indexing for effective and progressive top-k keyword search over xml documents. Inf. Sci. 179(21), 3745–3762 (2009)
Liu, Z., Chen, Y.: Identifying meaningful return information for xml keyword search. In: SIGMOD (2007)
Liu, Z., Chen, Y.: Reasoning and identifying relevant matches for xml keyword search. PVLDB 1(1), 921–932 (2008)
Liu, Z., Sun, P., Chen, Y.: Structured search result differentiation. PVLDB 2(1), 313–324 (2009)
Muslea, I.: Machine learning for online query relaxation. In: KDD (2004)
Muslea, I., Lee, T.J.: Online query relaxation via bayesian causal structures discovery. In: AAAI (2005)
Nambiar, U., Kambhampati, S.: Answering imprecise queries over autonomous web databases. In: ICDE (2006)
Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill Inc., New York (1986)
Google Scholar
Schmidt, A., Kersten, M.L., Windhouwer, M.: Querying xml documents made easy: nearest concept queries. In: ICDE (2001)
Sun, C., Chan, C.Y., Goenka, A.K.: Multiway slca-based keyword search in xml data. In: WWW (2007)
Tao, Y., Papadopoulos, S., Sheng, C., Stefanidis, K., Stefanidis, K.: Nearest keyword search in xml documents. In: SIGMOD (2011)
Termehchy, A., Winslett, M.: Using structural information in xml keyword search effectively. ACM Trans. Database Syst. 36(1), 4 (2011)
Vesper, V.: http://www.mtsu.edu/vvesper/dewey.html
Xu, Y., Papakonstantinou, Y.: Efficient keyword search for smallest lcas in xml databases. In: SIGMOD (2005)
Zeng, Y., Bao, Z., Ling, T.W., Jagadish, H.V., Li, G.: Breaking out of the mismatch trap. In: ICDE (2014)
Zeng, Y., Bao, Z., Ling, T.W., Li, G.: Efficient xml keyword search: from graph model to tree model. In: DEXA (2013)
Zeng, Y., Bao, Z., Ling, T.W., Li, G.: Removing the mismatch headache in xml keyword search. In: SIGIR (2013, demo paper. http://xclear.comp.nus.edu.sg)
Zhang, W.V., He, X., Rey, B., Jones, R.: Query rewriting using active learning for sponsored search. In: SIGIR (2007)

Download references

Acknowledgments

H.V. Jagadish is supported in part by NSF IIS-1250880 and IIS-1017296.

Author information

Authors and Affiliations

School of Computer Science and Information Technology, RMIT University, Melbourne, 3000, Australia
Zhifeng Bao
School of Computing, National University of Singapore, Singapore, 117417, Singapore
Yong Zeng, Tok Wang Ling & Dongxiang Zhang
Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Guoliang Li
Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, 48109, USA
H. V. Jagadish

Authors

Zhifeng Bao
View author publications
You can also search for this author in PubMed Google Scholar
Yong Zeng
View author publications
You can also search for this author in PubMed Google Scholar
Tok Wang Ling
View author publications
You can also search for this author in PubMed Google Scholar
Dongxiang Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Guoliang Li
View author publications
You can also search for this author in PubMed Google Scholar
H. V. Jagadish
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Zhifeng Bao or Yong Zeng.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Bao, Z., Zeng, Y., Ling, T.W. et al. A general framework to resolve the MisMatch problem in XML keyword search. The VLDB Journal 24, 493–518 (2015). https://doi.org/10.1007/s00778-015-0386-1

Download citation

Received: 26 May 2014
Revised: 30 March 2015
Accepted: 30 March 2015
Published: 18 April 2015
Issue Date: August 2015
DOI: https://doi.org/10.1007/s00778-015-0386-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A general framework to resolve the MisMatch problem in XML keyword search

Abstract

Access this article

Similar content being viewed by others

A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration

Multi-model query languages: taming the variety of big data

Boolean interpretation, matching, and ranking of natural language queries in product selection systems

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding authors

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A general framework to resolve the MisMatch problem in XML keyword search

Abstract

Access this article

Similar content being viewed by others

A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration

Multi-model query languages: taming the variety of big data

Boolean interpretation, matching, and ranking of natural language queries in product selection systems

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding authors

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation