Skip to main content
Log in

A general framework to resolve the MisMatch problem in XML keyword search

  • Regular Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

When users issue a query to a database, they have expectations about the results. If what they search for is unavailable in the database, the system will return an empty result or, worse, erroneous mismatch results. We call this problem the MisMatch problem. In this paper, we solve the MisMatch problem in the context of XML keyword search. Our solution is based on two novel concepts that we introduce: target node type and Distinguishability. Target Node Type represents the type of node a query result intends to match, and Distinguishability is used to measure the importance of the query keywords. Using these concepts, we develop a low-cost post-processing algorithm on the results of query evaluation to detect the MisMatch problem and generate helpful suggestions to users. Our approach has three noteworthy features: (1) for queries with the MisMatch problem, it generates the explanation, suggested queries and their sample results as the output to users, helping users judge whether the MisMatch problem is solved without reading all query results; (2) it is portable as it can work with any lowest common ancestor-based matching semantics (for XML data without ID references) or minimal Steiner tree-based matching semantics (for XML data with ID references) which return tree structures as results. It is orthogonal to the choice of result retrieval method adopted; (3) it is lightweight in the way that it occupies a very small proportion of the whole query evaluation time. Extensive experiments on three real datasets verify the effectiveness, efficiency and scalability of our approach. A search engine called XClear has been built and is available at http://xclear.comp.nus.edu.sg.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11

Similar content being viewed by others

Notes

  1. Specifically, we use strong DataGuide proposed in [11]. Our purpose of using DataGuide is only to provide a data structure to store the occurrence constraints summarized from the XML data. Many other data structures are possible.

  2. \(t_i\) may not necessarily be a one-to-one mapping to \(m_i\) because two keyword match nodes, say \(m_i\) and \(m_j\), could be of the same node type.

  3. The choice of an appropriate \(\tau \) will be discussed in the experimental study.

  4. http://www.imdb.com/interfaces.

  5. https://inex.mmci.uni-saarland.de/.

  6. Thanks to Craig Rodkin at ACM Headquarters for providing the ACM Digital Library dataset.

  7. Here by default we adopt \(\tau \) = 0.9. Experiment on effects of threshold setting is discussed in Sect. 7.3.4.

References

  1. Berkeley, D.B.: http://www.sleepycat.com

  2. Bao, Z., Ling, T.W., Chen, B., Lu, J.: Effective xml keyword search with relevance oriented ranking. In: ICDE (2009)

  3. Bao, Z., Lu, J., Ling, T.W., Chen, B.: Towards an effective xml keyword search. IEEE Trans. Knowl. Data Eng. 22(8), 1077–1092 (2010)

    Article  Google Scholar 

  4. Bao, Z., Lu, J., Ling, T.W., Xu, L., Wu, H.: An effective object-level xml keyword search. In: DASFAA (2010)

  5. Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., Sudarshan, S.: Keyword searching and browsing in databases using banks. In: ICDE (2002)

  6. Chapman, A., Jagadish, H.V.: Why not? In: SIGMOD (2009)

  7. Coffman, J., Weaver, A.C.: An empirical performance evaluation of relational keyword search techniques. IEEE Trans. Knowl. Data Eng. 26(1), 30–42 (2014)

  8. Ding, B., Yu, J.X., Wang, S., Qin, L., Zhang, X., Lin, X.: Finding top-k min-cost connected trees in databases. In: ICDE (2007)

  9. Dreyfus, S.E., Wagner, R.A.: The Steiner problem in graphs. In: Networks (1971)

  10. Drosou, M., Pitoura, E.: Ymaldb: exploring relational databases via result-driven recommendations. VLDB J. 22(6), 849–874 (2013)

    Article  Google Scholar 

  11. Goldman, R., Widom, J.: Dataguides: enabling query formulation and optimization in semistructured databases. In: VLDB (1997)

  12. Guo, L., Shao, F., Botev, C., Shanmugasundaram, J.: Xrank: ranked keyword search over xml documents. In: SIGMOD (2003)

  13. Hadjieleftheriou, M., Chandel, A., Koudas, N., Srivastava, D.: Fast indexes and algorithms for set similarity selection queries. In: ICDE (2008)

  14. He, H., Wang, H., Yang, J., Yu, P.S.: Blinks: ranked keyword searches on graphs. In: SIGMOD (2007)

  15. Hristidis, V., Koudas, N., Papakonstantinou, Y., Srivastava, D.: Keyword proximity search in xml trees. IEEE Trans. Knowl. Data Eng. 18(4), 525–539 (2006)

  16. Hristidis, V., Papakonstantinou, Y., Balmin, A.: Keyword proximity search on xml graphs. In: ICDE (2003)

  17. Huang, J., Chen, T., Doan, A., Naughton, J.F.: On the provenance of non-answers to queries over extracted data. PVLDB 1(1), 736–747 (2008)

  18. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of ir techniques. ACM Trans. Inf. Syst. 20(4), 422–446 (2002)

  19. Jones, R., Rey, B., Madani, O., Greiner, W.: Generating query substitutions. In: WWW (2006)

  20. Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., Karambelkar, H.: Bidirectional expansion for keyword search on graph databases. In: VLDB (2005)

  21. Kasneci, G., Ramanath, M., Sozio, M., Suchanek, F.M., Weikum, G.: Star: Steiner-tree approximation in relationship graphs. In: ICDE (2009)

  22. Lee, K.H., Whang, K.Y., Han, W.S., Kim, M.S.: Structural consistency: enabling xml keyword search to eliminate spurious results consistently. VLDB J. 19(4), 503–529 (2010)

  23. Lemire, D., Kaser, O., Aouiche, K.: Sorting improves word-aligned bitmap indexes. Data Knowl. Eng. 69(1), 3–28 (2010)

  24. Li, G., Feng, J., Wang, J., Zhou, L.: Effective keyword search for valuable lcas over xml documents. In: CIKM (2007)

  25. Li, G., Li, C., Feng, J., Zhou, L.: Sail: structure-aware indexing for effective and progressive top-k keyword search over xml documents. Inf. Sci. 179(21), 3745–3762 (2009)

  26. Liu, Z., Chen, Y.: Identifying meaningful return information for xml keyword search. In: SIGMOD (2007)

  27. Liu, Z., Chen, Y.: Reasoning and identifying relevant matches for xml keyword search. PVLDB 1(1), 921–932 (2008)

  28. Liu, Z., Sun, P., Chen, Y.: Structured search result differentiation. PVLDB 2(1), 313–324 (2009)

  29. Muslea, I.: Machine learning for online query relaxation. In: KDD (2004)

  30. Muslea, I., Lee, T.J.: Online query relaxation via bayesian causal structures discovery. In: AAAI (2005)

  31. Nambiar, U., Kambhampati, S.: Answering imprecise queries over autonomous web databases. In: ICDE (2006)

  32. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill Inc., New York (1986)

    Google Scholar 

  33. Schmidt, A., Kersten, M.L., Windhouwer, M.: Querying xml documents made easy: nearest concept queries. In: ICDE (2001)

  34. Sun, C., Chan, C.Y., Goenka, A.K.: Multiway slca-based keyword search in xml data. In: WWW (2007)

  35. Tao, Y., Papadopoulos, S., Sheng, C., Stefanidis, K., Stefanidis, K.: Nearest keyword search in xml documents. In: SIGMOD (2011)

  36. Termehchy, A., Winslett, M.: Using structural information in xml keyword search effectively. ACM Trans. Database Syst. 36(1), 4 (2011)

  37. Vesper, V.: http://www.mtsu.edu/vvesper/dewey.html

  38. Xu, Y., Papakonstantinou, Y.: Efficient keyword search for smallest lcas in xml databases. In: SIGMOD (2005)

  39. Zeng, Y., Bao, Z., Ling, T.W., Jagadish, H.V., Li, G.: Breaking out of the mismatch trap. In: ICDE (2014)

  40. Zeng, Y., Bao, Z., Ling, T.W., Li, G.: Efficient xml keyword search: from graph model to tree model. In: DEXA (2013)

  41. Zeng, Y., Bao, Z., Ling, T.W., Li, G.: Removing the mismatch headache in xml keyword search. In: SIGIR (2013, demo paper. http://xclear.comp.nus.edu.sg)

  42. Zhang, W.V., He, X., Rey, B., Jones, R.: Query rewriting using active learning for sponsored search. In: SIGIR (2007)

Download references

Acknowledgments

H.V. Jagadish is supported in part by NSF IIS-1250880 and IIS-1017296.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Zhifeng Bao or Yong Zeng.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Bao, Z., Zeng, Y., Ling, T.W. et al. A general framework to resolve the MisMatch problem in XML keyword search. The VLDB Journal 24, 493–518 (2015). https://doi.org/10.1007/s00778-015-0386-1

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-015-0386-1

Keywords

Navigation