Active learning in keyword search-based data integration

  • Special Issue Paper
The VLDB Journal

Abstract

The problem of scaling up data integration, such that new sources can be quickly utilized as they are discovered, remains elusive: Global schemas for integrated data are difficult to develop and expand, and schema and record matching techniques are limited by the fact that data and metadata are often under-specified and must be disambiguated by data experts. One promising approach is to avoid using a global schema, and instead to develop keyword search-based data integration—where the system lazily discovers associations enabling it to join together matches to keywords, and return ranked results. The user is expected to understand the data domain and provide feedback about answers’ quality. The system generalizes such feedback to learn how to correctly integrate data. A major open challenge is that under this model, the user only sees and offers feedback on a few “top-\(k\)” results: This result set must be carefully selected to include answers of high relevance and answers that are highly informative when feedback is given on them. Existing systems merely focus on predicting relevance, by composing the scores of various schema and record matching algorithms. In this paper, we show how to predict the uncertainty associated with a query result’s score, as well as how informative feedback is on a given result. We build upon these foundations to develop an active learning approach to keyword search-based data integration, and we validate the effectiveness of our solution over real data from several very different domains.

Notes

  1. Consider, e.g., the situation where users put data into comments fields because there was no appropriate column in the schema.

  2. By default tf-idf over the tuples in the data, although other metrics such as edit distance or \(n\)-grams could be used.

  3. For simplicity, we describe the outcome as if each query produces one result, although the system actually iteratively enumerates top-scoring queries, even beyond \(k\) such queries, until it gets \(k\) answers.

References

  1. Agrawal, S., Chaudhuri, S., Das, G.: DBXplorer: A system for keyword-based search over relational databases. In: ICDE (2002)

  2. Arasu, A., Götz, M., Kaushik, R.: On active learning of record matching packages. In: SIGMOD Conference, pp. 783–794 (2010)

  3. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: A nucleus for a web of open data. In: ISWC/ASWC (2007)

  4. Balmin, A., Hristidis, V., Papakonstantinou, Y.: ObjectRank: Authority-based keyword search in databases. In: VLDB (2004)

  5. Bergamaschi, S., Domnori, E., Guerra, F., Trillo Lado, R., Velegrakis, Y.: Keyword search over relational databases: a metadata approach. In: SIGMOD (2011)

  6. Betteridge, J., Carlson, A., Hong, S.A., Hruschka Jr., E.R., Law, E.L.M., Mitchell, T.M., Wang, S.H.: Toward never ending language learning. In: AAAI Spring Symposium: Learning by Reading and Learning to Read (2009)

  7. Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., Sudarshan, S.: Keyword searching and browsing in databases using BANKS. In: ICDE, pp. 431–440 (2002)

  8. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive–aggressive algorithms. J. Mach. Learn. Res. 7, 551–585 (2006)

  9. Craswell, N., Zoeter, O., Taylor, M.J., Ramsey, B.: An experimental comparison of click position-bias models. In: WSDM, pp. 87–94 (2008)

  10. Culotta, A., McCallum, A.: Reducing labeling effort for structured prediction tasks. In: AAAI, pp. 746–751 (2005)

  11. Deng, T., Fan, W.: On the complexity of query result diversification. Proc. VLDB Endow. 6(8), 557–588 (2013)

  12. Do, H.H., Rahm, E.: Matching large schemas: Approaches and evaluation. Inf. Syst. 32(6), 857–885 (2007)

  13. Doan, A., Domingos, P., Halevy, A.Y.: Reconciling schemas of disparate data sources: a machine-learning approach. In: SIGMOD (2001)

  14. Drosou, M., Pitoura, E.: Search result diversification. SIGMOD Rec. 39(1), 41–47 (2010)

  15. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A survey. IEEE TKDE 19(1), 1–16 (2007)

  16. Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4), 614–656 (2003)

  17. Franklin, M., Halevy, A., Maier, D.: From databases to dataspaces: a new abstraction for information management. SIGMOD Rec. 34(4), 27–33 (2005)

  18. Gal, A.: Uncertain Schema Matching. Synth. Lect. Data Manag. 3(1), 1–97 (2011)

  19. Gal, A., Sagi, T.: Tuning the ensemble selection process of schema matchers. Inf. Syst. 35(8), 845–859 (2010)

  20. Gollapudi, S., Sharma, A.: An axiomatic approach for result diversification. In: Proceedings of the 18th International Conference on World Wide Web, WWW ’09 (2009)

  21. Gravano, L., Ipeirotis, P.G., Koudas, N., Srivastava, D.: Text joins in an RDBMS for web data integration. In: WWW (2003)

  22. Guo, F., Liu, C., Kannan, A., Minka, T., Taylor, M.J., Wang, Y.M., Faloutsos, C.: Click chain model in web search. In: WWW, pp. 11–20 (2009)

  23. Guo, L., Shao, F., Botev, C., Shanmugasundaram, J.: XRANK: Ranked keyword search over XML documents. In: SIGMOD (2003)

  24. He, H., Wang, H., Yang, J., Yu, P.S.: BLINKS: ranked keyword searches on graphs. In: SIGMOD (2007)

  25. Hristidis, V., Papakonstantinou, Y.: Discover: Keyword search in relational databases. In: VLDB, pp. 670–681 (2002)

  26. Hwa, R.: Sample selection for statistical parsing. Comput. Linguist. 30(3), 253–276 (2004)

  27. Ilyas, I.F., Aref, W.G., Elmagarmid, A.K.: Supporting top-k join queries in relational databases. In: VLDB (2003)

  28. Jacob, M., Ives, Z.G.: Sharing work in keyword search over databases. In: SIGMOD (2011)

  29. Jeffery, S.R., Franklin, M.J., Halevy, A.Y.: Pay-as-you-go user feedback for dataspace systems. In: SIGMOD (2008)

  30. Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., Karambelkar, H.: Bidirectional expansion for keyword search on graph databases. In: VLDB, pp. 505–516 (2005)

  31. Kimelfeld, B., Sagiv, Y.: Finding and approximating top-k answers in keyword proximity search. In: PODS, pp. 173–182 (2006)

  32. Marian, A., Bruno, N., Gravano, L.: Evaluating top-k queries over web-accessible databases. ACM Trans. Database Syst. 29(2), 319–362 (2004)

  33. Marie, A., Gal, A.: Managing uncertainty in schema matcher ensembles. In: SUM, pp. 60–73 (2007)

  34. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In: ICDE (2002)

  35. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)

  36. Sayyadian, M., LeKhac, H., Doan, A., Gravano, L.: Efficient keyword search across heterogeneous relational databases. In: ICDE (2007)

  37. Settles, B.: Active Learning. Morgan and Claypool, Cambridge (2012)

  38. Settles, B., Craven, M.: An analysis of active learning strategies for sequence labeling tasks. In: EMNLP (2008)

  39. Settles, B., Craven, M., Ray, S.: Multiple-instance active learning. In: NIPS (2007)

  40. Shen, S., Hu, B., Chen, W., Yang, Q.: Personalized click model through collaborative filtering. In: WSDM, pp. 323–332 (2012)

  41. Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO: A large ontology from Wikipedia and WordNet. J. Web Sem. 6(3), 203–217 (2008)

  42. Talukdar, P.P., Ives, Z.G., Pereira, F.: Automatically incorporating new sources in keyword search-based data integration. In: SIGMOD (2010)

  43. Talukdar, P.P., Jacob, M., Mehmood, M.S., Crammer, K., Ives, Z.G., Pereira, F., Guha, S.: Learning to create data-integrating queries. In: VLDB (2008)

  44. Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., Ilyas, I.F.: Guided data repair. PVLDB 4(5), 279–289 (2011)

  45. Yan, Z., Zheng, N., Ives, Z., Talukdar, P., Yu, C.: Actively soliciting feedback for query answers in keyword search-based data integration. In: PVLDB (2013)

Acknowledgments

We thank Burr Settles for his advice on active learning, and the anonymous reviewers for their feedback. This work was funded in part by the National Science Foundation Grants IIS-1050448, IIS-1217798, IIS-0477972, IIS-0513778, CNS-0721541, and by a gift from Google. Portions of this work were done when P. Talukdar was at Carnegie Mellon University.

Author information

Corresponding author

Correspondence to Zhepeng Yan.

Appendix: Incremental update on source discovery

In addition to finding informative query answers and learning from users’ feedback, another challenge in keyword search-based data integration is to incrementally update the underlying model when new data sources are added [42]. This involves updating not only the base data, in the form of the schema graph, but also any materialized views that were formulated through keyword search. We wish to automatically combine new data sources into the existing schema graph and to predict edge costs, so that we can discover query trees that may generate useful new results for existing keyword search-based views.

In more detail, within the Q System, each keyword query can be saved as a view whose results can be revisited over time. For each view, when new data sources are connected, we seek to add only those new alignment edges that can potentially affect the results in the view. Formally, suppose \(G = (V, E)\) is the existing schema graph and \(G' = (V', E')\) is the schema graph of the newly added sources. We are also given a fixed view derived from a keyword search query \(Q = \{K_i\}\). The goal of automatic incremental update is to derive a probability distribution over edge costs for each pair of attribute nodes \((v, v')\) where \(v\in V\) and \(v'\in V'\). Note that naively performing this computation for all possible pairs requires examining \(\varOmega (|V||V'|)\) pairs, a quadratic cost that does not scale as the number of schema graph nodes grows. Ideally, we want a strategy that computes only the subset of possible joins that can actually produce results affecting the top-\(k\) answers of the existing view.

Our information need-based strategy adopts a pruning approach and limits our search space to only a subset of possible pairs. Let \(C_{max}\) be the maximum expected tree cost (relevance) among all top-\(k\) trees in a fixed view. We also set a threshold \(\tau > C_{max}\) (but not too large). Building upon [42], we say that a new attribute node \(A\) is feasible if and only if there exists a keyword node \(K_i\in Q\) such that the minimum expected cost from \(K_i\) to \(A\) is less than \(\tau \).
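The feasibility condition above can be written compactly. Here \(d(K_i, A)\) is our own shorthand, not notation from the paper, for the minimum expected cost of a path from keyword node \(K_i\) to attribute node \(A\):

\[
A \text{ is feasible} \iff \min_{K_i \in Q} d(K_i, A) < \tau , \qquad \text{where } \tau > C_{max} .
\]

Under this reading, a new attribute node is pruned exactly when its cheapest connection to every keyword in \(Q\) costs at least \(\tau \).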

We formalize this in Algorithm 4, which identifies all feasible attribute nodes using breadth-first search (BFS). The algorithm starts with all the keyword nodes as seeds and iteratively expands into new regions. Before expanding a node, we check whether its expected distance to the keywords is greater than \(\tau \); if so, the algorithm does not expand further from that node. Once all feasible nodes have been obtained, we align each of them with every node in \(V\) to compute the edge costs.
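Since Algorithm 4 itself is not reproduced here, the following is only a minimal sketch of the threshold-bounded expansion described above, not the authors' implementation. It assumes the graph is given as an adjacency list with non-negative expected edge costs, and it swaps plain BFS for a Dijkstra-style priority queue so that the minimum expected cost per node is exact. The names `feasible_nodes`, `graph`, `candidates` and `tau` are hypothetical.

```python
import heapq
from typing import Dict, Hashable, Iterable, List, Set, Tuple

# Adjacency list: node -> [(neighbour, expected edge cost)]; costs assumed >= 0.
Graph = Dict[Hashable, List[Tuple[Hashable, float]]]


def feasible_nodes(graph: Graph, keywords: Iterable[Hashable],
                   candidates: Set[Hashable], tau: float) -> Dict[Hashable, float]:
    """Return candidate nodes whose minimum expected cost from some keyword is < tau."""
    best: Dict[Hashable, float] = {}
    heap: List[Tuple[float, int, Hashable]] = []
    tie = 0  # tie-breaker so the heap never has to compare node objects
    for k in keywords:  # all keyword nodes act as seeds at cost 0
        best[k] = 0.0
        heap.append((0.0, tie, k))
        tie += 1
    heapq.heapify(heap)

    while heap:
        cost, _, node = heapq.heappop(heap)
        if cost > best[node]:
            continue  # stale entry: a cheaper path to this node was found already
        for nxt, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost >= tau:
                continue  # beyond the threshold: do not expand in this direction
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, tie, nxt))
                tie += 1

    return {n: c for n, c in best.items() if n in candidates and c < tau}
```

In this sketch only the nodes returned by `feasible_nodes` would then be passed to the alignment step, so the number of pairs examined depends on the size of the neighbourhood within \(\tau \) of the keywords rather than on \(|V'|\) as a whole.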

Cite this article

Yan, Z., Zheng, N., Ives, Z.G. et al. Active learning in keyword search-based data integration. The VLDB Journal 24, 611–631 (2015). https://doi.org/10.1007/s00778-014-0374-x

  • DOI: https://doi.org/10.1007/s00778-014-0374-x
