Active learning in keyword search-based data integration

Yan, Zhepeng; Zheng, Nan; Ives, Zachary G.; Talukdar, Partha Pratim; Yu, Cong

doi:10.1007/s00778-014-0374-x

Special Issue Paper
Published: 08 January 2015

Volume 24, pages 611–631, (2015)

The VLDB Journal

Zhepeng Yan¹,
Nan Zheng¹,
Zachary G. Ives¹,
Partha Pratim Talukdar² &
…
Cong Yu³

Abstract

The problem of scaling up data integration, such that new sources can be quickly utilized as they are discovered, remains elusive: Global schemas for integrated data are difficult to develop and expand, and schema and record matching techniques are limited by the fact that data and metadata are often under-specified and must be disambiguated by data experts. One promising approach is to avoid using a global schema, and instead to develop keyword search-based data integration—where the system lazily discovers associations enabling it to join together matches to keywords, and return ranked results. The user is expected to understand the data domain and provide feedback about answers’ quality. The system generalizes such feedback to learn how to correctly integrate data. A major open challenge is that under this model, the user only sees and offers feedback on a few “top-$k$” results: This result set must be carefully selected to include answers of high relevance and answers that are highly informative when feedback is given on them. Existing systems merely focus on predicting relevance, by composing the scores of various schema and record matching algorithms. In this paper, we show how to predict the uncertainty associated with a query result’s score, as well as how informative feedback is on a given result. We build upon these foundations to develop an active learning approach to keyword search-based data integration, and we validate the effectiveness of our solution over real data from several very different domains.

Notes

Consider, e.g., the situation where users put data into comments fields because there was no appropriate column in the schema.
By default tf-idf over the tuples in the data, although other metrics such as edit distance or $n$-grams could be used.
For simplicity, we describe the outcome as if each query produces one result, although the system actually iteratively enumerates top-scoring queries, even beyond $k$ such queries, until it gets $k$ answers.

References

Agrawal, S., Chaudhuri, S., Das, G.: DBXplorer: A system for keyword-based search over relational databases. In: ICDE (2002)
Arasu, A., Götz, M., Kaushik, R.: On active learning of record matching packages. In: SIGMOD Conference, pp. 783–794 (2010)
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: A nucleus for a web of open data. In: ISWC/ASWC (2007)
Balmin, A., Hristidis, V., Papakonstantinou, Y.: ObjectRank: Authority-based keyword search in databases. In: VLDB (2004)
Bergamaschi, S., Domnori, E., Guerra, F., Trillo Lado, R., Velegrakis, Y.: Keyword search over relational databases: a metadata approach. In: SIGMOD (2011)
Betteridge, J., Carlson, A., Hong, S.A., Hruschka Jr., E.R., Law, E.L.M., Mitchell, T.M., Wang, S.H.: Toward never ending language learning. In: AAAI Spring Symposium: Learning by Reading and Learning to Read (2009)
Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., Sudarshan, S.: Keyword searching and browsing in databases using BANKS. In: ICDE, pp. 431–440 (2002)
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., Singer, Y.: Online passive–aggressive algorithms. J. Mach. Learn. Res. 7, 551–585 (2006)
Craswell, N., Zoeter, O., Taylor, M.J., Ramsey, B.: An experimental comparison of click position-bias models. In: WSDM, pp. 87–94 (2008)
Culotta, A., McCallum, A.: Reducing labeling effort for structured prediction tasks. In: AAAI, pp. 746–751 (2005)
Deng, T., Fan, W.: On the complexity of query result diversification. Proc. VLDB Endow. 6(8), 557–588 (2013)
Do, H.H., Rahm, E.: Matching large schemas: Approaches and evaluation. Inf. Syst. 32(6), 857–885 (2007)
Doan, A., Domingos, P., Halevy, A.Y.: Reconciling schemas of disparate data sources: a machine-learning approach. In: SIGMOD (2001)
Drosou, M., Pitoura, E.: Search result diversification. SIGMOD Rec. 39(1), 41–47 (2010)
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: A survey. IEEE TKDE 19(1), 1–16 (2007)
Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci. 66(4), 614–656 (2003)
Franklin, M., Halevy, A., Maier, D.: From databases to dataspaces: a new abstraction for information management. SIGMOD Rec. 34(4), 27–33 (2005)
Gal, A.: Uncertain Schema Matching. Synth. Lect. Data Manag. 3(1), 1–97 (2011)
Gal, A., Sagi, T.: Tuning the ensemble selection process of schema matchers. Inf. Syst. 35(8), 845–859 (2010)
Gollapudi, S., Sharma, A.: An axiomatic approach for result diversification. In: Proceedings of the 18th International Conference on World Wide Web, WWW ’09 (2009)
Gravano, L., Ipeirotis, P.G., Koudas, N., Srivastava, D.: Text joins in an RDBMS for web data integration. In: WWW (2003)
Guo, F., Liu, C., Kannan, A., Minka, T., Taylor, M.J., Wang, Y.M., Faloutsos, C.: Click chain model in web search. In: WWW, pp. 11–20 (2009)
Guo, L., Shao, F., Botev, C., Shanmugasundaram, J.: XRANK: Ranked keyword search over XML documents. In: SIGMOD (2003)
He, H., Wang, H., Yang, J., Yu, P.S.: BLINKS: ranked keyword searches on graphs. In: SIGMOD (2007)
Hristidis, V., Papakonstantinou, Y.: Discover: Keyword search in relational databases. In: VLDB, pp. 670–681 (2002)
Hwa, R.: Sample selection for statistical parsing. Comput. Linguist. 30(3), 253–276 (2004)
Ilyas, I.F., Aref, W.G., Elmagarmid, A.K.: Supporting top-k join queries in relational databases. In: VLDB (2003)
Jacob, M., Ives, Z.G.: Sharing work in keyword search over databases. In: SIGMOD (2011)
Jeffery, S.R., Franklin, M.J., Halevy, A.Y.: Pay-as-you-go user feedback for dataspace systems. In: SIGMOD (2008)
Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., Karambelkar, H.: Bidirectional expansion for keyword search on graph databases. In: VLDB, pp. 505–516 (2005)
Kimelfeld, B., Sagiv, Y.: Finding and approximating top-k answers in keyword proximity search. In: PODS, pp. 173–182 (2006)
Marian, A., Bruno, N., Gravano, L.: Evaluating top-k queries over web-accessible databases. ACM Trans. Database Syst. 29(2), 319–362 (2004)
Marie, A., Gal, A.: Managing uncertainty in schema matcher ensembles. In: SUM, pp. 60–73 (2007)
Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In: ICDE (2002)
Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)
Sayyadian, M., LeKhac, H., Doan, A., Gravano, L.: Efficient keyword search across heterogeneous relational databases. In: ICDE (2007)
Settles, B.: Active Learning. Morgan and Claypool, Cambridge (2012)
Settles, B., Craven, M.: An analysis of active learning strategies for sequence labeling tasks. In: EMNLP (2008)
Settles, B., Craven, M., Ray, S.: Multiple-instance active learning. In: NIPS (2007)
Shen, S., Hu, B., Chen, W., Yang, Q.: Personalized click model through collaborative filtering. In: WSDM, pp. 323–332 (2012)
Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO: A large ontology from Wikipedia and WordNet. J. Web Sem. 6(3), 203–217 (2008)
Talukdar, P.P., Ives, Z.G., Pereira, F.: Automatically incorporating new sources in keyword search-based data integration. In: SIGMOD (2010)
Talukdar, P.P., Jacob, M., Mehmood, M.S., Crammer, K., Ives, Z.G., Pereira, F., Guha, S.: Learning to create data-integrating queries. In: VLDB (2008)
Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., Ilyas, I.F.: Guided data repair. PVLDB 4(5), 279–289 (2011)
Yan, Z., Zheng, N., Ives, Z., Talukdar, P., Yu, C.: Actively soliciting feedback for query answers in keyword search-based data integration. In: PVLDB (2013)

Acknowledgments

We thank Burr Settles for his advice on active learning, and the anonymous reviewers for their feedback. This work was funded in part by the National Science Foundation Grants IIS-1050448, IIS-1217798, IIS-0477972, IIS-0513778, CNS-0721541, and by a gift from Google. Portions of this work were done when P. Talukdar was at Carnegie Mellon University.

Author information

Authors and Affiliations

Computer and Information Science Department, University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA, 19104, USA
Zhepeng Yan, Nan Zheng & Zachary G. Ives
Room 401, SERC Indian Institute of Science, Bengaluru, 560012, India
Partha Pratim Talukdar
Google Research, 76 9th Ave, New York, NY, 10011, USA
Cong Yu

Corresponding author

Correspondence to Zhepeng Yan.

Appendix: Incremental update on source discovery

In addition to finding informative query answers and learning from users’ feedback, another challenge in keyword search-based data integration is to incrementally update the underlying model when new data sources are added [42]. This involves updating not only the base data, in the form of the schema graph, but also any materialized views that were formulated through keyword search. We wish to automatically combine new data sources into the existing schema graph and to predict edge costs, so that we can discover query trees that generate potentially useful new results for existing keyword search-based views.

In more detail, within the Q System, each keyword query can be saved as a view whose results can be revisited over time. When new data sources are connected, we seek to add, for each view, only those new alignment edges that can potentially affect the results in the view. Formally, let $G = (V, E)$ be the existing schema graph and $G' = (V', E')$ be the schema graph of the newly added sources. We also fix a view derived from a keyword search query $Q = \{K_i\}$. The goal of automatic incremental update is to derive a probability distribution over edge costs for each pair of attribute nodes $(v, v')$ where $v\in V$ and $v'\in V'$. Notice that a naive computation over all possible pairs requires examining $\Omega (|V||V'|)$ pairs, a quadratic term that does not scale well as the number of schema graph nodes becomes large. Ideally, we need a strategy that computes only the subset of possible joins that actually produce results affecting the top-$k$ answers of the existing view.

Our information need-based strategy adopts a pruning approach that limits the search space to a subset of the possible pairs. Let $C_{max}$ be the maximum expected tree cost (relevance) among all top-$k$ trees in a fixed view. We also set a threshold $\tau > C_{max}$, chosen not too large, since a larger $\tau$ admits more candidate nodes. Building upon [42], we say that a new attribute node $A$ is feasible if and only if there exists a keyword node $K_i\in Q$ such that the minimum expected cost from $K_i$ to $A$ is less than $\tau $.
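The feasibility condition can also be written as a set, which makes the saving over the naive enumeration explicit. The symbols $F$ and $\mathrm{cost}(K_i, A)$ (the minimum expected cost from $K_i$ to $A$) are notation introduced here for illustration and are not taken from the paper:

$$F = \{\, A \in V' \;:\; \min_{K_i \in Q} \mathrm{cost}(K_i, A) < \tau \,\}, \qquad |F|\,|V| \;\le\; |V'|\,|V|.$$

Under this reading, only the $|F|\,|V|$ pairs in $F \times V$ need to be aligned, rather than all $|V'|\,|V|$ pairs.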

We formalize this in Algorithm 4, which identifies all feasible attribute nodes using breadth-first search (BFS). The algorithm starts with all the keyword nodes as seeds and iteratively expands into new regions. Before a node is expanded, we check whether its expected distance to the keywords is greater than $\tau $; if so, the algorithm stops further expansion from that node. After we obtain all feasible nodes, we align each of them with every node in $V$ to compute the costs.
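The pruned expansion described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 4: the function name `feasible_nodes`, the adjacency-dictionary graph representation, and the non-negative expected edge costs are all hypothetical. Because edge costs may differ, the sketch re-enqueues a node whenever a cheaper path to it is found, so that the recorded distance is the minimum one.

```python
from collections import deque
import math


def feasible_nodes(graph, keywords, candidates, tau):
    """Return the candidate nodes whose minimum expected cost from
    some keyword node is below tau.

    graph      -- hypothetical representation: {node: {neighbor: cost}},
                  directed, with non-negative expected costs
    keywords   -- keyword nodes used as seeds (distance 0)
    candidates -- new attribute nodes to test for feasibility
    tau        -- pruning threshold
    """
    best = {k: 0.0 for k in keywords}   # cheapest known cost per node
    queue = deque(keywords)
    while queue:
        node = queue.popleft()
        # Pruning step: do not expand past a node that is already
        # farther than tau from the keywords.
        if best[node] > tau:
            continue
        for neighbor, cost in graph.get(node, {}).items():
            dist = best[node] + cost
            # Record the neighbor only if this path is strictly cheaper.
            if dist < best.get(neighbor, math.inf):
                best[neighbor] = dist
                queue.append(neighbor)
    return {n for n in candidates if best.get(n, math.inf) < tau}


# Toy graph with made-up costs, for illustration only.
graph = {
    "k1": {"a": 1.0, "b": 5.0},
    "a": {"c": 1.5},
    "b": {"d": 0.5},
    "c": {"d": 4.0},
}
print(sorted(feasible_nodes(graph, ["k1"], {"a", "b", "c", "d"}, tau=3.0)))
# prints ['a', 'c']
```

In the toy graph, `b` is reached at cost 5.0 and is therefore never expanded, and `d` is reached only through `c` at cost 6.5, so both fall outside the threshold.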

About this article

Cite this article

Yan, Z., Zheng, N., Ives, Z.G. et al. Active learning in keyword search-based data integration. The VLDB Journal 24, 611–631 (2015). https://doi.org/10.1007/s00778-014-0374-x

Received: 31 January 2014
Revised: 20 October 2014
Accepted: 18 December 2014
Published: 08 January 2015
Issue Date: October 2015
DOI: https://doi.org/10.1007/s00778-014-0374-x
