TupleRank: Ranking Discovered Content in Virtual Databases

Berlin, Jacob; Motro, Amihai

doi:10.1007/11780991_2

Jacob Berlin & Amihai Motro

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4032)

Included in the following conference series:

International Workshop on Next Generation Information Technologies and Systems


Abstract

Recently, the problem of data integration has been newly addressed by methods based on machine learning and discovery. Such methods are intended to automate, at least in part, the laborious process of information integration, by which existing data sources are incorporated in a virtual database. Essentially, these methods scan new data sources, attempting to discover possible mappings to the virtual database. Like all discovery processes, this process is intrinsically probabilistic; that is, each discovery is associated with a specific value that denotes assurance of its appropriateness. Consequently, the rows in a discovered virtual table have mixed assurance levels, with some rows being more credible than others. We argue that rows in discovered virtual databases should be ranked, and we describe a ranking method, called TupleRank, for calculating such a ranking order. Roughly speaking, TupleRank calibrates the probabilities calculated during a discovery process with historical information about the performance of the system. The work is done in the framework of the Autoplex system for discovering content for virtual databases, and initial experimentation is reported and discussed.
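The calibration idea in the abstract can be illustrated with a minimal sketch. This is not the TupleRank algorithm itself, whose details are not given on this page. It assumes a generic histogram-binning calibration: raw discovery probabilities are grouped into bins, each bin is scored by the fraction of past discoveries in it that turned out to be correct, and rows are ranked by the calibrated score. All names (`fit_calibration`, `calibrate`, `rank_rows`) and the sample data are hypothetical.

```python
from collections import defaultdict


def fit_calibration(history, n_bins=5):
    """Estimate observed accuracy per probability bin.

    history: iterable of (raw_probability, was_correct) pairs taken from
    past discoveries (hypothetical data, assumed for this sketch).
    Returns a dict mapping bin index -> fraction of correct discoveries.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p, correct in history:
        b = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the last bin
        totals[b] += 1
        hits[b] += int(correct)
    return {b: hits[b] / totals[b] for b in totals}


def calibrate(p, table, n_bins=5):
    """Map a raw probability to the historical accuracy of its bin."""
    b = min(int(p * n_bins), n_bins - 1)
    return table.get(b, p)  # fall back to the raw value for unseen bins


def rank_rows(rows, table, n_bins=5):
    """Sort (row, raw_probability) pairs by calibrated score, highest first."""
    scored = [(row, calibrate(p, table, n_bins)) for row, p in rows]
    return sorted(scored, key=lambda rs: rs[1], reverse=True)


if __name__ == "__main__":
    history = [(0.95, True), (0.90, True), (0.92, False),
               (0.50, True), (0.55, True), (0.10, False)]
    table = fit_calibration(history)
    for row, score in rank_rows([("a", 0.93), ("b", 0.52), ("c", 0.05)], table):
        print(row, round(score, 3))
```

In this toy history the system was overconfident in its highest bin, so row "b" ends up ranked above row "a" even though its raw probability is lower. That reordering is the kind of effect a calibration step is meant to produce.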



Author information

Authors and Affiliations

Information and Software Engineering Department, George Mason University, Fairfax, VA, 22030, USA
Jacob Berlin & Amihai Motro

Editor information

Editors and Affiliations

IBM Haifa Research Lab, 31905, Mount Carmel, Haifa, Israel
Opher Etzion
University of Haifa, Haifa, Israel
Tsvi Kuflik
Information and Software Engineering Department, George Mason University, Fairfax, VA, 22030, USA
Amihai Motro

About this paper

Cite this paper

Berlin, J., Motro, A. (2006). TupleRank: Ranking Discovered Content in Virtual Databases. In: Etzion, O., Kuflik, T., Motro, A. (eds) Next Generation Information Technologies and Systems. NGITS 2006. Lecture Notes in Computer Science, vol 4032. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11780991_2

DOI: https://doi.org/10.1007/11780991_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35472-7
Online ISBN: 978-3-540-35473-4
eBook Packages: Computer Science, Computer Science (R0)
