Abstract
Recently, the problem of data integration has been addressed by new methods based on machine learning and discovery. Such methods are intended to automate, at least in part, the laborious process of information integration, by which existing data sources are incorporated into a virtual database. Essentially, these methods scan new data sources, attempting to discover possible mappings to the virtual database. Like all discovery processes, this process is intrinsically probabilistic; that is, each discovery is associated with a value that denotes the assurance of its appropriateness. Consequently, the rows in a discovered virtual table have mixed assurance levels, with some rows being more credible than others. We argue that rows in discovered virtual databases should be ranked, and we describe a method, called TupleRank, for calculating such a ranking. Roughly speaking, TupleRank calibrates the probabilities calculated during a discovery process with historical information about the performance of the system. The work is done in the framework of the Autoplex system for discovering content for virtual databases, and initial experimentation is reported and discussed.
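The abstract does not spell out how the calibration is performed, so the following is only a minimal sketch of the general idea, assuming a simple histogram-binning scheme: raw discovery scores in [0, 1] are replaced by the fraction of past discoveries with similar scores that turned out to be correct, and rows are then sorted by the calibrated value. The function names (`fit_calibration`, `calibrate`, `rank_rows`), the binning scheme, and the sample data are all hypothetical illustrations and are not taken from the paper or from Autoplex.

```python
def fit_calibration(history, n_bins=5):
    """Build a calibration table from past discoveries.

    history: list of (raw_score, was_correct) pairs, raw_score in [0, 1].
    Returns a list of length n_bins holding the observed accuracy per bin.
    (Histogram binning is an assumption made for illustration.)
    """
    hits = [0] * n_bins
    totals = [0] * n_bins
    for score, correct in history:
        b = min(int(score * n_bins), n_bins - 1)  # bin index for this score
        totals[b] += 1
        if correct:
            hits[b] += 1
    # Bins with no history fall back to the bin midpoint.
    return [
        hits[b] / totals[b] if totals[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]


def calibrate(score, table):
    """Map a raw score to the historical accuracy of its bin."""
    n_bins = len(table)
    return table[min(int(score * n_bins), n_bins - 1)]


def rank_rows(rows, table):
    """Sort (row, raw_score) pairs by calibrated score, highest first."""
    scored = [(row, calibrate(score, table)) for row, score in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical history: (raw score reported at discovery time, was it correct?)
history = [(0.9, True), (0.95, False), (0.85, True), (0.85, True),
           (0.3, False), (0.35, True)]
table = fit_calibration(history)
ranking = rank_rows([("row b", 0.32), ("row a", 0.92)], table)
```

In this sketch a row discovered with a raw score of 0.92 is ranked by 0.75, the observed accuracy of past discoveries in the same score bin, rather than by the raw score itself.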
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Berlin, J., Motro, A. (2006). TupleRank: Ranking Discovered Content in Virtual Databases. In: Etzion, O., Kuflik, T., Motro, A. (eds) Next Generation Information Technologies and Systems. NGITS 2006. Lecture Notes in Computer Science, vol 4032. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11780991_2
DOI: https://doi.org/10.1007/11780991_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35472-7
Online ISBN: 978-3-540-35473-4
eBook Packages: Computer Science, Computer Science (R0)