Abstract
In this paper we propose a semi-automatic technique for deriving the similarity degree between two portions of heterogeneous information sources (hereafter, sub-sources). The proposed technique consists of two phases: the first selects the most promising pairs of sub-sources, whereas the second computes the similarity degree of each promising pair. We show that the detection of sub-source similarities is a special case (and a very interesting one for semi-structured information sources) of the more general problem of Scheme Match. In addition, we present a real example case to clarify the proposed technique, a set of experiments we conducted to verify the quality of its results, a discussion of its computational complexity, and its classification in the context of the related literature. Finally, we discuss some possible applications that can benefit from the derived similarities.
Cite this article
Rosaci, D., Terracina, G. & Ursino, D. A Technique for Extracting Sub-source Similarities from Information Sources Having Different Formats. World Wide Web 6, 375–399 (2003). https://doi.org/10.1023/A:1025614005307
DOI: https://doi.org/10.1023/A:1025614005307