Abstract
We study a novel automatic method for detecting corresponding attributes in schemas based on content data. More specifically, the proposed method for detecting coreferent attributes is based on a statistical and lexical comparison of content data and on coreferent tuples detected across multiple datasets, which increases the possibility of correct schema matching. We show that knowledge of even a small number of coreferent tuples is sufficient to establish a correct matching between corresponding attributes of heterogeneous schemas. The behaviour of the schema matching technique has been evaluated on several real-life datasets, giving valuable insight into the influence of the different parameters of our approach on the results obtained.
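The core idea, matching attributes by comparing the values of tuples already known to be coreferent, can be illustrated with a minimal sketch. This is not the chapter's method: the similarity measure (`difflib.SequenceMatcher`), the function names `similarity` and `match_attributes`, and the example tuples are all assumptions made for illustration, and the statistical part of the comparison is omitted.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; an assumed stand-in, not the chapter's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_attributes(coreferent_pairs):
    """Map each source attribute to the target attribute whose values agree
    best, on average, over the known coreferent tuple pairs."""
    source_attrs = coreferent_pairs[0][0].keys()
    target_attrs = coreferent_pairs[0][1].keys()
    matching = {}
    for s in source_attrs:
        # Average value similarity between attribute s and every target attribute.
        scores = {
            t: sum(similarity(str(src[s]), str(tgt[t]))
                   for src, tgt in coreferent_pairs) / len(coreferent_pairs)
            for t in target_attrs
        }
        matching[s] = max(scores, key=scores.get)
    return matching


# Hypothetical coreferent tuples: (source tuple, target tuple) describing the same album.
pairs = [
    ({"artist": "Miles Davis", "title": "Kind of Blue"},
     {"name": "Kind Of Blue", "performer": "Davis, Miles"}),
    ({"artist": "Nina Simone", "title": "Pastel Blues"},
     {"name": "Pastel Blues", "performer": "Nina Simone"}),
]
matching = match_attributes(pairs)
print(matching)
```

On these two invented pairs the sketch pairs `artist` with `performer` and `title` with `name`, even though the attribute names themselves share nothing, which is the intuition behind using only a few coreferent tuples.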
Notes
1. The order of datasets does not matter, i.e., there exists a schema matching between corresponding attributes from the source dataset and the target dataset, and vice versa.
2. FreeDB, http://www.freedb.org/.
3. Discogs, http://www.discogs.com/data/.
4.
5. Discogs, http://www.discogs.com/data/.
6.
7. Google Places, http://developers.google.com/places/.
Acknowledgments
This contribution is supported by the Foundation for Polish Science under International PhD Projects in Intelligent Computing. Project financed from The European Union within the Innovative Economy Operational Programme 2007–2013 and European Regional Development Fund. This work was also partially supported by the National Science Centre (contract no. UMO-2011/01/B/ST6/06908).
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Szymczak, M., Bronselaer, A., Zadrożny, S., De Tré, G. (2016). Content Data Based Schema Matching. In: De Tré, G., Grzegorzewski, P., Kacprzyk, J., Owsiński, J., Penczek, W., Zadrożny, S. (eds) Challenging Problems and Solutions in Intelligent Systems. Studies in Computational Intelligence, vol 634. Springer, Cham. https://doi.org/10.1007/978-3-319-30165-5_14
DOI: https://doi.org/10.1007/978-3-319-30165-5_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30164-8
Online ISBN: 978-3-319-30165-5
eBook Packages: Engineering, Engineering (R0)