Abstract
Data acquisition is no longer a problem for organizations: much effort has gone into automating data collection and storage, giving access to a wide range of heterogeneous data sources that can support the decision-making process. Those efforts have not, however, extended to data integration, where many transformation and integration tasks, such as entity and attribute matching, remain highly manual. This is unsuitable for complex and dynamic contexts in which Information Systems must be adaptive enough to cope with the frequent addition and removal of sources. This work proposes a method for automatically inferring the appropriate data mapping of heterogeneous sources. The method supports the data integration process by providing a semantic overview of the data sources, together with quantitative measures of the confidence level. It incorporates both technical and domain knowledge and has been evaluated by implementing a prototype and applying it to genomics, a particularly dynamic and complex domain where data integration remains an open problem.
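To make the idea of attribute matching with a quantitative confidence level concrete, the following is a minimal sketch, not the method proposed in the paper. It assumes that confidence is approximated by the Jaccard similarity between the normalized value sets of two attributes. The function names, the `threshold` parameter, and the sample genomic attributes are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def normalize(values: List[str]) -> Set[str]:
    """Trim and lowercase values so that trivial differences do not matter."""
    return {v.strip().lower() for v in values}


def match_attributes(
    source: Dict[str, List[str]],
    target: Dict[str, List[str]],
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Propose, for each source attribute, the best-scoring target attribute.

    Assumption: the score is the Jaccard similarity of the value sets.
    Pairs scoring below `threshold` (a hypothetical cut-off) are dropped.
    """
    matches = []
    for s_name, s_values in source.items():
        s_set = normalize(s_values)
        best_name, best_score = None, 0.0
        for t_name, t_values in target.items():
            score = jaccard(s_set, normalize(t_values))
            if score > best_score:
                best_name, best_score = t_name, score
        if best_name is not None and best_score >= threshold:
            matches.append((s_name, best_name, round(best_score, 2)))
    return matches


# Hypothetical sample data from two genomic sources.
source = {"gene": ["BRCA1", "TP53", "EGFR"], "chrom": ["17", "17", "7"]}
target = {"gene_symbol": ["brca1", "tp53", "kras"], "chromosome": ["17", "12"]}

print(match_attributes(source, target))
```

In this sketch a pair is reported only when its score reaches the cut-off, so lowering `threshold` yields more candidate mappings with lower confidence. A real system would likely combine several measures and domain knowledge rather than rely on value overlap alone.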
Acknowledgements
This work has been supported by FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2019, the Doctoral scholarship PD/BDE/135100/2017 and European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 039479; Funding Reference: POCI-01-0247-FEDER-039479]. We also thank both the Spanish State Research Agency and the Generalitat Valenciana under the projects DataME TIN2016-80811-P, ACIF/2018/171, and PROMETEO/2018/176. Icons made by Freepik, from www.flaticon.com.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Galvão, J., Leon, A., Costa, C., Santos, M.Y., López, Ó.P. (2020). Automating Data Integration in Adaptive and Data-Intensive Information Systems. In: Themistocleous, M., Papadaki, M., Kamal, M.M. (eds) Information Systems. EMCIS 2020. Lecture Notes in Business Information Processing, vol 402. Springer, Cham. https://doi.org/10.1007/978-3-030-63396-7_2
DOI: https://doi.org/10.1007/978-3-030-63396-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63395-0
Online ISBN: 978-3-030-63396-7
eBook Packages: Computer Science (R0)