Automating Data Integration in Adaptive and Data-Intensive Information Systems

Galvão, João; Leon, Ana; Costa, Carlos; Santos, Maribel Yasmina; López, Óscar Pastor

doi:10.1007/978-3-030-63396-7_2

Part of the book series: Lecture Notes in Business Information Processing ((LNBIP,volume 402))

Included in the following conference series:

European, Mediterranean, and Middle Eastern Conference on Information Systems

2209 Accesses
4 Citations

Abstract

Data acquisition is no longer a problem for organizations, as many efforts have been performed in automating data collection and storage, providing access to a wide amount of heterogeneous data sources that can be used to support the decision-making process. Nevertheless, those efforts were not extended to the context of data integration, as many data transformation and integration tasks such as entity and attribute matching remain highly manual. This is not suitable for complex and dynamic contexts where Information Systems must be adaptative enough to mitigate the difficulties derived from the frequent addition and removal of sources. This work proposes a method for the automatic inference of the appropriate data mapping of heterogeneous sources, supporting the data integration process by providing a semantic overview of the data sources, with quantitative measures of the confidence level. The proposed method includes both technical and domain knowledge and has been evaluated through the implementation of a prototype and its application in a particularly dynamic and complex domain where data integration remains an open problem, i.e., genomics.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Data integration from traditional to big data: main features and comparisons of ETL approaches

Article 16 September 2024

Integration of Big Data: A Survey

Operationalizing and automating Data Governance

Article Open access 10 December 2022

References

Krishnan, K.: Data Warehousing in the Age of Big Data. Newnes (2013)
Google Scholar
Vaisman, A., Zimányi, E.: Data warehouses: next challenges. In: Aufaure, M.-A., Zimányi, E. (eds.) eBISS 2011. LNBIP, vol. 96, pp. 1–26. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-27358-2_1
Chapter Google Scholar
Costa, C., Santos, M.Y.: Evaluating several design patterns and trends in big data warehousing systems. In: Krogstie, J., Reijers, H.A. (eds.) CAiSE 2018. LNCS, vol. 10816, pp. 459–473. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91563-0_28
Chapter Google Scholar
Bellahsene, Z., Bonifati, A., Duchateau, F., Velegrakis, Y.: On Evaluating Schema Matching and mapping. In: Bellahsene, Z., Bonifati, A., Rahm, E. (eds.) Schema Matching and Mapping, pp. 253–291. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-16518-4_9
Chapter Google Scholar
Santos, M.Y., Costa, C., Galvão, J., Andrade, C., Pastor, O., Marcén, A.C.: Enhancing big data warehousing for efficient, integrated and advanced analytics - visionary paper. In: Cappiello, C., Ruiz, M. (eds.) CAiSE Forum 2019. LNBIP, vol. 350, pp. 215–226. Springer, Heidelberg (2019). https://doi.org/10.1007/978-3-030-21297-1_19
Chapter Google Scholar
Bernstein, P.A., Madhavan, J., Rahm, E.: Generic schema matching. Ten Years Later. PVLDB 4, 695–701 (2011)
Google Scholar
Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with cupid. In: Proceedings of the 27th International Conference on Very Large Data Bases, pp. 49–58. Morgan Kaufmann Publishers Inc., San Francisco (2001)
Google Scholar
Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y.: A comparison study on similarity and dissimilarity measures in clustering continuous data. PLoS ONE 10, e0144059 (2015). https://doi.org/10.1371/journal.pone.0144059
Article Google Scholar
Xiao, C., Wang, W., Lin, X., Shang, H.: Top-k set similarity joins. In: Proceedings of the 2009 IEEE International Conference on Data Engineering, pp. 916–927. IEEE Computer Society, Washington, DC (2009). https://doi.org/10.1109/ICDE.2009.111
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys. Doklady 10, 707 (1966)
MathSciNet Google Scholar
Jaccard, P.: Etude comparative de la distribution florale dans une portion des Alpes et du Jura. Impr. Corbaz, Lausanne (1901)
Google Scholar
Winkler, W.E.: String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage [microform]/William E. Winkler. Distributed by ERIC Clearinghouse, [Washington, D.C.] (1990)
Google Scholar
Zhu, E., Nargesian, F., Pu, K.Q., Miller, R.J.: LSH ensemble: internet-scale domain search. Proc. VLDB Endow. 9, 1185–1196 (2016). https://doi.org/10.14778/2994509.2994534
Banek, M., Vrdoljak, B., Tjoa, A.M.: Using ontologies for measuring semantic similarity in data warehouse schema matching process. In: 2007 9th International Conference on Telecommunications, pp. 227–234 (2007). https://doi.org/10.1109/CONTEL.2007.381876
Deb Nath, R.P., Hose, K., Pedersen, T.B.: Towards a programmable semantic extract-transform-load framework for semantic data warehouses. In: Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP, pp. 15–24. ACM, New York (2015). https://doi.org/10.1145/2811222.2811229
Abdellaoui, S., Nader, F.: Semantic data warehouse at the heart of competitive intelligence systems: design approach. In: 2015 6th International Conference on Information Systems and Economic Intelligence (SIIE), pp. 141–145 (2015). https://doi.org/10.1109/ISEI.2015.7358736
El Hajjamy, O., Alaoui, L., Bahaj, M.: Semantic integration of heterogeneous classical data sources in ontological data warehouse. In: Proceedings of the International Conference on Learning and Optimization Algorithms: Theory and Applications, pp. 36:1–36:8. ACM, New York (2018). https://doi.org/10.1145/3230905.3230929
Maccioni, A., Torlone, R.: KAYAK: a framework for just-in-time data preparation in a data lake. In: Krogstie, J., Reijers, H.A. (eds.) CAiSE 2018. LNCS, vol. 10816, pp. 474–489. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91563-0_29
Chapter Google Scholar
Hai, R., Geisler, S., Quix, C.: Constance: an intelligent data lake system. In: Proceedings of the 2016 International Conference on Management of Data, pp. 2097–2100. ACM, New York (2016). https://doi.org/10.1145/2882903.2899389

Download references

Acknowledgements

This work has been supported by FCT – Fundação para a Ciên-cia e Tecnologia within the Project Scope: UID/CEC/00319/2019, the Doctoral scholarship PD/BDE/135100/2017 and European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 039479; Funding Reference: POCI-01-0247-FEDER-039479]. We also thank both the Spanish State Research Agency and the Generalitat Valenciana under the projects DataME TIN2016-80811-P, ACIF/2018/171, and PROMETEO/2018/176. Icons made by Freepik, from www.flaticon.com.

Author information

Authors and Affiliations

ALGORITMI Research Centre, University of Minho, Guimarães, Portugal
João Galvão, Carlos Costa & Maribel Yasmina Santos
Research Center on Software Production Methods (PROS), Universitat Politècnica de València, Valencia, Spain
Ana Leon & Óscar Pastor López

Authors

João Galvão
View author publications
You can also search for this author in PubMed Google Scholar
Ana Leon
View author publications
You can also search for this author in PubMed Google Scholar
Carlos Costa
View author publications
You can also search for this author in PubMed Google Scholar
Maribel Yasmina Santos
View author publications
You can also search for this author in PubMed Google Scholar
Óscar Pastor López
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to João Galvão .

Editor information

Editors and Affiliations

Department of Digital Innovation, School of Business, University of Nicosia, Nicosia, Cyprus
Marinos Themistocleous
British University in Dubai, Dubai, United Arab Emirates
Maria Papadaki
School of Strategy and Leadership, Coventry University, Coventry, UK
Muhammad Mustafa Kamal

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Galvão, J., Leon, A., Costa, C., Santos, M.Y., López, Ó.P. (2020). Automating Data Integration in Adaptive and Data-Intensive Information Systems. In: Themistocleous, M., Papadaki, M., Kamal, M.M. (eds) Information Systems. EMCIS 2020. Lecture Notes in Business Information Processing, vol 402. Springer, Cham. https://doi.org/10.1007/978-3-030-63396-7_2

Download citation

DOI: https://doi.org/10.1007/978-3-030-63396-7_2
Published: 21 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63395-0
Online ISBN: 978-3-030-63396-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics