Skip to main content

Automating Data Integration in Adaptive and Data-Intensive Information Systems

  • Conference paper
  • First Online:
Information Systems (EMCIS 2020)

Abstract

Data acquisition is no longer a problem for organizations, as many efforts have been performed in automating data collection and storage, providing access to a wide amount of heterogeneous data sources that can be used to support the decision-making process. Nevertheless, those efforts were not extended to the context of data integration, as many data transformation and integration tasks such as entity and attribute matching remain highly manual. This is not suitable for complex and dynamic contexts where Information Systems must be adaptative enough to mitigate the difficulties derived from the frequent addition and removal of sources. This work proposes a method for the automatic inference of the appropriate data mapping of heterogeneous sources, supporting the data integration process by providing a semantic overview of the data sources, with quantitative measures of the confidence level. The proposed method includes both technical and domain knowledge and has been evaluated through the implementation of a prototype and its application in a particularly dynamic and complex domain where data integration remains an open problem, i.e., genomics.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. Krishnan, K.: Data Warehousing in the Age of Big Data. Newnes (2013)

    Google Scholar 

  2. Vaisman, A., Zimányi, E.: Data warehouses: next challenges. In: Aufaure, M.-A., Zimányi, E. (eds.) eBISS 2011. LNBIP, vol. 96, pp. 1–26. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-27358-2_1

    Chapter  Google Scholar 

  3. Costa, C., Santos, M.Y.: Evaluating several design patterns and trends in big data warehousing systems. In: Krogstie, J., Reijers, H.A. (eds.) CAiSE 2018. LNCS, vol. 10816, pp. 459–473. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91563-0_28

    Chapter  Google Scholar 

  4. Bellahsene, Z., Bonifati, A., Duchateau, F., Velegrakis, Y.: On Evaluating Schema Matching and mapping. In: Bellahsene, Z., Bonifati, A., Rahm, E. (eds.) Schema Matching and Mapping, pp. 253–291. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-16518-4_9

    Chapter  Google Scholar 

  5. Santos, M.Y., Costa, C., Galvão, J., Andrade, C., Pastor, O., Marcén, A.C.: Enhancing big data warehousing for efficient, integrated and advanced analytics - visionary paper. In: Cappiello, C., Ruiz, M. (eds.) CAiSE Forum 2019. LNBIP, vol. 350, pp. 215–226. Springer, Heidelberg (2019). https://doi.org/10.1007/978-3-030-21297-1_19

    Chapter  Google Scholar 

  6. Bernstein, P.A., Madhavan, J., Rahm, E.: Generic schema matching. Ten Years Later. PVLDB 4, 695–701 (2011)

    Google Scholar 

  7. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with cupid. In: Proceedings of the 27th International Conference on Very Large Data Bases, pp. 49–58. Morgan Kaufmann Publishers Inc., San Francisco (2001)

    Google Scholar 

  8. Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y.: A comparison study on similarity and dissimilarity measures in clustering continuous data. PLoS ONE 10, e0144059 (2015). https://doi.org/10.1371/journal.pone.0144059

    Article  Google Scholar 

  9. Xiao, C., Wang, W., Lin, X., Shang, H.: Top-k set similarity joins. In: Proceedings of the 2009 IEEE International Conference on Data Engineering, pp. 916–927. IEEE Computer Society, Washington, DC (2009). https://doi.org/10.1109/ICDE.2009.111

  10. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys. Doklady 10, 707 (1966)

    MathSciNet  Google Scholar 

  11. Jaccard, P.: Etude comparative de la distribution florale dans une portion des Alpes et du Jura. Impr. Corbaz, Lausanne (1901)

    Google Scholar 

  12. Winkler, W.E.: String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage [microform]/William E. Winkler. Distributed by ERIC Clearinghouse, [Washington, D.C.] (1990)

    Google Scholar 

  13. Zhu, E., Nargesian, F., Pu, K.Q., Miller, R.J.: LSH ensemble: internet-scale domain search. Proc. VLDB Endow. 9, 1185–1196 (2016). https://doi.org/10.14778/2994509.2994534

  14. Banek, M., Vrdoljak, B., Tjoa, A.M.: Using ontologies for measuring semantic similarity in data warehouse schema matching process. In: 2007 9th International Conference on Telecommunications, pp. 227–234 (2007). https://doi.org/10.1109/CONTEL.2007.381876

  15. Deb Nath, R.P., Hose, K., Pedersen, T.B.: Towards a programmable semantic extract-transform-load framework for semantic data warehouses. In: Proceedings of the ACM Eighteenth International Workshop on Data Warehousing and OLAP, pp. 15–24. ACM, New York (2015). https://doi.org/10.1145/2811222.2811229

  16. Abdellaoui, S., Nader, F.: Semantic data warehouse at the heart of competitive intelligence systems: design approach. In: 2015 6th International Conference on Information Systems and Economic Intelligence (SIIE), pp. 141–145 (2015). https://doi.org/10.1109/ISEI.2015.7358736

  17. El Hajjamy, O., Alaoui, L., Bahaj, M.: Semantic integration of heterogeneous classical data sources in ontological data warehouse. In: Proceedings of the International Conference on Learning and Optimization Algorithms: Theory and Applications, pp. 36:1–36:8. ACM, New York (2018). https://doi.org/10.1145/3230905.3230929

  18. Maccioni, A., Torlone, R.: KAYAK: a framework for just-in-time data preparation in a data lake. In: Krogstie, J., Reijers, H.A. (eds.) CAiSE 2018. LNCS, vol. 10816, pp. 474–489. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91563-0_29

    Chapter  Google Scholar 

  19. Hai, R., Geisler, S., Quix, C.: Constance: an intelligent data lake system. In: Proceedings of the 2016 International Conference on Management of Data, pp. 2097–2100. ACM, New York (2016). https://doi.org/10.1145/2882903.2899389

Download references

Acknowledgements

This work has been supported by FCT – Fundação para a Ciên-cia e Tecnologia within the Project Scope: UID/CEC/00319/2019, the Doctoral scholarship PD/BDE/135100/2017 and European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 039479; Funding Reference: POCI-01-0247-FEDER-039479]. We also thank both the Spanish State Research Agency and the Generalitat Valenciana under the projects DataME TIN2016-80811-P, ACIF/2018/171, and PROMETEO/2018/176. Icons made by Freepik, from www.flaticon.com.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to João Galvão .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Galvão, J., Leon, A., Costa, C., Santos, M.Y., López, Ó.P. (2020). Automating Data Integration in Adaptive and Data-Intensive Information Systems. In: Themistocleous, M., Papadaki, M., Kamal, M.M. (eds) Information Systems. EMCIS 2020. Lecture Notes in Business Information Processing, vol 402. Springer, Cham. https://doi.org/10.1007/978-3-030-63396-7_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-63396-7_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-63395-0

  • Online ISBN: 978-3-030-63396-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics