Abstract
Record linkage is a crucial step in big data integration (BDI). It is also one of its major challenges, because a growing number of structured data sources need to be linked but do not share common attributes. Our research-in-progress aims to develop a record linkage layer that assists data scientists in integrating a variety of data sources. A structured literature review of 68 papers reveals (1) key data sets, (2) available classification algorithms (match or no match), and (3) similarity measures to consider in BDI projects. The results highlight the foundational requirements for developing the record linkage layer, such as processing unstructured attributes. As BDI emerges as a priority for industry, our work proposes a record linkage layer that provides similarity measures and integration algorithms while assisting in their selection. A record linkage layer can contribute to big data adoption in industry settings and improve the quality of big data integration processes to effectively support business decision-making.
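To make the two building blocks named in the abstract concrete, the following minimal Python sketch pairs a similarity measure with a match/no-match classification. It is an illustration under stated assumptions, not the layer proposed in the paper: the function names, the use of `difflib.SequenceMatcher` as the similarity measure, the unweighted averaging over fields, and the threshold of 0.8 are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher
from typing import Dict, Sequence


def field_similarity(a: str, b: str) -> float:
    """Return a character-based similarity in [0, 1] for two strings.

    Assumption: SequenceMatcher stands in for whichever similarity
    measure a real project would select (e.g. Jaro-Winkler, Jaccard).
    """
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(rec_a: Dict[str, str], rec_b: Dict[str, str],
                      fields: Sequence[str]) -> float:
    """Average the per-field similarities over the compared attributes."""
    if not fields:
        raise ValueError("at least one field is required for comparison")
    scores = [
        field_similarity(str(rec_a.get(f, "")), str(rec_b.get(f, "")))
        for f in fields
    ]
    return sum(scores) / len(scores)


def classify(rec_a: Dict[str, str], rec_b: Dict[str, str],
             fields: Sequence[str], threshold: float = 0.8) -> str:
    """Label a record pair as 'match' or 'no match'.

    Assumption: a fixed threshold replaces a learned classifier.
    """
    score = record_similarity(rec_a, rec_b, fields)
    return "match" if score >= threshold else "no match"


if __name__ == "__main__":
    # Hypothetical records from two sources describing the same company.
    a = {"name": "Apple Inc.", "city": "Cupertino"}
    b = {"name": "Apple Incorporated", "city": "Cupertino"}
    print(record_similarity(a, b, ["name", "city"]))
    print(classify(a, b, ["name", "city"]))
```

A threshold rule is the simplest possible classifier; a supervised model trained on labelled record pairs could take its place without changing the interface of `classify`.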
Notes
- 5. DBLP-ACM, DBLP-Scholar, Abt-Buy, Amazon-Google found here: https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution
- 6. DBLP found here: https://dblp.org/
- 7. Restaurant, Census, Cora found here: https://hpi.de/naumann/projects/repeatability/datasets/
- 8. FEBRL found here: https://recordlinkage.readthedocs.io/en/latest/ref-datasets.html
- 9. Crunchbase: https://data.crunchbase.com/docs/; Standard and Poor's 500: https://datahub.io/core/s-and-p-500-companies-financials; GLEIF: https://www.gleif.org/en/lei-data/gleif-concatenated-file/; UPC Database: https://www.upcitemdb.com/
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Kruse, F. (2019). Towards a Record Linkage Layer to Support Big Data Integration. In: Abramowicz, W., Corchuelo, R. (eds) Business Information Systems Workshops. BIS 2019. Lecture Notes in Business Information Processing, vol 373. Springer, Cham. https://doi.org/10.1007/978-3-030-36691-9_52
DOI: https://doi.org/10.1007/978-3-030-36691-9_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36690-2
Online ISBN: 978-3-030-36691-9
eBook Packages: Computer Science, Computer Science (R0)