EnAli: entity alignment across multiple heterogeneous data sources

Kong, Chao; Gao, Ming; Xu, Chen; Fu, Yunbin; Qian, Weining; Zhou, Aoying

doi:10.1007/s11704-017-6561-3

Research Article
Published: 09 June 2018

Volume 13, pages 157–169, (2019)
Frontiers of Computer Science

Chao Kong¹,
Ming Gao¹,
Chen Xu²,
Yunbin Fu¹,
Weining Qian¹ &
…
Aoying Zhou¹

Abstract

Entity alignment is the problem of identifying which entities in one data source refer to the same real-world entity as those in other sources. Identifying entities across heterogeneous data sources is essential to many research fields, such as data cleaning, data integration, information retrieval, and machine learning. The alignment process is not only prohibitively expensive for large data sources, since it involves all tuples from two or more data sources, but must also handle heterogeneous entity attributes. In this paper, we propose an unsupervised approach, called EnAli, to match entities across two or more heterogeneous data sources. EnAli employs a generative probabilistic model that incorporates heterogeneous entity attributes through the exponential family and handles missing values, and it uses a locality-sensitive hashing scheme to reduce the number of candidate tuples and speed up the alignment process. EnAli is highly accurate and efficient even without any ground-truth tuples. We evaluate EnAli on re-identifying entities within the same data source, as well as on aligning entities across three real data sources. Our experimental results show that the proposed approach outperforms the comparable baseline.
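
The candidate-reduction step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which locality-sensitive hashing family EnAli uses, so the code below assumes MinHash with banding over character 3-grams, a common choice for string attributes. The function names (`shingles`, `minhash_signature`, `candidate_pairs`) and the parameter values are hypothetical and are not taken from the paper.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lower-cased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items, num_hashes=32):
    """Keep the minimum hash value per seed; similar sets share many minima."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(),
                                digest_size=8).digest(),
                "big")
            for item in items))
    return signature


def candidate_pairs(records, num_hashes=32, bands=16):
    """Return pairs of record ids whose signatures collide in at least one band.

    Assumption: MinHash banding stands in for the unspecified LSH scheme.
    Only the returned pairs would be passed on to the (expensive) matcher.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            # Records land in the same bucket when a whole band agrees.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

With more bands and fewer rows per band, less similar records become candidates; with fewer bands and more rows, only near-duplicates collide. A probabilistic matcher then needs to score only the returned pairs rather than every pair of tuples across the sources.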

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2016YFB1000905) and the National Natural Science Foundation of China (Grant Nos. U1401256, 61402177, 61672234, 61402180, and 61232002). This work was also supported by the NSF of Shanghai (14ZR1412600).

Author information

Authors and Affiliations

School of Data Science and Engineering, East China Normal University, Shanghai, 200062, China
Chao Kong, Ming Gao, Yunbin Fu, Weining Qian & Aoying Zhou
Technische Universität Berlin, Berlin, 10623, Germany
Chen Xu

Corresponding author

Correspondence to Ming Gao.

Additional information

Chao Kong is a PhD candidate majoring in Computer Science and Technology at East China Normal University, China. He received his Bachelor’s and Master’s degrees from Anhui Normal University, China in 2008 and 2012, respectively. His research interests include Web data management and data mining.

Ming Gao is an associate professor at the Institute for Data Science and Engineering, East China Normal University (ECNU), China. Prior to joining ECNU, he worked as a postdoctoral fellow at LARC in the School of Information Systems, Singapore Management University, Singapore. He received his PhD degree from the School of Computer Science, Fudan University, China in 2011. His research interests include uncertain data management, streaming data processing, social network analysis, and data mining. His work appears in major international journals and conferences, including TKDE, DMKD, SIGIR, ICDE, ICDM, and DASFAA.

Chen Xu is a senior researcher in the Database Systems and Information Management (DIMA) Group, Technische Universität Berlin, Germany. He received his PhD degree from East China Normal University, China in 2014 and his Bachelor’s degree from Hefei University of Technology, China in 2009. His research interest is large-scale distributed data management.

Yunbin Fu is a postdoctoral researcher at the Institute for Data Science and Engineering, East China Normal University, China. He received his PhD in applied mathematics from Shanghai University, China in 2013. His research interests include data science and machine learning.

Weining Qian is currently a professor of computer science at East China Normal University, China. He received his MS and PhD degrees in computer science from Fudan University, China in 2001 and 2004, respectively. He served as co-chair of the WISE 2012 Challenge and as a program committee member of several international conferences, including ICDE 2009/2010/2012 and KDD 2013. His research interests include Web data management and mining of massive data sets.

Aoying Zhou is a professor of computer science at East China Normal University (ECNU), China, where he heads the Institute of Massive Computing. He is a winner of the National Science Fund for Distinguished Young Scholars supported by NSFC and holds a professorship appointment under the Changjiang Scholarship Program of the Ministry of Education. Before joining ECNU in 2008, he worked in the Computer Science Department at Fudan University from 1993 to 2007, where he served as department chair from 1999 to 2002. He was a visiting scholar under the Berkeley Scholar Program at UC Berkeley in 2005. He is now vice-director of ACM SIGMOD China and of the Technology Committee on Database of the China Computer Federation. He serves on the editorial boards of several prestigious academic journals, such as the VLDB Journal and the WWW Journal. His research interests include Web data management, data management for data-intensive computing, and in-memory data analytics.

Electronic supplementary material

Supplementary material, approximately 284 KB.

About this article

Cite this article

Kong, C., Gao, M., Xu, C. et al. EnAli: entity alignment across multiple heterogeneous data sources. Front. Comput. Sci. 13, 157–169 (2019). https://doi.org/10.1007/s11704-017-6561-3

Received: 28 November 2016
Accepted: 06 June 2017
Published: 09 June 2018
Issue Date: February 2019
DOI: https://doi.org/10.1007/s11704-017-6561-3
