Abstract
Entity alignment is the problem of identifying which entities in one data source refer to the same real-world entity in other data sources. Identifying entities across heterogeneous data sources is central to many research fields, such as data cleaning, data integration, information retrieval and machine learning. The alignment process is expensive for large data sources, since it involves all tuples from two or more data sources, and it must also handle heterogeneous entity attributes. In this paper, we propose an unsupervised approach, called EnAli, to match entities across two or more heterogeneous data sources. EnAli employs a generative probabilistic model that incorporates heterogeneous entity attributes through the exponential family and handles missing values, and it uses a locality-sensitive hashing scheme to reduce the number of candidate tuples and speed up the alignment process. EnAli is highly accurate and efficient even without any ground-truth tuples. We illustrate the performance of EnAli on re-identifying entities within the same data source, as well as on aligning entities across three real data sources. Our experimental results show that the proposed approach outperforms the comparable baseline.
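The abstract does not specify how the locality-sensitive hashing step is realized. As a rough illustration of how LSH can shrink the set of candidate tuples before any pairwise comparison, the following is a minimal sketch of MinHash with banding over character 3-grams. It is not EnAli's implementation: the function names, the shingle size, and the `num_hashes` and `bands` parameters are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lower-cased string."""
    s = text.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given integer seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash of the token set per seed."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def candidate_pairs(records, num_hashes=32, bands=16):
    """Return pairs of record ids that share at least one LSH band bucket.

    `records` maps a record id to a string (hypothetical input format).
    Only pairs returned here would be passed to a more expensive matcher.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Records with high Jaccard similarity between their shingle sets are likely to collide in at least one band, while dissimilar records rarely do, so the quadratic all-pairs comparison is replaced by comparisons within buckets. More bands with fewer rows each raise recall at the cost of more candidates.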
Acknowledgements
This work was supported by the National Key Research and Development Program of China (2016YFB1000905) and the National Natural Science Foundation of China (Grant Nos. U1401256, 61402177, 61672234, 61402180 and 61232002). It was also supported by NSF of Shanghai (14ZR1412600).
Author information
Chao Kong is a PhD candidate majoring in Computer Science and Technology at East China Normal University, China. He received his Bachelor's and Master's degrees from Anhui Normal University, China in 2008 and 2012, respectively. His research interests include Web data management and data mining.
Ming Gao is an associate professor at the Institute for Data Science and Engineering, East China Normal University (ECNU), China. Prior to joining ECNU, he worked as a postdoctoral fellow at LARC in the School of Information Systems, Singapore Management University, Singapore. He received his PhD degree from the School of Computer Science, Fudan University, China in 2011. His research interests include uncertain data management, streaming data processing, social network analysis and data mining. His work appears in major international journals and conferences, including TKDE, DMKD, SIGIR, ICDE, ICDM, and DASFAA.
Chen Xu is a senior researcher in the Database Systems and Information Management (DIMA) Group, Technische Universität Berlin, Germany. He received his PhD degree from East China Normal University, China in 2014 and his Bachelor's degree from Hefei University of Technology, China in 2009. His research interest is large-scale distributed data management.
Yunbin Fu is a postdoctoral researcher at the Institute for Data Science and Engineering, East China Normal University, China. He received his PhD in applied mathematics from Shanghai University, China in 2013. His research interests include data science and machine learning.
Weining Qian is currently a professor of computer science at East China Normal University, China. He received his MS and PhD degrees in computer science from Fudan University, China in 2001 and 2004, respectively. He served as the co-chair of the WISE 2012 Challenge and as a program committee member of several international conferences, including ICDE 2009/2010/2012 and KDD 2013. His research interests include Web data management and mining of massive data sets.
Aoying Zhou is a professor of computer science at East China Normal University (ECNU), China, where he heads the Institute of Massive Computing. He is a winner of the National Science Fund for Distinguished Young Scholars supported by NSFC and holds a professorship appointment under the Changjiang Scholarship Program of the Ministry of Education. Before joining ECNU in 2008, he worked in the Computer Science Department of Fudan University from 1993 to 2007, where he served as the department chair from 1999 to 2002. He was a visiting scholar under the Berkeley Scholar Program at UC Berkeley in 2005. He is now the vice-director of ACM SIGMOD China and of the Technology Committee on Database of the China Computer Federation. He serves on the editorial boards of several academic journals, such as VLDB Journal and WWW Journal. His research interests include Web data management, data management for data-intensive computing, and in-memory data analytics.
Cite this article
Kong, C., Gao, M., Xu, C. et al. EnAli: entity alignment across multiple heterogeneous data sources. Front. Comput. Sci. 13, 157–169 (2019). https://doi.org/10.1007/s11704-017-6561-3
DOI: https://doi.org/10.1007/s11704-017-6561-3