Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop.

Authors:

Jie Tang

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 1002 - 1011

https://doi.org/10.1145/3219819.3219859

Published: 19 July 2018

Abstract

AMiner is a free online academic search and mining system that has collected more than 130,000,000 researcher profiles and over 200,000,000 papers from multiple publication databases [25].

In this paper, we present the implementation and deployment of name disambiguation, a core component of AMiner. The problem has been studied for decades but remains largely unsolved. We conducted a systematic investigation of the problem and propose a comprehensive framework to address it. We propose a novel representation learning method that incorporates both global and local information, and present an end-to-end cluster size estimation method that is significantly better than the traditional BIC-based method. To improve accuracy, we involve human annotators in the disambiguation process. We carefully evaluate the proposed framework on large real-world data, and experimental results show that the proposed solution achieves clearly better performance (+7-35% in terms of F1-score) than several state-of-the-art methods, including GHOST [5], Zhang et al. [33], and Louppe et al. [17].

Finally, the algorithm has been deployed in AMiner to handle the disambiguation problem at the billion scale, which further demonstrates both the effectiveness and the efficiency of the proposed framework.
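The abstract reports gains in F1-score without stating which variant is used. As a minimal sketch, assuming a pairwise formulation (a common choice when name disambiguation is treated as clustering), the score could be computed as follows. The function names `same_author_pairs` and `pairwise_f1` and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def same_author_pairs(labels):
    """Return the set of paper-index pairs that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f1(true_labels, pred_labels):
    """Pairwise precision, recall, and F1 between two clusterings.

    Assumption: a pair of papers counts as correct when both the
    ground truth and the prediction assign it to the same author.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    true_pairs = same_author_pairs(true_labels)
    pred_pairs = same_author_pairs(pred_labels)
    correct = len(true_pairs & pred_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: five papers under one ambiguous name, two real authors.
truth = ["a", "a", "a", "b", "b"]
predicted = [0, 0, 1, 1, 1]
print(pairwise_f1(truth, predicted))  # (0.5, 0.5, 0.5)
```

In the toy example each clustering contains four same-author pairs and two of them coincide, so precision, recall, and F1 are all 0.5. Cluster label values need not match between the two lists, since only pair membership is compared.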

Supplementary Material

MP4 File (zhang_aminer.mp4), 289.21 MB

References

[1] Ron Bekkerman and Andrew McCallum. 2005. Disambiguating Web Appearances of People in a Social Network. In WWW'05. 463--470.

[2] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. 2016. Neural Combinatorial Optimization with Reinforcement Learning. CoRR abs/1611.09940 (2016).

[3] Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. 2009. Swoosh: a generic approach to entity resolution. The VLDB Journal 18, 1 (2009), 255--276.

[4] Indrajit Bhattacharya and Lise Getoor. 2007. Collective entity resolution in relational data. TKDD 1, 1 (2007), 5.

[5] Xiaoming Fan, Jianyong Wang, Xu Pu, Lizhu Zhou, and Bing Lv. 2011. On graph-based name disambiguation. JDIQ 2, 2 (2011), 10.

[6] Luis Galárraga, Geremy Heitz, Kevin Murphy, and Fabian M Suchanek. 2014. Canonicalizing open knowledge bases. In CIKM'14. ACM, 1679--1688.

[7] Christophe Giraud. 2014. Introduction to high-dimensional statistics. Vol. 138. CRC Press.

[8] Hui Han, Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. 2004. Two supervised learning approaches for name disambiguation in author citations. In JCDL'04. 296--305.

[9] Linus Hermansson, Tommi Kerola, Fredrik Johansson, Vinay Jethava, and Devdatt Dubhashi. 2013. Entity disambiguation in anonymized graphs using graph kernels. In CIKM'13. 1037--1046.

[10] Jian Huang, Seyda Ertekin, and C. Lee Giles. 2006. Efficient name disambiguation for large-scale databases. In PKDD'06. Springer, 536--544.

[11] Lili Jiang, Jianyong Wang, Ning An, Shengyuan Wang, Jian Zhan, and Lian Li. 2009. Grape: A graph-based framework for disambiguating people appearances in web search. In ICDM'09. 199--208.

[12] Pallika H. Kanani, Andrew McCallum, and Chris Pal. 2007. Improving Author Coreference by Resource-Bounded Information Gathering from the Web. In IJCAI. 429--434.

[13] Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. (2016).

[14] Thomas N. Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).

[15] Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics (NRL) 2, 1-2 (1955), 83--97.

[16] Xin Li, Paul Morie, and Dan Roth. 2004. Identification and tracing of ambiguous names: Discriminative and generative approaches. In AAAI'04. 419--424.

[17] Gilles Louppe, Hussein T. Al-Natsheh, Mateusz Susik, and Eamonn James Maguire. 2016. Ethnicity sensitive author disambiguation using semi-supervised learning. In KESW'16. 272--287.

[18] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS'13. 3111--3119.

[19] Dan Pelleg and Andrew W. Moore. 2000. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In ICML'00. 727--734.

[20] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In CVPR'15. 815--823.

[21] Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2011. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, 793--803.

[22] Rebecca C. Steorts, Samuel L. Ventura, Mauricio Sadinle, and Stephen E. Fienberg. 2014. A Comparison of Blocking Methods for Record Linkage. In PSD'14. 253--268.

[23] Jie Tang, A. C. M. Fong, Bo Wang, and Jing Zhang. 2012. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE TKDE 24, 6 (2012), 975--987.

[24] Jie Tang, Limin Yao, Duo Zhang, and Jing Zhang. 2010. A Combination Approach to Web User Profiling. ACM TKDD 5, 1 (2010), 1--44.

[25] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. Arnet-Miner: Extraction and Mining of Academic Social Networks. In KDD'08. 990--998.

[26] Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and Tie-Yan Liu. 2014. Learning deep representations for graph clustering. In AAAI'14. 1293--1299.

[27] David Wagner. 2002. A generalized birthday problem. In Crypto'02. 288--304.

[28] Michael Wick, Ari Kobren, and Andrew McCallum. 2013. Probabilistic Reasoning about Human Edits in Information Integration. In ICML Workshop: Machine Learning Meets Crowdsourcing, Atlanta.

[29] Michael Wick, Sameer Singh, and Andrew McCallum. 2012. A discriminative hierarchical model for fast coreference at large scale. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers Volume 1. Association for Computational Linguistics, 379--388.

[30] Jianmin Wu, Tea Vallenius, Kristian Ovaska, Jukka Westermarck, Tomi P. Mäkelä, and Sampsa Hautaniemi. 2009. Integrated network analysis platform for protein-protein interactions. Nature methods 6, 1 (2009), 75--77.

[31] Xiaoxin Yin, Jiawei Han, and Philip S Yu. 2007. Object distinction: Distinguishing objects with identical names. In ICDE'07. 1242--1246.

[32] Minoru Yoshida, Masaki Ikeda, Shingo Ono, Issei Sato, and Hiroshi Nakagawa. 2010. Person name disambiguation by bootstrapping. In SIGIR'10. ACM, 10--17.

[33] Baichuan Zhang and Mohammad Al Hasan. 2017. Name disambiguation in anonymized graphs using network embedding. In CIKM'17. 1239--1248.

[34] Yutao Zhang, Jie Tang, Zhilin Yang, Jian Pei, and Philip S. Yu. 2015. Cosnet: Connecting heterogeneous social networks with local and global consistency. In KDD'15. 1485--14.

Index Terms

Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop.
1. Information systems
  1. Data management systems
    1. Information integration
      1. Entity resolution
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

July 2018

2925 pages

ISBN:9781450355520

DOI:10.1145/3219819

General Chairs: Yike Guo (Imperial College London), Faisal Farooq (IBM)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
National High-tech R&D Program

Conference

KDD '18

KDD '18: The 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 19 - 23, 2018

London, United Kingdom

Acceptance Rates

KDD '18 Paper Acceptance Rate 107 of 983 submissions, 11%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

Wang G, Sun Z, Hu W, Cai M. (2025). Author name disambiguation based on heterogeneous graph neural network. PLOS ONE 20:2 (e0310992). Online publication date: 26-Feb-2025. https://doi.org/10.1371/journal.pone.0310992

Liu D, Zhang R, Chen J, Chen X, Chua T, Ngo C, Ka-Wei Lee R, Kumar R, Lauw H. (2024). Author Name Disambiguation via Paper Association Refinement and Compositional Contrastive Embedding. Proceedings of the ACM Web Conference 2024 (2193-2203). Online publication date: 13-May-2024. https://dl.acm.org/doi/10.1145/3589334.3645596

Cheng Y, Chen B, Zhang F, Tang J, Chua T, Ngo C, Ka-Wei Lee R, Kumar R, Lauw H. (2024). BOND: Bootstrapping From-Scratch Name Disambiguation with Multi-task Promoting. Proceedings of the ACM Web Conference 2024 (4216-4226). Online publication date: 13-May-2024. https://dl.acm.org/doi/10.1145/3589334.3645580

Latypova V. (2024). Preparation of Publication Data for Mining in Scientific Review Process Management. 2024 IEEE 25th International Conference of Young Professionals in Electron Devices and Materials (EDM) (2620-2623). Online publication date: 28-Jun-2024. https://doi.org/10.1109/EDM61683.2024.10615203

Zhou A, Shi M, Yuan R. (2024). An Effective Author Name Disambiguation Framework for Large-Scale Publications. IEEE Access 12 (182086-182100). Online publication date: 2024. https://doi.org/10.1109/ACCESS.2024.3511037

Tian S, Chen Q, Comeau D, Wilbur W, Lu Z. (2024). PubMed Computed Authors in 2024: an open resource of disambiguated author names in biomedical literature. Bioinformatics 40:11. Online publication date: 9-Nov-2024. https://doi.org/10.1093/bioinformatics/btae672

Ju W, Fang Z, Gu Y, Liu Z, Long Q, Qiao Z, Qin Y, Shen J, Sun F, Xiao Z, Yang J, Yuan J, Zhao Y, Wang Y, Luo X, Zhang M. (2024). A Comprehensive Survey on Deep Graph Representation Learning. Neural Networks 173 (106207). Online publication date: May-2024. https://doi.org/10.1016/j.neunet.2024.106207

Huang Z, Zhang H, Hao C, Yang H, Wu H. (2024). A cross-domain transfer learning model for author name disambiguation on heterogeneous graph with pretrained language model. Knowledge-Based Systems 305 (112624). Online publication date: Dec-2024. https://doi.org/10.1016/j.knosys.2024.112624

Shao Z, Yuan S, Jin Y, Wang Y. (2024). Scholar's career switch from academia to industry: Mining and analysis from AMiner. Big Data Research (100441). Online publication date: Feb-2024. https://doi.org/10.1016/j.bdr.2024.100441

Zhou Q, Chen W, Zhao P, Liu A, Xu J, Qu J, Zhao L. (2024). Towards Effective Author Name Disambiguation by Hybrid Attention. Journal of Computer Science and Technology 39:4 (929-950). Online publication date: 1-Jul-2024. https://dl.acm.org/doi/10.1007/s11390-023-2070-z
