DOI: 10.1145/3219819.3219859
research-article

Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop.

Published: 19 July 2018

Abstract

AMiner is a free online academic search and mining system that has collected more than 130,000,000 researcher profiles and over 200,000,000 papers from multiple publication databases [25].
In this paper, we present the implementation and deployment of name disambiguation, a core component of AMiner. The problem has been studied for decades but remains largely unsolved. In AMiner, we conducted a systematic investigation of the problem and propose a comprehensive framework to address it. We propose a novel representation learning method that incorporates both global and local information, and present an end-to-end cluster size estimation method that is significantly better than the traditional BIC-based method. To improve accuracy, we involve human annotators in the disambiguation process. We carefully evaluate the proposed framework on large real-world data, and experimental results show that the proposed solution achieves clearly better performance (+7-35% in terms of F1-score) than several state-of-the-art methods, including GHOST [5], Zhang et al. [33], and Louppe et al. [17].
Finally, the algorithm has been deployed in AMiner to handle the disambiguation problem at the billion scale, which further demonstrates both the effectiveness and the efficiency of the proposed framework.
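
The abstract reports its gains as F1-score but does not say which variant is used. As a minimal sketch, assuming the pairwise F1 that is common in name disambiguation work (an assumption, not confirmed by this page), the following Python compares a predicted clustering of papers against ground-truth author labels. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of index pairs (i, j), i < j, that share a label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f1(true_labels, pred_labels):
    """Pairwise precision, recall, and F1 between two clusterings.

    Assumption: a pair of papers counts as correct when both clusterings
    place the two papers in the same cluster.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    true_pairs = same_cluster_pairs(true_labels)
    pred_pairs = same_cluster_pairs(pred_labels)
    correct = len(true_pairs & pred_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy data: five papers signed with one name, two real authors.
true_authors = ["a", "a", "a", "b", "b"]
predicted = [0, 0, 1, 1, 1]
precision, recall, f1 = pairwise_f1(true_authors, predicted)
print(precision, recall, f1)
```

Pair enumeration is quadratic in the number of papers per name, which is acceptable for a single ambiguous name but would need blocking or sampling at the scale the abstract describes.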

Supplementary Material

MP4 File (zhang_aminer.mp4)

References

[1]
Ron Bekkerman and Andrew McCallum. 2005. Disambiguating Web Appearances of People in a Social Network. In WWW'05. 463--470.
[2]
Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. 2016. Neural Combinatorial Optimization with Reinforcement Learning. CoRR abs/1611.09940 (2016).
[3]
Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. 2009. Swoosh: a generic approach to entity resolution. The VLDB Journal 18, 1 (2009), 255--276.
[4]
Indrajit Bhattacharya and Lise Getoor. 2007. Collective entity resolution in relational data. TKDD 1, 1 (2007), 5.
[5]
Xiaoming Fan, Jianyong Wang, Xu Pu, Lizhu Zhou, and Bing Lv. 2011. On graph-based name disambiguation. JDIQ 2, 2 (2011), 10.
[6]
Luis Galárraga, Geremy Heitz, Kevin Murphy, and Fabian M Suchanek. 2014. Canonicalizing open knowledge bases. In CIKM'14. ACM, 1679--1688.
[7]
Christophe Giraud. 2014. Introduction to high-dimensional statistics. Vol. 138. CRC Press.
[8]
Hui Han, Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. 2004. Two supervised learning approaches for name disambiguation in author citations. In JCDL'04. 296--305.
[9]
Linus Hermansson, Tommi Kerola, Fredrik Johansson, Vinay Jethava, and Devdatt Dubhashi. 2013. Entity disambiguation in anonymized graphs using graph kernels. In CIKM'13. 1037--1046.
[10]
Jian Huang, Seyda Ertekin, and C. Lee Giles. 2006. Efficient name disambiguation for large-scale databases. In PKDD'06. Springer, 536--544.
[11]
Lili Jiang, Jianyong Wang, Ning An, Shengyuan Wang, Jian Zhan, and Lian Li. 2009. Grape: A graph-based framework for disambiguating people appearances in web search. In ICDM'09. 199--208.
[12]
Pallika H. Kanani, Andrew McCallum, and Chris Pal. 2007. Improving Author Coreference by Resource-Bounded Information Gathering from the Web. In IJCAI. 429--434.
[13]
Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. (2016).
[14]
Thomas N. Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
[15]
Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics (NRL) 2, 1-2 (1955), 83--97.
[16]
Xin Li, Paul Morie, and Dan Roth. 2004. Identification and tracing of ambiguous names: Discriminative and generative approaches. In AAAI'04. 419--424.
[17]
Gilles Louppe, Hussein T. Al-Natsheh, Mateusz Susik, and Eamonn James Maguire. 2016. Ethnicity sensitive author disambiguation using semi-supervised learning. In KESW'16. 272--287.
[18]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS'13. 3111--3119.
[19]
Dan Pelleg and Andrew W. Moore. 2000. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In ICML'00. 727--734.
[20]
Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In CVPR'15. 815--823.
[21]
Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2011. Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, 793--803.
[22]
Rebecca C. Steorts, Samuel L. Ventura, Mauricio Sadinle, and Stephen E. Fienberg. 2014. A Comparison of Blocking Methods for Record Linkage. In PSD'14. 253--268.
[23]
Jie Tang, A. C. M. Fong, Bo Wang, and Jing Zhang. 2012. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE TKDE 24, 6 (2012), 975--987.
[24]
Jie Tang, Limin Yao, Duo Zhang, and Jing Zhang. 2010. A Combination Approach to Web User Profiling. ACM TKDD 5, 1 (2010), 1--44.
[25]
Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. Arnet-Miner: Extraction and Mining of Academic Social Networks. In KDD'08. 990--998.
[26]
Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and Tie-Yan Liu. 2014. Learning deep representations for graph clustering. In AAAI'14. 1293--1299.
[27]
David Wagner. 2002. A generalized birthday problem. In CRYPTO'02. 288--304.
[28]
Michael Wick, Ari Kobren, and Andrew McCallum. 2013. Probabilistic Reasoning about Human Edits in Information Integration. In ICML Workshop: Machine Learning Meets Crowdsourcing, Atlanta.
[29]
Michael Wick, Sameer Singh, and Andrew McCallum. 2012. A discriminative hierarchical model for fast coreference at large scale. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers Volume 1. Association for Computational Linguistics, 379--388.
[30]
Jianmin Wu, Tea Vallenius, Kristian Ovaska, Jukka Westermarck, Tomi P. Mäkelä, and Sampsa Hautaniemi. 2009. Integrated network analysis platform for protein-protein interactions. Nature methods 6, 1 (2009), 75--77.
[31]
Xiaoxin Yin, Jiawei Han, and Philip S Yu. 2007. Object distinction: Distinguishing objects with identical names. In ICDE'07. 1242--1246.
[32]
Minoru Yoshida, Masaki Ikeda, Shingo Ono, Issei Sato, and Hiroshi Nakagawa. 2010. Person name disambiguation by bootstrapping. In SIGIR'10. ACM, 10--17.
[33]
Baichuan Zhang and Mohammad Al Hasan. 2017. Name disambiguation in anonymized graphs using network embedding. In CIKM'17. 1239--1248.
[34]
Yutao Zhang, Jie Tang, Zhilin Yang, Jian Pei, and Philip S. Yu. 2015. Cosnet: Connecting heterogeneous social networks with local and global consistency. In KDD'15. 1485--14.



      Published In

      KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
      July 2018
      2925 pages
      ISBN:9781450355520
      DOI:10.1145/3219819
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. clustering
      2. entity resolution
      3. metric learning
      4. name disambiguation

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • National High-tech R&D Program

      Conference

      KDD '18
      Acceptance Rates

      KDD '18 Paper Acceptance Rate 107 of 983 submissions, 11%;
      Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

      Article Metrics

      • Downloads (Last 12 months): 84
      • Downloads (Last 6 weeks): 14
      Reflects downloads up to 28 Feb 2025

      Cited By

      • (2025) Author name disambiguation based on heterogeneous graph neural network. PLOS ONE 20:2 (e0310992). DOI: 10.1371/journal.pone.0310992. Online publication date: 26-Feb-2025
      • (2024) Author Name Disambiguation via Paper Association Refinement and Compositional Contrastive Embedding. Proceedings of the ACM Web Conference 2024 (2193-2203). DOI: 10.1145/3589334.3645596. Online publication date: 13-May-2024
      • (2024) BOND: Bootstrapping From-Scratch Name Disambiguation with Multi-task Promoting. Proceedings of the ACM Web Conference 2024 (4216-4226). DOI: 10.1145/3589334.3645580. Online publication date: 13-May-2024
      • (2024) Preparation of Publication Data for Mining in Scientific Review Process Management. 2024 IEEE 25th International Conference of Young Professionals in Electron Devices and Materials (EDM) (2620-2623). DOI: 10.1109/EDM61683.2024.10615203. Online publication date: 28-Jun-2024
      • (2024) An Effective Author Name Disambiguation Framework for Large-Scale Publications. IEEE Access 12 (182086-182100). DOI: 10.1109/ACCESS.2024.3511037. Online publication date: 2024
      • (2024) PubMed Computed Authors in 2024: an open resource of disambiguated author names in biomedical literature. Bioinformatics 40:11. DOI: 10.1093/bioinformatics/btae672. Online publication date: 9-Nov-2024
      • (2024) A Comprehensive Survey on Deep Graph Representation Learning. Neural Networks 173 (106207). DOI: 10.1016/j.neunet.2024.106207. Online publication date: May-2024
      • (2024) A cross-domain transfer learning model for author name disambiguation on heterogeneous graph with pretrained language model. Knowledge-Based Systems 305 (112624). DOI: 10.1016/j.knosys.2024.112624. Online publication date: Dec-2024
      • (2024) Scholar's career switch from academia to industry: Mining and analysis from AMiner. Big Data Research (100441). DOI: 10.1016/j.bdr.2024.100441. Online publication date: Feb-2024
      • (2024) Towards Effective Author Name Disambiguation by Hybrid Attention. Journal of Computer Science and Technology 39:4 (929-950). DOI: 10.1007/s11390-023-2070-z. Online publication date: 1-Jul-2024
