research-article

A relevance feedback approach for the author name disambiguation problem

Authors:
Thiago A. Godoi, University of Campinas, Campinas, Brazil
Ricardo da S. Torres, University of Campinas, Campinas, Brazil
Ariadne M.B.R. Carvalho, University of Campinas, Campinas, Brazil
Marcos A. Gonçalves, Federal University of Minas Gerais, Belo Horizonte, Brazil
Anderson A. Ferreira, Federal University of Ouro Preto, Ouro Preto, Brazil
Weiguo Fan, Virginia Tech, Blacksburg, VA, USA
Edward A. Fox, Virginia Tech, Blacksburg, VA, USA

JCDL '13: Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
July 2013, Pages 209–218
https://doi.org/10.1145/2467696.2467709

Published: 22 July 2013

ABSTRACT

This paper presents a new name disambiguation method that exploits user feedback on ambiguous references across iterations. An unsupervised step is used to define pure training samples, and a hybrid supervised step is employed to learn a classification model for assigning references to authors. Our classification scheme combines the Optimum-Path Forest (OPF) classifier with complex reference similarity functions generated by a Genetic Programming framework. Experiments demonstrate that the proposed method yields better results than state-of-the-art disambiguation methods on two traditional datasets.
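
The classification step named in the abstract can be illustrated with a minimal sketch of a supervised Optimum-Path Forest classifier in its commonly described form: prototypes are taken from minimum-spanning-tree edges that join samples of different classes, path costs are propagated with the maximum-arc cost function, and a new sample takes the label of the training sample offering the cheapest path. This is an assumption-laden illustration, not the authors' implementation: the function names `train_opf` and `classify`, the toy 2-D points, and the Euclidean `dist` are all hypothetical. In the paper's setting, `dist` would be replaced by a reference dissimilarity derived from the Genetic Programming framework, whose form the abstract does not specify.

```python
import math


def euclidean(a, b):
    # Placeholder distance; an assumption standing in for a learned dissimilarity.
    return math.dist(a, b)


def train_opf(samples, labels, dist=euclidean):
    """Return (cost, label) per training sample after OPF training."""
    n = len(samples)

    # Step 1: Prim's algorithm builds a minimum spanning tree over the
    # complete graph. Endpoints of tree edges whose classes differ
    # become prototypes.
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [-1] * n
    best[0] = 0.0
    prototypes = set()
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        p = parent[u]
        if p != -1 and labels[p] != labels[u]:
            prototypes.update((p, u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(samples[u], samples[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    if not prototypes:
        # Single-class training set: fall back to one arbitrary prototype.
        prototypes.add(0)

    # Step 2: propagate path costs from the prototypes. The cost of a
    # path is the largest arc weight along it (the f_max cost function).
    cost = [math.inf] * n
    out_label = list(labels)
    for p in prototypes:
        cost[p] = 0.0
    done = [False] * n
    for _ in range(n):
        s = min((i for i in range(n) if not done[i]), key=lambda i: cost[i])
        done[s] = True
        for t in range(n):
            if not done[t]:
                c = max(cost[s], dist(samples[s], samples[t]))
                if c < cost[t]:
                    cost[t] = c
                    out_label[t] = out_label[s]
    return cost, out_label


def classify(x, samples, cost, out_label, dist=euclidean):
    # The new sample is conquered by the training sample that offers
    # the cheapest path, again measured by the maximum arc weight.
    winner = min(
        range(len(samples)),
        key=lambda i: max(cost[i], dist(samples[i], x)),
    )
    return out_label[winner]


# Toy usage: two well-separated groups standing in for two authors.
toy_samples = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
toy_labels = ["a", "a", "b", "b"]
toy_cost, toy_out = train_opf(toy_samples, toy_labels)
print(classify((0.2, 0.5), toy_samples, toy_cost, toy_out))  # prints "a"
print(classify((5.0, 5.4), toy_samples, toy_cost, toy_out))  # prints "b"
```

Passing the distance as a parameter keeps the classifier independent of how reference similarity is computed, which is the point at which a learned similarity function could be plugged in.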

Index Terms

A relevance feedback approach for the author name disambiguation problem
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
      1. Relevance assessment
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
JCDL '13: Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
July 2013
480 pages
ISBN: 9781450320771
DOI: 10.1145/2467696
General Chairs:
J. Stephen Downie, University of Illinois at Urbana-Champaign, USA
Robert H. McDonald, Indiana University Bloomington, USA
Program Chairs:
Timothy W. Cole, University of Illinois at Urbana-Champaign, USA
Robert Sanderson, Los Alamos National Laboratory, USA
Frank Shipman, Texas A&M University, USA
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
genetic programming
name disambiguation
optimum-path forest classifier
relevance feedback
Qualifiers
- research-article
Acceptance Rates
JCDL '13 Paper Acceptance Rate: 28 of 95 submissions, 29%
Overall Acceptance Rate: 415 of 1,482 submissions, 28%