ABSTRACT
This paper presents a new name disambiguation method that exploits user feedback on ambiguous references across iterations. An unsupervised step is used to define pure training samples, and a hybrid supervised step is employed to learn a classification model for assigning references to authors. Our classification scheme combines the Optimum-Path Forest (OPF) classifier with complex reference similarity functions generated by a Genetic Programming framework. Experiments demonstrate that the proposed method yields better results than state-of-the-art disambiguation methods on two traditional datasets.
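The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the co-author heuristic for building pure clusters, the hand-written `similarity` function (standing in for a similarity function evolved by Genetic Programming), the nearest-neighbour rule in `best_label` (standing in for the OPF classifier), the `threshold` value, and the `ask_user` callback are all hypothetical names and choices introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    ref_id: int
    coauthors: frozenset
    title_words: frozenset


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(r, s):
    # Hand-written stand-in for a GP-evolved similarity function (assumption).
    return 0.6 * jaccard(r.coauthors, s.coauthors) + 0.4 * jaccard(r.title_words, s.title_words)


def pure_clusters(refs):
    # Unsupervised step (assumed heuristic): group references that share a co-author.
    clusters = []
    for r in refs:
        hits, rest = [], []
        for c in clusters:
            (hits if any(r.coauthors & s.coauthors for s in c) else rest).append(c)
        clusters = rest + [[r] + [s for c in hits for s in c]]
    return clusters


def best_label(r, labeled):
    # Nearest-neighbour stand-in for the OPF classifier (assumption).
    return max(
        ((label, max(similarity(r, s) for s in members)) for label, members in labeled.items()),
        key=lambda pair: pair[1],
        default=(None, 0.0),
    )


def disambiguate(refs, ask_user, threshold=0.3):
    # Multi-reference clusters seed the training set; singletons stay unlabeled.
    labeled, unlabeled = {}, []
    for c in pure_clusters(refs):
        if len(c) > 1:
            labeled[len(labeled)] = list(c)
        else:
            unlabeled.extend(c)

    while unlabeled:
        scored = [(r, *best_label(r, labeled)) for r in unlabeled]
        confident = [(r, label) for r, label, score in scored
                     if label is not None and score >= threshold]
        if confident:
            for r, label in confident:
                labeled[label].append(r)
        else:
            # Feedback step: ask the user about the least certain reference only.
            r, _, _ = min(scored, key=lambda t: t[2])
            label = ask_user(r, labeled)
            if label is None:  # the user indicates a previously unseen author
                label = len(labeled)
                labeled[label] = []
            labeled[label].append(r)
        placed = {s.ref_id for members in labeled.values() for s in members}
        unlabeled = [r for r in unlabeled if r.ref_id not in placed]
    return labeled


# Toy data (invented for illustration): five references, two real authors.
refs = [
    Reference(1, frozenset({"b. lee", "c. kim"}), frozenset({"graph", "mining"})),
    Reference(2, frozenset({"c. kim"}), frozenset({"graph", "clustering"})),
    Reference(3, frozenset({"d. park"}), frozenset({"protein", "folding"})),
    Reference(4, frozenset({"e. cho"}), frozenset({"graph", "mining"})),
    Reference(5, frozenset({"f. yoon"}), frozenset({"protein", "folding"})),
]
truth = {1: "A", 2: "A", 3: "B", 4: "A", 5: "B"}
queries = []


def ask_user(r, labeled):
    # Simulated user: answers from ground truth and records each question asked.
    queries.append(r.ref_id)
    for label, members in labeled.items():
        if any(truth[s.ref_id] == truth[r.ref_id] for s in members):
            return label
    return None


result = disambiguate(refs, ask_user)
```

In this toy run the classifier places most references on its own and the simulated user is consulted only when no remaining reference clears the threshold, which is the intent of restricting feedback to ambiguous references.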
Index Terms
- A relevance feedback approach for the author name disambiguation problem