ABSTRACT
In this paper, we consider the problem of ambiguous author names in bibliographic citations and compare alternative approaches for identifying and correcting such name variants (e.g., "Vannevar Bush" and "V. Bush"). Our study is based on a scalable two-step framework: step 1 substantially reduces the number of candidates via blocking, and step 2 measures the distance between two names using coauthor information. Combining four blocking methods and seven distance measures on four data sets, we present extensive experimental results and identify combinations that are both scalable and effective at disambiguating author names in citations.
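The two-step framework can be illustrated with a minimal sketch. The abstract does not name the four blocking methods or the seven distance measures, so the choices here are assumptions made only for illustration: a hypothetical blocking key (first initial plus last name) and Jaccard distance over coauthor sets. The function names, the threshold, and the toy data are likewise made up and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def block_key(name):
    """Hypothetical blocking key: first initial plus lowercased last token."""
    tokens = name.replace(".", " ").split()
    return (tokens[0][0].lower(), tokens[-1].lower())


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as maximally distant."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def candidate_pairs(coauthors, threshold=0.5):
    """Step 1: group names into blocks. Step 2: compare only within a block."""
    blocks = defaultdict(list)
    for name in coauthors:
        blocks[block_key(name)].append(name)

    matches = []
    for names in blocks.values():
        # Only names sharing a block are compared, which avoids the
        # all-pairs comparison over the full name list.
        for x, y in combinations(sorted(names), 2):
            d = jaccard_distance(coauthors[x], coauthors[y])
            if d <= threshold:
                matches.append((x, y, d))
    return matches


# Made-up toy data: author name -> set of coauthor names.
toy = {
    "Vannevar Bush": {"A. Smith", "B. Jones", "C. Lee"},
    "V. Bush": {"A. Smith", "B. Jones"},
    "Victor Bushnell": {"A. Smith", "B. Jones"},
    "D. Kim": {"E. Park"},
}
print(candidate_pairs(toy))
```

In this toy run, "Victor Bushnell" shares coauthors with the other two names but lands in a different block, so it is never compared against them. That is the trade-off blocking makes: fewer comparisons in exchange for the risk of missing variants that the key separates.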
Index Terms
- Comparative study of name disambiguation problem using a scalable blocking-based framework