DOI: 10.1145/1065385.1065463
Article

Comparative study of name disambiguation problem using a scalable blocking-based framework

Published: 07 June 2005

ABSTRACT

In this paper, we consider the problem of ambiguous author names in bibliographic citations, and comparatively study alternative approaches to identify and correct such name variants (e.g., "Vannevar Bush" and "V. Vush"). Our study is based on a scalable two-step framework: step 1 substantially reduces the number of candidates via blocking, and step 2 measures the distance between two names using coauthor information. Combining four blocking methods and seven distance measures on four data sets, we present extensive experimental results and identify combinations that are both scalable and effective for disambiguating author names in citations.
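
The two-step framework described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say which blocking methods or distance measures were evaluated, so the last-name blocking key, the Jaccard distance over coauthor sets, the threshold of 0.5, and all names and data below are assumptions chosen only to show the shape of the pipeline.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name):
    """Hypothetical blocking key: the lowercased last token of the name."""
    return name.replace(".", " ").split()[-1].lower()


def build_blocks(names):
    """Step 1: group names so that only names sharing a key are compared."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    return blocks


def jaccard_distance(a, b):
    """Distance between two coauthor sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 1.0  # no coauthor evidence: treat as maximally distant
    return 1.0 - len(a & b) / len(union)


def candidate_pairs(coauthors, threshold=0.5):
    """Step 2: within each block, keep pairs whose coauthor distance is small."""
    matches = []
    for block in build_blocks(coauthors).values():
        for x, y in combinations(sorted(block), 2):
            d = jaccard_distance(coauthors[x], coauthors[y])
            if d <= threshold:
                matches.append((x, y, d))
    return matches


# Toy data, invented for illustration only (assumed, not from the paper).
coauthors = {
    "Vannevar Bush": {"A. Smith", "B. Jones", "C. Lee"},
    "V. Bush": {"A. Smith", "B. Jones"},
    "Vikram Bush": {"D. Patel"},
    "Ada Stone": {"A. Smith"},
}

for x, y, d in candidate_pairs(coauthors):
    print(f"{x} ~ {y} (distance {d:.2f})")
```

Blocking matters for scalability because pairwise comparison is quadratic in the number of names; restricting comparisons to names that share a key keeps the expensive distance computation inside small groups. A key as strict as the one assumed here would miss a misspelled surname, which is one reason a study might compare several blocking methods.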


    • Published in

      JCDL '05: Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
      June 2005
      450 pages
      ISBN: 1581138768
      DOI: 10.1145/1065385
      • General Chair: Mary Marlino
      • Program Chairs: Tamara Sumner, Frank Shipman

      Copyright © 2005 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 415 of 1,482 submissions, 28%
