DOI: 10.1145/1255175.1255243
Article

Efficient topic-based unsupervised name disambiguation

Published: 18 June 2007

ABSTRACT

Name ambiguity is a special case of identity uncertainty where one person can be referenced by multiple name variations in different situations or even share the same name with other people. In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents. We present an efficient and effective two-stage approach to disambiguate names. In the first stage, two novel topic-based models are proposed by extending two hierarchical Bayesian text models, namely Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). Our models explicitly introduce a new variable for persons and learn the distribution of topics with regard to persons and words. After learning an initial model, the topic distributions are treated as feature sets and names are disambiguated by leveraging a hierarchical agglomerative clustering method. Experiments on web data and scientific documents from CiteSeer indicate that our approach consistently outperforms other unsupervised learning methods such as spectral clustering and DBSCAN clustering and could be extended to other research fields. We empirically addressed the issue of scalability by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset.
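
The second stage described in the abstract, clustering name mentions by their learned topic distributions, can be illustrated with a minimal sketch. The abstract does not state the distance measure, linkage criterion, or stopping rule, so the Hellinger distance, average linkage, and the `threshold` parameter below are assumptions chosen for illustration. The function names and the toy distributions are hypothetical stand-ins for the output of the first-stage topic model.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def agglomerative_clusters(topic_dists, threshold):
    """Average-linkage agglomerative clustering that stops merging at `threshold`.

    topic_dists: per-mention topic distributions (assumed stage-one output).
    Returns a sorted list of clusters, each a sorted list of mention indices.
    """
    clusters = [[i] for i in range(len(topic_dists))]

    def avg_link(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(hellinger(topic_dists[i], topic_dists[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        (x, y), dist = min(
            (((x, y), avg_link(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break  # remaining clusters are treated as distinct persons
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return sorted(clusters)


# Toy topic distributions for five mentions of one ambiguous name (made-up numbers).
mentions = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.05, 0.90],
]
result = agglomerative_clusters(mentions, threshold=0.3)
print(result)
```

Each resulting cluster is read as one real-world person. This naive loop recomputes all pairwise linkages at every merge, so it only suits small candidate sets and says nothing about how the paper achieves scalability on the full CiteSeer collection.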


          Published in

          JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
          June 2007
          534 pages
          ISBN: 9781595936448
          DOI: 10.1145/1255175

          Copyright © 2007 ACM


          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 415 of 1,482 submissions, 28%
