ABSTRACT
Name ambiguity is a special case of identity uncertainty in which one person can be referenced by multiple name variations in different situations, or can share the same name with other people. In this paper, we focus on disambiguating person names within web pages and scientific documents. We present an efficient and effective two-stage approach. In the first stage, we propose two novel topic-based models that extend two hierarchical Bayesian text models, namely Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). Our models explicitly introduce a new variable for persons and learn the distribution of topics with regard to persons and words. In the second stage, after an initial model has been learned, the topic distributions are treated as feature sets and names are disambiguated with a hierarchical agglomerative clustering method. Experiments on web data and scientific documents from CiteSeer indicate that our approach consistently outperforms other unsupervised learning methods, such as spectral clustering and DBSCAN clustering, and could be extended to other research fields. We empirically address the issue of scalability by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset.
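As a rough illustration of the clustering step described in the abstract, the following minimal Python sketch groups mentions of one ambiguous name by their topic distributions. It is not the authors' implementation: the abstract does not state the distance measure, linkage criterion, or stopping rule, so the Hellinger distance, average linkage, and the `threshold` parameter are all assumptions, and the topic vectors are made-up values standing in for the output of a learned topic model.

```python
from itertools import combinations
from math import sqrt


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (assumed metric)."""
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def agglomerative(vectors, threshold):
    """Average-linkage agglomerative clustering over topic distributions.

    Starts with one cluster per vector and repeatedly merges the closest
    pair of clusters, stopping once the closest pair is farther apart
    than `threshold` (a hypothetical stopping rule).
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(hellinger(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        (x, y), dist = min(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if dist > threshold:
            break
        # x < y always holds, so deleting index y leaves index x valid.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical topic distributions for five mentions of one ambiguous name.
topic_dists = [
    [0.80, 0.15, 0.05],
    [0.75, 0.20, 0.05],
    [0.05, 0.10, 0.85],
    [0.10, 0.05, 0.85],
    [0.70, 0.25, 0.05],
]
clusters = agglomerative(topic_dists, threshold=0.3)
print(clusters)
```

Each resulting cluster is read as one distinct person. In this toy input, mentions 0, 1, and 4 share a dominant first topic and mentions 2 and 3 share a dominant third topic, so the sketch separates them into two groups. The pairwise search is quadratic in the number of clusters per merge, which is fine for illustration but would not scale to a collection the size of CiteSeer.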
Index Terms
- Efficient topic-based unsupervised name disambiguation