Article

Efficient topic-based unsupervised name disambiguation

Authors: Yang Song, Jian Huang, Isaac G. Councill, Jia Li, C. Lee Giles
Affiliation (all authors): The Pennsylvania State University, University Park, PA

JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, June 2007, Pages 342–351
https://doi.org/10.1145/1255175.1255243

Published: 18 June 2007

ABSTRACT

Name ambiguity is a special case of identity uncertainty where one person can be referenced by multiple name variations in different situations or even share the same name with other people. In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents. We present an efficient and effective two-stage approach to disambiguate names. In the first stage, two novel topic-based models are proposed by extending two hierarchical Bayesian text models, namely Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). Our models explicitly introduce a new variable for persons and learn the distribution of topics with regard to persons and words. After learning an initial model, the topic distributions are treated as feature sets and names are disambiguated by leveraging a hierarchical agglomerative clustering method. Experiments on web data and scientific documents from CiteSeer indicate that our approach consistently outperforms other unsupervised learning methods such as spectral clustering and DBSCAN clustering and could be extended to other research fields. We empirically addressed the issue of scalability by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset.
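The second stage described in the abstract, where per-document topic distributions serve as features for hierarchical agglomerative clustering, can be illustrated with a minimal sketch. It is not the authors' implementation: the Hellinger distance, the average-linkage rule, the stopping threshold, and the toy topic vectors are all assumptions chosen for illustration, and the topic model that would produce such vectors (the paper's PLSA and LDA extensions) is not shown.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.
    (Assumed distance measure; the abstract does not name one.)"""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering (assumed linkage rule).

    Starts with one cluster per vector and repeatedly merges the closest
    pair until the closest pair is farther apart than `threshold`.
    Returns a list of clusters, each a sorted list of vector indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Mean pairwise distance between members of clusters a and b.
        total = sum(hellinger(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break  # remaining clusters are treated as distinct persons
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Hypothetical topic distributions for five documents that mention the
# same ambiguous name; in practice these would come from a topic model.
topic_features = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.05, 0.90],
]
groups = agglomerative_clusters(topic_features, threshold=0.3)
print(groups)
```

Each resulting group is read as one distinct person behind the shared name. The quadratic pair search keeps the sketch short; it would not scale to a collection the size of the CiteSeer dataset mentioned above.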

Index Terms

Computing methodologies
  Machine learning
    Learning paradigms
      Unsupervised learning
        Cluster analysis
Information systems
  Information retrieval
    Retrieval tasks and goals
      Document filtering
      Information extraction

Published in
JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, June 2007, 534 pages
ISBN: 9781595936448
DOI: 10.1145/1255175
General Chair: Edie Rasmussen (University of British Columbia, Canada)
Program Chairs: Ray R. Larson (University of California, Berkeley), Elaine Toms (Dalhousie University, Canada), Shigeo Sugimoto (University of Tsukuba, Japan)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
bayesian models
hierarchical clustering methods
name disambiguation
probability analysis
unsupervised machine learning
Qualifiers
- Article

Acceptance Rates
Overall Acceptance Rate: 415 of 1,482 submissions, 28%