Abstract
The World Wide Web (WWW) provides much information about people, and in recent years WWW search engines have been commonly used to learn about them. However, many people share the same name, and this ambiguity typically causes the search results for a single person name to include Web pages about several different people. We propose a novel framework for person name disambiguation that consists of three component processes: (1) extraction of social network information by finding co-occurrences of named entities, (2) measurement of document similarities based on occurrences of key compound words, and (3) inference of topic information from documents based on the Dirichlet process unigram mixture model. Experiments using an actual Web document dataset show that the results of our framework are promising.
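To make the first component more concrete, the following is a minimal sketch of how pages might be grouped by co-occurring named entities. The abstract does not specify the actual procedure, so everything here is an assumption made for illustration: the Jaccard overlap measure, the threshold value, the function names `jaccard` and `cluster_by_shared_entities`, and the toy entity sets. Named entity extraction itself is assumed to have been done already.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two entity sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_shared_entities(doc_entities, threshold=0.2):
    """Group documents whose named-entity sets overlap enough.

    doc_entities: list of sets, one set of entity strings per document.
    threshold:    hypothetical cut-off, not taken from the paper.
    Returns a sorted list of clusters, each a list of document indices.
    """
    parent = list(range(len(doc_entities)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of documents whose overlap reaches the threshold.
    for i, j in combinations(range(len(doc_entities)), 2):
        if jaccard(doc_entities[i], doc_entities[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(doc_entities)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy data (invented): pages 0 and 1 share two entities, page 2 shares none.
docs = [
    {"Stanford", "Alice Lee", "NLP Lab"},
    {"Alice Lee", "NLP Lab", "ACL"},
    {"Boston Symphony", "Carol Diaz"},
]
clusters = cluster_by_shared_entities(docs)
```

In this sketch, pages that mention the same people, places, or organizations are treated as evidence for the same underlying person. The compound-word similarity and latent-topic components described in the abstract would supply additional signals that are not modeled here.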
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ono, S., Sato, I., Yoshida, M., Nakagawa, H. (2008). Person Name Disambiguation in Web Pages Using Social Network, Compound Words and Latent Topics. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2008. Lecture Notes in Computer Science, vol 5012. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68125-0_24
DOI: https://doi.org/10.1007/978-3-540-68125-0_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68124-3
Online ISBN: 978-3-540-68125-0
eBook Packages: Computer Science, Computer Science (R0)