Abstract
Name disambiguation in databases is a non-trivial task because people's names are often not unique and usually only limited information is associated with each name in the database. For example, in DBLP many authors share the same name, and there is no unique identifier to distinguish them. To make matters worse, we may not always be able to access the full contents of the publications unless we have joined the organizations (e.g. ACM) that publish them. Disambiguating names with such limited information is therefore very challenging. In this paper, we focus on this situation and propose a term-based driven clustering approach for it. Specifically, we first automatically construct term-based taxonomies that mimic expert knowledge of the domain by linking the related terms that appear in the database. Each taxonomy is then transformed into a graph, and we group the entries that belong to the same author by using either of two novel models: a graph-based similarity model and a graph-based random walk model. The former computes the similarity among terms, whereas the latter investigates how likely one set of terms is to be transformed into another set of terms. Extensive experiments are conducted on entries in DBLP. The favorable results indicate that the proposed approach is highly effective.
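The two graph-based ideas named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual formulas, so everything below is an assumption made for illustration: the toy term graph, the function names `path_similarity` and `walk_probability`, the use of inverse shortest-path length as a similarity, and a fixed-length uniform random walk between term sets.

```python
from collections import deque

# Hypothetical toy taxonomy: undirected links between related terms.
EDGES = [
    ("data mining", "clustering"),
    ("data mining", "classification"),
    ("clustering", "name disambiguation"),
    ("databases", "data mining"),
    ("databases", "query processing"),
]


def build_graph(edges):
    """Turn a list of term pairs into an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_similarity(graph, source, target):
    """Assumed similarity: 1 / (1 + shortest path length), 0.0 if unreachable."""
    if source not in graph or target not in graph:
        return 0.0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        term, dist = queue.popleft()
        if term == target:
            return 1.0 / (1 + dist)
        for neighbor in graph[term]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return 0.0


def walk_probability(graph, start_terms, end_terms, steps=2):
    """Probability that a uniform random walk, started uniformly over
    start_terms, sits on a term in end_terms after exactly `steps` steps."""
    starts = [t for t in start_terms if t in graph]
    if not starts:
        return 0.0
    dist = {t: 1.0 / len(starts) for t in starts}
    for _ in range(steps):
        nxt = {}
        for term, p in dist.items():
            neighbors = graph[term]
            for neighbor in neighbors:
                nxt[neighbor] = nxt.get(neighbor, 0.0) + p / len(neighbors)
        dist = nxt
    return sum(dist.get(t, 0.0) for t in end_terms)


GRAPH = build_graph(EDGES)
print(path_similarity(GRAPH, "classification", "name disambiguation"))
print(walk_probability(GRAPH, {"name disambiguation"}, {"data mining"}, steps=2))
```

In a clustering setting, scores like these could be computed between the term sets of two database entries and thresholded to decide whether the entries belong to the same author; how the paper combines them is not stated in the abstract.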
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhu, J., Zhou, X., Fung, G.P.C. (2009). A Term-Based Driven Clustering Approach for Name Disambiguation. In: Li, Q., Feng, L., Pei, J., Wang, S.X., Zhou, X., Zhu, QM. (eds) Advances in Data and Web Management. APWeb/WAIM 2009. Lecture Notes in Computer Science, vol 5446. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00672-2_29
DOI: https://doi.org/10.1007/978-3-642-00672-2_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00671-5
Online ISBN: 978-3-642-00672-2
eBook Packages: Computer Science, Computer Science (R0)