A Term-Based Driven Clustering Approach for Name Disambiguation

Zhu, Jia; Zhou, Xiaofang; Fung, Gabriel Pui Cheong

doi:10.1007/978-3-642-00672-2_29

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5446)

Abstract

Name disambiguation in databases is a non-trivial task because people’s names are often not unique and usually only limited information is associated with each name in the database. For example, in DBLP many authors share the same name, and there is no unique identifier to distinguish them. To make matters worse, we may not always be able to access the full contents of the materials unless we have joined the organizations (e.g. ACM) that publish them. As such, disambiguating names with very limited information is a challenging task. In this paper, we focus on this situation and propose a term-based driven clustering approach for solving it. Specifically, we first construct term-based taxonomies that mimic the expert knowledge of the domain by automatically linking the related terms that appear there. Each taxonomy is then transformed into a graph, and we group the entries that belong to the same author by using either of two novel models: a graph-based similarity model and a graph-based random walk model. The former computes the similarity among terms, whereas the latter investigates how likely a set of terms is to be transformed into another set of terms. Extensive experiments are conducted using the entries in DBLP. The favorable results indicate that our proposed approach is highly effective.
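
To make the two graph-based ideas more concrete, the following is a minimal illustrative sketch in Python. It is not the authors' models, whose definitions are not given in the abstract. It assumes a hypothetical toy taxonomy (`EDGES`), a stand-in similarity defined as 1 / (1 + shortest path length), and a uniform random walk over the term graph. All names (`build_graph`, `path_similarity`, `walk_probability`) are hypothetical.

```python
from collections import deque

# Hypothetical toy taxonomy: each pair links two related terms.
EDGES = [
    ("database", "query"),
    ("database", "index"),
    ("query", "optimization"),
    ("learning", "clustering"),
    ("learning", "classification"),
]


def build_graph(edges):
    """Turn a list of term pairs into an undirected adjacency map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_similarity(graph, src, dst):
    """Stand-in similarity: 1 / (1 + shortest path length), 0.0 if unreachable."""
    if src not in graph or dst not in graph:
        return 0.0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return 1.0 / (1 + dist)
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return 0.0


def walk_probability(graph, start_terms, target_terms, steps):
    """Probability that a uniform random walk, started uniformly from
    start_terms, sits on one of target_terms after exactly `steps` steps."""
    starts = set(start_terms) & graph.keys()
    if not starts:
        return 0.0
    targets = set(target_terms)
    dist = {term: 1.0 / len(starts) for term in starts}
    for _ in range(steps):
        nxt = {}
        for node, prob in dist.items():
            share = prob / len(graph[node])  # split mass evenly over neighbors
            for nbr in graph[node]:
                nxt[nbr] = nxt.get(nbr, 0.0) + share
        dist = nxt
    return sum(prob for node, prob in dist.items() if node in targets)


if __name__ == "__main__":
    g = build_graph(EDGES)
    print(path_similarity(g, "database", "optimization"))
    print(walk_probability(g, ["database"], ["optimization"], 2))
```

In this sketch, two entries whose term sets score highly under either function would be candidates for the same author cluster. How the paper actually weights edges, chooses the walk length, or turns scores into clusters is not stated in the abstract, so those choices are left open here.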

Author information

Authors and Affiliations

School of ITEE, The University of Queensland, Australia
Jia Zhu, Xiaofang Zhou & Gabriel Pui Cheong Fung

Editor information

Editors and Affiliations

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
Qing Li
Department of Computer Science & Technology, Tsinghua University, Beijing, China
Ling Feng
School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby BC, Canada
Jian Pei
Department of Computer Science, University of Vermont, VT 05405, Burlington, USA
Sean X. Wang
School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072, Brisbane, Australia
Xiaofang Zhou
Jiangsu Provincial Key Lab of Computer Information Processing Technology, School of Computer Science & Technology, Soochow University, 1 Shizi Street, Suzhou, 215006, Jiangsu, China
Qiao-Ming Zhu

About this paper

Cite this paper

Zhu, J., Zhou, X., Fung, G.P.C. (2009). A Term-Based Driven Clustering Approach for Name Disambiguation. In: Li, Q., Feng, L., Pei, J., Wang, S.X., Zhou, X., Zhu, QM. (eds) Advances in Data and Web Management. APWeb WAIM 2009. Lecture Notes in Computer Science, vol 5446. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00672-2_29

DOI: https://doi.org/10.1007/978-3-642-00672-2_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00671-5
Online ISBN: 978-3-642-00672-2
eBook Packages: Computer Science, Computer Science (R0)
