
A Term-Based Driven Clustering Approach for Name Disambiguation

  • Conference paper
Advances in Data and Web Management (APWeb 2009, WAIM 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5446)

Abstract

Name disambiguation in databases is a non-trivial task because people’s names are often not unique and usually only limited information is associated with each name in the database. For example, in DBLP many authors share the same name, yet there is no unique identifier to distinguish them. To make matters worse, we may not always be able to access the full contents of the materials unless we have joined the organizations (e.g. ACM) that publish them. Disambiguating names with such limited information is therefore a very challenging task. In this paper, we focus on this situation and propose a term-based driven clustering approach for solving it. Specifically, we first construct term-based taxonomies that mimic expert knowledge of the domain by automatically linking the related terms that appear in the database. Each taxonomy is then transformed into a graph, and we group the entries that belong to the same author by using either of two novel models: a graph-based similarity model and a graph-based random walk model. The former computes the similarity among terms, whereas the latter investigates how likely one set of terms is to be transformed into another set of terms. Extensive experiments are conducted using the entries in DBLP. The favorable results indicate that our proposed approach is highly effective.
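
The abstract gives no formulas, so the following is only a rough sketch of what a random walk between term sets on a term graph could look like. It is not the authors' model. The toy term graph, the uniform transition rule, and the names `build_graph`, `walk_distribution`, and `reach_probability` are all assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected term graph as an adjacency map: term -> set of neighbours."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def walk_distribution(graph, start_terms, steps):
    """Distribution of a uniform random walk after `steps` steps,
    started uniformly at random from one of `start_terms`."""
    start_terms = set(start_terms)
    dist = {t: 1.0 / len(start_terms) for t in start_terms}
    for _ in range(steps):
        nxt = defaultdict(float)
        for term, p in dist.items():
            neighbours = graph.get(term)
            if not neighbours:
                nxt[term] += p  # isolated term: the walker stays put
                continue
            share = p / len(neighbours)  # uniform transition (an assumption)
            for n in neighbours:
                nxt[n] += share
        dist = dict(nxt)
    return dist


def reach_probability(graph, source_terms, target_terms, steps=2):
    """Probability that a walk started in `source_terms` is inside
    `target_terms` after exactly `steps` steps."""
    dist = walk_distribution(graph, source_terms, steps)
    targets = set(target_terms)
    return sum(p for t, p in dist.items() if t in targets)


# Hypothetical term graph derived from a taxonomy (illustrative data only).
graph = build_graph([
    ("clustering", "data mining"),
    ("classification", "data mining"),
    ("data mining", "databases"),
    ("routing", "networks"),
])

# Terms of three hypothetical entries that carry the same author name.
entry_a = {"clustering"}
entry_b = {"classification"}
entry_c = {"routing"}

score_ab = reach_probability(graph, entry_a, entry_b, steps=2)
score_ac = reach_probability(graph, entry_a, entry_c, steps=2)
```

In this toy setting `score_ab` exceeds `score_ac`, because the terms of the first two entries are connected through a shared parent term, while the third entry lies in a separate component of the graph. A clustering step could then group entries whose pairwise scores exceed a chosen threshold; the threshold and the number of steps would need tuning on real data.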




Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhu, J., Zhou, X., Fung, G.P.C. (2009). A Term-Based Driven Clustering Approach for Name Disambiguation. In: Li, Q., Feng, L., Pei, J., Wang, S.X., Zhou, X., Zhu, QM. (eds) Advances in Data and Web Management. APWeb 2009, WAIM 2009. Lecture Notes in Computer Science, vol 5446. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00672-2_29


  • DOI: https://doi.org/10.1007/978-3-642-00672-2_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00671-5

  • Online ISBN: 978-3-642-00672-2

  • eBook Packages: Computer Science, Computer Science (R0)
