On co-authorship for author disambiguation

https://doi.org/10.1016/j.ipm.2008.06.006

Abstract

Author name disambiguation deals with clustering the same-name authors into different individuals. To attack the problem, many studies have employed a variety of disambiguation features such as coauthors, titles of papers/publications, topics of articles, emails/affiliations, etc. Among these, co-authorship is the most easily accessible and influential, since inter-person acquaintances represented by co-authorship could discriminate the identities of authors more clearly than other features. This study attempts to explore the net effects of co-authorship on author clustering in bibliographic data. First, to handle the shortage of explicit coauthors listed in known citations, a web-assisted technique of acquiring implicit coauthors of the target author to be disambiguated is proposed. Then, a coauthor disambiguation hypothesis that the identity of an author can be determined by his/her coauthors is examined and confirmed through a variety of author disambiguation experiments.

Introduction

As a tremendous amount of person-related information is available on the web and other electronic media, people frequently search for persons of interest using their names as queries (Guha & Garg, 2004). However, the mapping between persons and their names is many-to-many. A person may have multiple names, and different persons may share the same name. For example, a person named David S. Johnson can write his name as David S. Johnson, David Johnson, D. S. Johnson, or D. Johnson, and there may exist two or more different persons named David S. Johnson. When a single person is viewed as an object having a unique meaning, different names signifying the same person can be considered synonyms, and identical names indicating different persons can be considered homonyms.

The above many-to-many mapping between persons and their names may severely deteriorate the effectiveness of the person search. When a person name is submitted to a web people search system (Wan, Gao, Li, & Ding, 2005) that finds persons in web pages, the web pages that have synonymous names of the input query may not be retrieved, lowering the recall of the person search, and the retrieved pages may contain numerous homonymous names of the input query, decreasing the precision of the search system. Moreover, synonymous and/or homonymous person names may hinder the correct representation of person-related social networks such as coauthor and citation networks.

Thus, identifying synonymous person names and resolving homonymous ones are crucial to any person-related search or representation system. Thus far, much research has addressed the identification of synonymous person names under the terms record linkage, merge/purge, and duplicate detection and elimination (Elmagarmid et al., 2007; Gu et al., 2003; Winkler, 2006). Matching of synonymous names employs similarity measures which mainly rely on name-internal features such as overlapping characters or tokens (first, middle, last name) of the two names to be compared. On the other hand, disambiguation of homonymous names has only recently received growing attention with the advent of the semantic web and social networks. It is highly dependent on name-external features: domain-independent biographic features (Guha & Garg, 2004; Mann & Yarowsky, 2003; Wan et al., 2005) such as birth date/place and e-mail/postal addresses, and/or domain-specific contextual features such as coauthors and paper titles in the case of bibliographic data.
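To make the notion of name-internal features concrete, the following is a minimal sketch of token-based matching of synonymous names. It is not a method from this study: the functions `name_tokens`, `tokens_compatible`, and `may_be_synonyms` are hypothetical, and the sketch assumes Western-order names whose last token is the surname.

```python
def name_tokens(name: str) -> list[str]:
    """Lowercase a name and split it into tokens, treating periods as spaces."""
    return name.lower().replace(".", " ").split()


def tokens_compatible(a: str, b: str) -> bool:
    """An initial matches any token with the same first letter; full tokens must be equal."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def may_be_synonyms(name1: str, name2: str) -> bool:
    """Heuristic (assumption): same surname and compatible first tokens."""
    t1, t2 = name_tokens(name1), name_tokens(name2)
    if not t1 or not t2 or t1[-1] != t2[-1]:
        return False
    return tokens_compatible(t1[0], t2[0])
```

Under this heuristic, "David S. Johnson" and "D. Johnson" are treated as candidate synonyms, while "David Johnson" and "Daniel Johnson" are not. Note that such matching says nothing about homonyms: two different persons named D. Johnson still look identical to it.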

This study addresses the resolution of homonymous author names appearing in citation data. As disambiguation features, previous works have employed coauthors, titles of articles, titles of publications, and years of publications that constitute basic citation data. Titles of articles may epitomize research areas of their authors, thus under the assumption that namesakes do not heavily share their research areas, title similarity between articles may be employed in resolving homonymous author names. The same also applies to the case of titles of books. In addition to the above citation-internal features, some have utilized citation-external features such as abstracts, self-citations, and citation URLs. When the full text of articles is available, additional features such as e-mail addresses, affiliation, and keywords can be extracted and applied to the author name disambiguation.

We believe that, of the above features, co-authorship is the most reliable and decisive for discriminating the identities of authors, since it implies real-world acquaintances among authors. For instance, when one citation contains D. S. Johnson and C. J. Date as its authors, we can say that the D. S. Johnson in the citation is the D. S. Johnson whom C. J. Date knows. When another citation also has D. S. Johnson and C. J. Date, we can further state that the “D. S. Johnson”s in the two citations are the same individual, namely, the D. S. Johnson whom C. J. Date knows, under the assumption that the “C. J. Date”s in the two citations are not different persons. Other citation features are not as person-related as co-authorship.
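The reasoning above can be sketched as a simple grouping procedure: citations of the same target name are merged whenever they share at least one other author. This is only an illustration under the stated assumption that identical coauthor names denote the same person; the function name `cluster_by_shared_coauthor` and the union-find implementation are our own choices, not the algorithm evaluated in this study.

```python
from itertools import combinations


def cluster_by_shared_coauthor(
    citations: dict[str, set[str]], target: str
) -> list[set[str]]:
    """Group citation ids whose author sets, excluding the target name, intersect."""
    parent = {cid: cid for cid in citations}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    coauthors = {cid: authors - {target} for cid, authors in citations.items()}
    for a, b in combinations(citations, 2):
        if coauthors[a] & coauthors[b]:
            parent[find(a)] = find(b)  # a shared coauthor links the two citations

    clusters: dict[str, set[str]] = {}
    for cid in citations:
        clusters.setdefault(find(cid), set()).add(cid)
    return list(clusters.values())
```

For example, given three hypothetical citations of "D. S. Johnson", two of which also list "C. J. Date", the first two fall into one cluster and the third stays alone. Merging is transitive, so a chain of shared coauthors also links citations that have no coauthor in common directly.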

This study concentrates on investigating the sole effect of co-authorship information on the resolution of homonymous author names in bibliographic data. In particular, we start with a hypothesis stating that the identity of an author is characterized by his/her coauthors. Next, a web-based technique of gathering coauthors is proposed to supplement implicit coauthors not found in known citations. Then, our hypothesis is investigated using large-scale test data from the following viewpoints:

  • Is it helpful to enrich coauthors with web-assisted collaborators in disambiguating namesakes?

  • Is a person identified more accurately by more coauthors?

  • How probable is it that co-authorship succeeds in disambiguating authors?

  • Is it beneficial to use coauthors of coauthors in resolving authors?

The scope of co-authorship that this study handles is limited to academic co-authorship. In addition, the evaluation of co-authorship is performed exclusively using a test set in Korean. The remainder of this article is organized as follows. Section 2 summarizes related works. Section 3 describes the usage of co-authorship for author name disambiguation. Next, a web-assisted method of expanding coauthors is proposed in Section 4. Then, evaluations and concluding remarks are given in the sections that follow.

Section snippets

Overview

Research on the processing of person names can be divided into personal name matching and personal name disambiguation. The former handles the identification of synonymous person names, while the latter involves the discrimination of homonymous ones. One may argue that person name disambiguation inherently includes the problem of personal name matching, since there may exist many namesakes who have a variety of name variants. This study, however, assumes that personal name matching precedes

Co-authorship

In general, coauthors who are listed in the same paper know each other. This acquaintance among coauthors may provide clues on disambiguating homonymous author names. For example, suppose that we want to discriminate appearances of ‘A. Cohen’ as authors in the five citations C1 through C5.

C1: A. Cohen, S. Draper, E. Martinian, G. Wornell (2006). Stealing bits from a quantized …
C2: A. Cohen, S. Draper, E. Martinian, G. Wornell (2002). Source requantization: successive …
C3: A. Cohen, J. Siegel, P.

Web-based acquisition of coauthors

When we collect coauthors of a person named ‘T. Mitchell’, we need to specify a particular ‘T. Mitchell’ among numerous persons named ‘T. Mitchell’. For this, we use a known coauthor of the particular ‘T. Mitchell’. For instance, ‘T. Mitchell’ in the following artificial citation C10 can be specified as the ‘T. Mitchell’ known to ‘R. Niculescu’ or ‘R. Rao’ or ‘K. Patrick’.

C10: R. Niculescu, T. Mitchell, R. Rao, K. Patrick (2006). Bayesian network learning with parameter constraints. Journal of
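As a rough illustration of how a particular ‘T. Mitchell’ might be specified by a known coauthor, the sketch below pairs the target name with each other author of a citation to form search queries. The function `pair_queries` is hypothetical, and the quoted-phrase query format is an assumption; the exact query syntax is not given in the text available here.

```python
def pair_queries(target: str, citation_authors: list[str]) -> list[str]:
    """Build one query string per (target, known coauthor) pair.

    The quoting convention is an assumption made for this sketch.
    """
    return [f'"{target}" "{co}"' for co in citation_authors if co != target]


# Authors of the artificial citation C10 from the text.
c10_authors = ["R. Niculescu", "T. Mitchell", "R. Rao", "K. Patrick"]
queries = pair_queries("T. Mitchell", c10_authors)
```

Each resulting query names the target together with one acquaintance, which narrows the search to the ‘T. Mitchell’ known to that coauthor.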

Test set

As a test set to assess namesake resolution, we created a gold standard for a total of 8675 IT-related conference papers which were published in Korean during 1999–2006. Korean does not suffer from name-variant problems. In other words, a person’s name is written in Korean in a single form, that is, a surname followed by a given name without delimiters or middle names. Owing to this feature, a Korean author disambiguation system may skip the matching of synonymous names. To manually discriminate the

Evaluation results and discussion

Table 3 shows the statistics of coauthor features. On average, 2.85 explicit coauthors were found in the test set, and 19.75 aICs and 17.65 cICs were added by the web-based coauthor expansion algorithm, which was executed on the top 20 documents retrieved through the Google search engine. In order to block the inclusion of erroneous coauthors from irrelevant web pages, the algorithm was stopped after at most 50 new coauthors were gathered for each pair of author names. We have
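The two limits mentioned above (top 20 documents, at most 50 new coauthors per name pair) can be sketched as a capped collection loop. This is a minimal sketch, not the expansion algorithm itself: `search` and `extract_names` are hypothetical callables supplied by the caller, standing in for a web search client and a name extractor whose details are not described in the text available here.

```python
from typing import Callable, Iterable


def collect_implicit_coauthors(
    query: str,
    known: set[str],
    search: Callable[[str, int], Iterable[str]],
    extract_names: Callable[[str], Iterable[str]],
    top_k: int = 20,
    max_new: int = 50,
) -> set[str]:
    """Gather names not already known, stopping once max_new have been collected."""
    new: set[str] = set()
    for page_text in search(query, top_k):       # hypothetical: yields page texts
        for name in extract_names(page_text):    # hypothetical: yields person names
            if name not in known and name not in new:
                new.add(name)
                if len(new) >= max_new:
                    return new                   # cap reached, stop early
    return new
```

Passing the search and extraction steps in as parameters keeps the cap logic testable without network access; the defaults mirror the values reported above.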

Conclusions

This study attempted to investigate the influence of co-authorship on author name disambiguation. To deal with the scarcity of explicit coauthors appearing in known citation records, a web-rendered technique of gleaning implicit coauthors not listed in known citations was proposed. Under the assumption that a pair of person names mutually determines the identities of each other, a string of two author names in known citations was submitted to a web search engine to gather their unrevealed

References (29)

  • Alani, H., et al. (2003). Identifying communities of practice through ontology network analysis. IEEE Intelligent Systems.
  • Aswani, N., Bontcheva, K., & Cunningham, H. (2006). Mining information for instance unification. In Proceedings of the...
  • Culotta, A., Kanani, P., Hall, R., Wick, M., & McCallum, A. (2007). Author disambiguation using error-driven machine...
  • Elmagarmid, A. K., et al. (2007). Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering.
  • Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. Studies in linguistic analysis.
  • Gu, L., Baxter, R., Vickers, D., & Rainsford, C. (2003). Record linkage: current practice and future directions....
  • Guha, R., & Garg, A. (2004). Disambiguating people in search. In Proceedings of the 13th international conference on...
  • Han, H., Giles, C. L., & Zha, H. (2003). A model-based k-means algorithm for name disambiguation. In Proceedings of...
  • Han, H., Giles, C. L., Zha, H., Li, C., & Tsioutsiouliklis, K. (2004). Two supervised learning approaches for name...
  • Han, H., Xu, W., Zha, H., & Giles, C. L. (2005). A hierarchical naive Bayes mixture model for name disambiguation in...
  • Han, H., Zha, H., & Giles, C. L. (2005). Name disambiguation in author citations using a k-way spectral clustering...
  • Harris, Z. (1954). Distributional structure. Word.
  • Huang, J., Ertekin, S., & Giles, C. L. (2006). Efficient name disambiguation for large scale databases. In Proceedings...
  • Jain, A., et al. (1999). Data clustering: A review. ACM Computing Surveys.

In-Su Kang received his M.S. and Ph.D. degrees in Computer Science from POSTECH in 1999 and 2006. Currently, he teaches at Kyungsung University. His research interests include IR, NLP, and semantic web.

Seung-Hoon Na received his M.S. and Ph.D. degrees in Computer Science from POSTECH in 2003 and 2008. Currently, he works for POSTECH as a post-doctoral researcher. His research interests include IR and NLP.

Seungwoo Lee received his M.S. and Ph.D. degrees in Computer Science from POSTECH in 1999 and 2005. Currently, he works for KISTI as a senior researcher. His research interests include IR, NLP, and semantic web.

Hanmin Jung received his M.S. and Ph.D. degrees in Computer Science from POSTECH in 1994 and 2003. Currently, he works for KISTI as a senior researcher. His research interests include information extraction, IR, NLP, and semantic web.

Pyung Kim received his M.S. and Ph.D. degrees in Computer Science from Chungnam National University in 1999 and 2004. Currently, he works for KISTI as a senior researcher. His research interests include IR, text mining, and semantic web.

Won-Kyung Sung received his Ph.D. degree in Linguistics from University of Paris 7 in 1996. Currently, he works for KISTI as a senior researcher. His research interests include NLP and semantic web.

Jong-Hyeok Lee received his M.S. and Ph.D. degrees in Computer Science from KAIST in 1982 and 1988. Currently, he is a full professor at POSTECH. His research interests include machine translation, IR, and NLP.
