Generation of topic evolution trees from heterogeneous bibliographic networks

https://doi.org/10.1016/j.joi.2016.04.002

Highlights

  • Heterogeneous paper networks are useful for construction of topic evolution trees.

  • Meta-path restrictions based on contribution enable context-dependent results.

  • Context-sensitive topic evolution trees enhanced information retrieval results.

Abstract

The volume of existing research literature is such that it can be difficult to find highly relevant information and to develop an understanding of how a scientific topic has evolved. Prior research on topic evolution has often leveraged refinements to Latent Dirichlet Allocation (LDA) to identify emerging topics. However, such methods do not answer the question of which studies contributed to the evolution of a topic. In this paper we show that meta-paths over a heterogeneous bibliographic network (consisting of papers, authors, and venues) can be used to identify the network elements that made the greatest contributions to a topic. In particular, by adding derived edges that capture the contribution of papers, authors, and venues to a topic (using the PageRank algorithm), a restricted meta-path over the bibliographic network can be used to confine the evolution of topics to the context of interest to a researcher. We use such restricted meta-paths to construct a topic evolution tree that can provide researchers with a web-based visualization of the evolution of a scientific topic in the context of interest to them. Compared to baseline networks without restrictions, we find that restricted networks provide more useful topic evolution trees.

Introduction

An exponential growth in the output of scientific literature (Bornmann & Mutz, 2015; Price, 1961; Price, 1963; Van Raan, 2000) is one of the staples of contemporary science. In such an environment, scientists are becoming more specialized with narrower expertise (Jones, 2005; Jones, 2010), while the solutions of difficult problems in science and industry often require interdisciplinary approaches (Wagner et al., 2011) and team-based research (Falk-Krzesinski et al., 2010). Thus, ideally, scientists need to have deep knowledge in their own specialty and broad knowledge across a range of domains, i.e., be “T-shaped” (Barile, Franco, Nota, & Saviano, 2012; Donofrio, Spohrer, & Zadeh, 2010). While the Internet and electronic publications have made it easier to access an unprecedented volume and range of resources, this wealth of information can make it difficult to identify the best resources for learning the foundations of a specific topic or to identify the researchers, papers, and venues that have made the greatest contribution to a specific topic. As an example, each year the U.S. National Library of Medicine database of biomedical literature (MEDLINE) is growing by approximately 700,000 articles from 21,000 different journals and as of 2015 contains reference data on over 25 million resources. While this abundance of resources provides an incredible opportunity, it can also overwhelm researchers, whether they are searching for information as part of learning a specialization or trying to develop a broad understanding of different fields.

While the current data deluge has exacerbated the problem of retrieving relevant documents, the problem itself is not new. The field of information retrieval has been proposing mostly topical solutions to the problem of retrieving relevant documents since the 1950s. At the same time, the field of bibliometrics/informetrics/scientometrics started harvesting “vast quantities of knowledge about knowledge, or metaknowledge” (Evans & Foster, 2011, p. 721) from journal articles. While the two subfields of information science mostly developed in parallel (Glänzel, 2015; Wolfram, 2015), more recently there have been efforts to bring these two subfields closer in ways that would benefit both (e.g., Glänzel, 2015; Mayr & Scharnhorst, 2015; Mutschke & Mayr, 2015; White, 2007a; White, 2007b; White, 2015; Wolfram, 2015). This paper aims to contribute to those efforts, not only by utilizing knowledge from both subfields, but by proposing solutions for identifying related research that would be useful in building information systems and delineating reference sets for informetrics research.

In this research, we propose a topic evolution tree (TET) that builds on prior research to present different evolutionary paths for topics to different individuals, based on the context that is relevant to their research. To achieve this we use a heterogeneous bibliographic network (Sun, Han, Yan, Yu, & Wu, 2011) constructed from four types of entities present in a scientific paper repository: papers (P), venues (V), authors (A), and topics or keywords (K), together with the relationships between them. In TET we utilize multiple meta-paths between topic nodes in the heterogeneous graph to identify the topics that a given topic has evolved from. In constructing the TET for a topic, we calculate a score for each meta-path instance based on the edge weights of the relationships. This approach shares similarities with the calculation of path instance scores as proposed in PathSim (Sun et al., 2011). However, we propose that for topic evolution, better results can be obtained by adding meta-path restrictions based on a new “contribution” edge as proposed for citation recommendation (Liu, Yu, Guo, & Sun, 2014b). Namely, we use not only the edges previously used in bibliographic networks, such as “written by”, “cited by”, “published in”, or “used (topic)” (Lee & Adorna, 2012; Shi, Kong, Yu, Xie, & Wu, 2012; Sun et al., 2011), but also contribution edges derived from PageRank calculations for each type of node (Liu et al., 2014b). For example, the “contributed-by-author” edge is calculated for each topic over a graph of authors citing other authors. We employ contribution edges as a means to restrict the context of a meta-path so that a random walk of the heterogeneous graph generates a TET showing the evolution of a topic in a particular context. In other words, we obtain the evolution of each branch of the TET based on the context or query topic that is specifically of interest to each scholar. The workflow used to generate a topic evolution tree is shown in Fig. 1.
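The idea of scoring meta-path instances between topic nodes can be illustrated with a minimal sketch. This is not the authors' implementation: the toy graph, the node names, and the edge weights are hypothetical, and the scoring rule (the product of edge weights along an instance, summed over instances) is an assumption made for illustration, since the exact scoring function is not given in this excerpt. The optional `allowed` set is likewise a simplified stand-in for a contribution-based restriction.

```python
# Toy heterogeneous graph. Nodes are (type, name) tuples with types
# K (topic), P (paper), A (author). All names and weights are hypothetical.
EDGES = {
    (("K", "cloud computing"), ("P", "p1")): 0.6,
    (("K", "cloud computing"), ("P", "p2")): 0.4,
    (("P", "p1"), ("A", "alice")): 1.0,
    (("P", "p2"), ("A", "bob")): 1.0,
    (("A", "alice"), ("P", "p3")): 0.5,
    (("A", "bob"), ("P", "p3")): 0.5,
    (("P", "p3"), ("K", "mapreduce")): 0.8,
}


def neighbors(node, node_type):
    """Yield (neighbor, weight) pairs of the requested type reachable from node."""
    for (src, dst), weight in EDGES.items():
        if src == node and dst[0] == node_type:
            yield dst, weight


def path_instances(start, meta_path, allowed=None):
    """Enumerate (path, score) for every instance of meta_path starting at start.

    meta_path is a string of node types such as "KPAPK". The score of an
    instance is assumed to be the product of its edge weights. If allowed is
    given, intermediate (non-topic) nodes outside that set are skipped.
    """
    frontier = [([start], 1.0)]
    for node_type in meta_path[1:]:
        frontier = [
            (path + [nxt], score * weight)
            for path, score in frontier
            for nxt, weight in neighbors(path[-1], node_type)
            if allowed is None or nxt[0] == "K" or nxt in allowed
        ]
    return frontier


def meta_path_score(start, end, meta_path, allowed=None):
    """Sum the scores of all instances of meta_path that run from start to end."""
    return sum(
        score
        for path, score in path_instances(start, meta_path, allowed)
        if path[-1] == end
    )


query = ("K", "cloud computing")
target = ("K", "mapreduce")
unrestricted = meta_path_score(query, target, "KPAPK")
# Hypothetical restriction: keep only nodes deemed contributors to the query topic.
contributors = {("P", "p1"), ("A", "alice"), ("P", "p3")}
restricted = meta_path_score(query, target, "KPAPK", allowed=contributors)
```

In this toy setting the restriction removes the instance that passes through `bob`, so the restricted score can only be less than or equal to the unrestricted one.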

As an example, in the information retrieval domain, a researcher might be interested in the evolution of the topic “Cloud computing” (user query) and the papers, authors, or venues that made the most significant contribution to the evolution of that topic. The topic “Cloud computing” would be the root node of the TET and each edge from that root to a child node represents the evolution of “Cloud computing” from a contributing topic. One of the topics contributing to “Cloud computing” is the Big Data topic “MapReduce”, so there would be an edge in the TET from “Cloud computing” to a child node labeled “MapReduce”. Since the restricted meta-path uses the contribution edge from the bibliographic network, the subtree in the TET for “MapReduce” focuses on the evolution of that topic in the context of “Cloud computing”. Thus, the evolution of the topic “MapReduce” could be different in the context of cloud computing than in the context of data security, since the papers, authors, or venues covering “MapReduce” will have made at least slightly different contributions in one context versus the other. It should be noted that this does not imply that the history of a topic such as cloud computing or MapReduce cannot be ascertained, but that the contribution edge added to a heterogeneous graph can be used to generate a TET containing the history of a topic from the viewpoint of the context of interest to the user. In the above example, the evolution of MapReduce would be included only in the context of its contribution to cloud computing. Conceptually this is similar to the research by Small, Boyack, and Klavans (2014) where the emergence of topics can be viewed from different levels of granularity.
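The context dependence described in this example can be sketched as a small recursive tree builder. The sketch below is only illustrative: the `CONTRIB` table, its scores, and the child topics other than “MapReduce” are hypothetical, and the `max_depth` and `top_k` cut-offs are assumptions rather than parameters taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TETNode:
    """One node of a topic evolution tree."""
    topic: str
    score: float = 1.0
    children: list = field(default_factory=list)


# Hypothetical contribution scores, keyed first by the query context and then
# by topic: CONTRIB[context][topic] -> {contributing topic: score}.
CONTRIB = {
    "cloud computing": {
        "cloud computing": {"mapreduce": 0.4, "virtualization": 0.3},
        "mapreduce": {"distributed file systems": 0.5},
    },
    "data security": {
        "data security": {"mapreduce": 0.2},
        "mapreduce": {"access control": 0.6},
    },
}


def build_tet(query, max_depth=2, top_k=2):
    """Build a TET rooted at the query topic, using only scores from its context."""
    context = CONTRIB.get(query, {})

    def expand(topic, score, depth, seen):
        node = TETNode(topic, score)
        if depth < max_depth:
            ranked = sorted(
                context.get(topic, {}).items(),
                key=lambda item: item[1],
                reverse=True,
            )
            for child, child_score in ranked[:top_k]:
                if child not in seen:  # avoid cycles along a branch
                    node.children.append(
                        expand(child, child_score, depth + 1, seen | {child})
                    )
        return node

    return expand(query, 1.0, 0, {query})


cloud_tree = build_tet("cloud computing")
security_tree = build_tet("data security")
```

With these made-up scores, the “mapreduce” subtree differs between the two trees because each tree reads only the scores stored under its own query context, which mirrors the point that a topic's evolution is shown from the viewpoint of the user's query.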

Section snippets

Related work

In this paper we argue for the usefulness of restricted meta-paths over heterogeneous bibliographic networks both for finding highly relevant information and for developing an understanding of how a scientific topic has evolved. Such bibliographic networks are one example of heterogeneous information networks (HINs) that capture the semantics of a real-world network (Sun & Han, 2012). Research on mining HINs has explored clustering, classification, and relationship prediction, but the

Data and methods

In this study, we investigate the topic evolution tree (TET) generation problem. Given a topic k* of interest to a researcher (in this case, a user query) that is one of the K topics covered by papers in a scientific paper repository, a number of topics covered in the repository can contribute to k*. Each of those topics can in turn be contributed to by other topics. Based on this kind of relationship, we can generate a tree structure for a query on the evolution of topic k*, where k* is used as

Results

In order to validate the method proposed in this study, we constructed separate heterogeneous graphs for biomedical research (PMC) and computer science (ACM). The numbers of vertices and edges in each graph are listed in Table 1, as discussed in Section 3.1. The direct relationships, such as the authors of each paper, were extracted from the bibliographic metadata, and the PageRank calculations used to generate the contribution edges were performed using version 2.0.1 of the JUNG graph library.
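As an illustration of the kind of PageRank calculation that could underlie a contribution edge, the following is a minimal power-iteration sketch in plain Python. It does not use JUNG and is not the authors' code: the author citation graph is hypothetical, and the damping factor of 0.85 and the fixed iteration count are conventional defaults assumed here, not values reported in the excerpt.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    graph maps each node to the list of nodes it links to (for example,
    an author to the authors they cite). Rank held by nodes with no
    outgoing links is redistributed uniformly over all nodes.
    """
    nodes = set(graph) | {dst for targets in graph.values() for dst in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in graph.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical author citation graph restricted to papers on a single topic:
# an edge alice -> carol means alice cites carol on that topic.
author_citations = {
    "alice": ["carol"],
    "bob": ["carol", "alice"],
    "carol": [],
}
author_rank = pagerank(author_citations)
```

Under these assumptions, the resulting score for each author could serve as the weight of a “contributed-by-author” edge between that author and the topic; the same routine would apply to citation graphs of papers or venues.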

Discussion and conclusions

Recent research has applied the mining of relationships in heterogeneous graphs such as bibliographic networks to purposes such as path similarity, identifying emerging research topics, and citation recommendation. In this paper we examined mining heterogeneous graphs from different scientific domains to map the evolution of topics as a TET. Prior research has focused on heterogeneous networks composed of objects and relationships extracted directly from bibliographic metadata, and while we

Authors’ contributions

Conceived and designed the analysis, collected the data, contributed data or analysis tools, performed the analysis and wrote the paper: Scott Jensen.

Conceived and designed the analysis, collected the data, contributed data or analysis tools, performed the analysis and wrote the paper: Xiaozhong Liu.

Collected the data and contributed data or analysis tools: Yingying Yu.

Wrote the paper: Staša Milojevic.

Acknowledgements

The authors wish to thank the National Center for Biotechnology Information for making bibliographical metadata on biomedical literature available through PubMed and also wish to thank the Association for Computing Machinery for making bibliographic metadata on computer science literature in the ACM digital library available for this project.

References (72)

  • L. Bornmann et al.

    Growth rates of modern science: a bibliometric analysis based on the number of publications and cited references

    Journal of the Association for Information Science and Technology

    (2015)
  • C. Chen

    CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature

    Journal of the American Society for Information Science and Technology

    (2006)
  • W.A. Cheung et al.

    Quantitative biomedical annotation using medical subject heading overrepresentation profiles (meshops)

    BMC Bioinformatics

    (2012)
  • N. Donofrio et al.

    Research-driven medical education and practice: a case for T-shaped professionals

    IBM working document

    (2010)
  • J.A. Evans et al.

    Metaknowledge

    Science

    (2011)
  • H.J. Falk-Krzesinski et al.

    Advancing the science of team science

    Clinical and Translational Science

    (2010)
  • R. Fidel

    User-centered indexing

    Journal of the American Society for Information Science

    (1994)
  • Garfield, E. (2001). From bibliographic coupling to co-citation analysis via algorithmic historio-bibliography....
  • E. Garfield et al.

    Why do we need algorithmic historiography

    Journal of the American Society for Information Science and Technology

    (2003)
  • E. Garfield et al.

    The use of citation data in writing the history of science

    (1964)
  • W. Glänzel

    Bibliometrics-aided retrieval: where information retrieval meets scientometrics

    Scientometrics

    (2015)
  • T.L. Griffiths et al.

    Finding scientific topics

    Proceedings of the National Academy of Sciences

    (2004)
  • A. Halevy et al.

    The unreasonable effectiveness of data

    IEEE Intelligent Systems

    (2009)
  • S.P. Harter

    Psychological relevance and information science

    Journal of the American Society for Information Science

    (1992)
  • Q. He et al.

    Detecting topic evolution in scientific literature: how can citations help?

    Proceedings of the 18th ACM conference on Information and knowledge management (CIKM '09), ACM

    (2009)
  • J. Hendler

    Avoiding Another AI Winter

    IEEE Intelligent Systems

    (2008)
  • B. Hjorland

    The concept of subject in information science

    Journal of Documentation

    (1992)
  • Z. Jiang et al.

    Chronological Citation Recommendation with Information-Need Shifting

    Proceedings of the 24th ACM international conference on information and knowledge management (CIKM ’15), ACM

    (2015)
  • Y. Jo et al.

    Detecting research topics via the correlation between graphs and texts

    Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '07), ACM

    (2007)
  • Jones, B. F. (2005). The burden of knowledge and the ‘death of the renaissance man': is innovation getting harder? NBER...
  • Jones, B. F. (2010). As science evolves, how can science policy? NBER Working Paper 16002....
  • R. Koopman et al.

    Ariadne’s thread—interactive navigation in a world of networked information

    Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA ’15), ACM

    (2015)
  • J.B. Lee et al.

    Link prediction in a modified heterogeneous bibliographic network

    Proceedings of the international conference on advances in social networks analysis and mining (ASONAM), IEEE

    (2012)
  • X. Liu et al.

    Cluster-based retrieval using language models

    Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '04), ACM

    (2004)
  • Liu, X., Yu, Y., Guo, C., Sun, Y., & Gao, L. (2014a). Full-text based context-rich heterogeneous network mining...
  • X. Liu et al.

    Meta-path-based ranking with pseudo relevance feedback on heterogeneous graph for citation recommendation

    Proceedings of the 23rd ACM international conference on conference on information and knowledge management (CIKM ‘14), ACM

    (2014)

Cited by (25)

    • Hotness prediction of scientific topics based on a bibliographic knowledge graph

      2022, Information Processing and Management
      Citation Excerpt :

      A bibliographic network is composed of various objects and links. Through this heterogeneous bibliographic network, we can use a network analysis to identify articles, authors, and venues that contribute to the advances of research topics, generate topic evolution trees, and identify relevant experts for given topics (Caschili, 2014; Jensen, 2016; Neshati et al., 2014). Although gradually evolving from a homogeneous network to a heterogeneous network, the context of a topic is still insufficiently represented in the complex science system.

    • Extracting a core structure from heterogeneous information network using h-subnet and meta-path strength

      2021, Journal of Informetrics
      Citation Excerpt :

      Jiang et al. (2016) and Yu et al. (2017) investigated ranking by constructing heterogeneous bibliographic networks, considering multiple links among papers, authors, and journals. Jensen et al. (2016) developed a method for constructing topic revolution trees containing scholarly entities, which represented a major contribution to topics based on heterogeneous bibliographic networks. Additionally, Wu (2019) proposed a network framework to represent relationships in scientific data and utilized network analysis to handle scientometric questions.

    • Understanding hierarchical structural evolution in a scientific discipline: A case study of artificial intelligence

      2020, Journal of Informetrics
      Citation Excerpt :

      Topic-Rose-Tree is proposed to construct the topic taxonomy by a visual framework (Dou et al., 2013). Jensen, Liu, Yu, and Milojevic (2016) use the unrestricted meta-path to construct and analyze hierarchical structures. Tu et al. (2018) propose a NMF-based method to decompose documents from top to bottom to build the hierarchy.

    • Topic-linked innovation paths in science and technology

      2020, Journal of Informetrics
      Citation Excerpt :

      The multi-relationship fusion provides a more effective analysis basis for innovation evolution research, through combining multiple relationships among different entities based on different attributes of scientific papers into a new relationship (Xu et al., 2017). Jensen, Liu, Yu, and Milojevic (2016) correlated many attributes such as documents, topic words, authors, and citations, using the meta path method, presented the relatedness and similarity of different bibliometric entities, and initially applied it to the exploration of topic evolution. They also applied this method to topic evolution exploration.
