Generation of topic evolution trees from heterogeneous bibliographic networks

doi:10.1016/j.joi.2016.04.002

Journal of Informetrics

Volume 10, Issue 2, May 2016, Pages 606-621

https://doi.org/10.1016/j.joi.2016.04.002

Highlights

• Heterogeneous paper networks are useful for construction of topic evolution trees.
• Meta-path restrictions based on contribution enable context-dependent results.
• Context-sensitive topic evolution trees enhanced information retrieval results.

Abstract

The volume of the existing research literature is such that it can be difficult to find highly relevant information and to develop an understanding of how a scientific topic has evolved. Prior research on topic evolution has often leveraged refinements to Latent Dirichlet Allocation (LDA) to identify emerging topics. However, such methods do not answer the question of which studies contributed to the evolution of a topic. In this paper we show that meta-paths over a heterogeneous bibliographic network (consisting of papers, authors and venues) can be used to identify the network elements that made the greatest contributions to a topic. In particular, by adding derived edges that capture the contribution of papers, authors, and venues to a topic (using the PageRank algorithm), a restricted meta-path over the bibliographic network can be used to restrict the evolution of topics to the context of interest to a researcher. We use such restricted meta-paths to construct a topic evolution tree that can provide researchers with a web-based visualization of the evolution of a scientific topic in the context of interest to them. Compared to baseline networks without restrictions, we find that restricted networks provide more useful topic evolution trees.

Introduction

An exponential growth in the output of scientific literature (Bornmann & Mutz, 2015; Price, 1961; Price, 1963; Van Raan, 2000) is one of the staples of contemporary science. In such an environment, scientists are becoming more specialized with narrower expertise (Jones, 2005; Jones, 2010), while the solutions to difficult problems in science and industry often require interdisciplinary approaches (Wagner et al., 2011) and team-based research (Falk-Krzesinski et al., 2010). Thus, ideally, scientists need to have deep knowledge in their own specialty and broad knowledge across a range of domains, i.e., be “T-shaped” (Barile, Franco, Nota, & Saviano, 2012; Donofrio, Spohrer, & Zadeh, 2010). While the Internet and electronic publications have made it easier to access an unprecedented volume and range of resources, this wealth of information can make it difficult to identify the best resources for learning the foundations of a specific topic or to identify the researchers, papers, and venues that have made the greatest contribution to a specific topic. As an example, each year the U.S. National Library of Medicine database of biomedical literature (MEDLINE) grows by approximately 700,000 articles from 21,000 different journals, and as of 2015 it contains reference data on over 25 million resources.¹ While this abundance of resources provides an incredible opportunity, it can also overwhelm researchers, whether they are searching for information as part of learning a specialization or trying to develop a broad understanding of different fields.

While the current data deluge has exacerbated the problem of retrieving relevant documents, the problem itself is not new. The field of information retrieval has been proposing mostly topical solutions to the problem of retrieving relevant documents since the 1950s. At the same time, the field of bibliometrics/informetrics/scientometrics started harvesting “vast quantities of knowledge about knowledge, or metaknowledge” (Evans & Foster, 2011, p. 721) from journal articles. While the two subfields of information science mostly developed in parallel (Glänzel, 2015; Wolfram, 2015), more recently there have been efforts to bring these two subfields closer in ways that would benefit both (e.g., Glänzel, 2015; Mayr & Scharnhorst, 2015; Mutschke & Mayr, 2015; White, 2007a; White, 2007b; White, 2015; Wolfram, 2015). This paper aims to contribute to those efforts, not only by utilizing knowledge from both subfields, but also by proposing solutions for identifying related research that would be useful in building information systems and delineating reference sets for informetrics research.

In this research, we propose a topic evolution tree (TET) that builds on prior research to present different evolutionary paths for topics to different individuals, based on the context that is relevant to their research. To achieve this we use a heterogeneous bibliographic network (Sun, Han, Yan, Yu, & Wu, 2011) constructed from four types of entities present in a scientific papers repository: papers (P), venues (V), authors (A), topics or keywords (K), and the relationships between them. In the TET we use multiple meta-paths² between topic nodes in the heterogeneous graph to identify the topics that a given topic has evolved from. In constructing the TET for a topic, we calculate a score for each meta-path instance based on the edge weights of the relationships. This approach shares similarities with the calculation of path instance scores as proposed in PathSim (Sun et al., 2011). However, we propose that for topic evolution, better results can be obtained by adding meta-path restrictions based on a new “contribution” edge as proposed for citation recommendation (Liu, Yu, Guo, & Sun, 2014b). Namely, we use not only the edges previously used in bibliographic networks, such as “written by”, “cited by”, “published in”, or “used (topic)” (Lee & Adorna, 2012; Shi, Kong, Yu, Xie, & Wu, 2012; Sun et al., 2011), but also contribution edges derived from PageRank calculations for each type of node (Liu et al., 2014b). For example, the “contributed-by-author” edge is calculated for each topic over a graph of authors citing other authors. We employ contribution edges as a means to restrict the context of a meta-path so that a random walk of the heterogeneous graph generates a TET showing the evolution of a topic in a particular context. In other words, we obtain the evolution of each branch of the TET based on the context or query topic that is specifically of interest to each scholar. The workflow used to generate a topic evolution tree is shown in Fig. 1.
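To make the idea of a restricted meta-path concrete, the following is a minimal sketch and not the scoring used in the paper. It assumes that a meta-path instance is scored as the product of its edge weights and that the restriction is a simple threshold on each paper's contribution score; both choices, along with every node name, weight, and the `threshold` parameter, are hypothetical and chosen only for illustration.

```python
from math import prod

# Hypothetical toy edge weights; node names and values are illustrative only.
EDGE_WEIGHTS = {
    ("K:cloud computing", "P:p1"): 0.8,  # topic -> paper that uses it
    ("P:p1", "P:p2"): 1.0,               # paper -> paper it cites
    ("P:p2", "K:mapreduce"): 0.6,        # paper -> topic it uses
    ("P:p1", "P:p3"): 1.0,
    ("P:p3", "K:mapreduce"): 0.9,
}

# Hypothetical contribution scores of each paper to the query topic.
CONTRIBUTION = {"P:p1": 0.30, "P:p2": 0.05, "P:p3": 0.40}


def path_score(path, weights):
    """Score one meta-path instance as the product of its edge weights (assumed)."""
    return prod(weights[(a, b)] for a, b in zip(path, path[1:]))


def restricted_paths(paths, contribution, threshold):
    """Keep path instances whose paper nodes all meet the contribution threshold."""
    return [
        p for p in paths
        if all(contribution.get(n, 0.0) >= threshold
               for n in p if n.startswith("P:"))
    ]


# Two instances of the K-P-P-K meta-path between the same pair of topics.
paths = [
    ["K:cloud computing", "P:p1", "P:p2", "K:mapreduce"],
    ["K:cloud computing", "P:p1", "P:p3", "K:mapreduce"],
]
kept = restricted_paths(paths, CONTRIBUTION, threshold=0.1)
scores = [path_score(p, EDGE_WEIGHTS) for p in kept]
```

In this toy data the instance passing through `P:p2` is dropped because that paper contributes little to the query topic, so only the instance through `P:p3` is scored.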

As an example, in the information retrieval domain, a researcher might be interested in the evolution of the topic “Cloud computing”³ (user query) and the papers, authors, or venues that made the most significant contribution to the evolution of that topic. The topic “Cloud computing” would be the root node of the TET and each edge from that root to a child node represents the evolution of “Cloud computing” from a contributing topic. One of the topics contributing to “Cloud computing” is the Big Data topic “MapReduce”, so there would be an edge in the TET from “Cloud computing” to a child node labeled “MapReduce”. Since the restricted meta-path uses the contribution edge from the bibliographic network, the subtree in the TET for “MapReduce” focuses on the evolution of that topic in the context of “Cloud computing”. Thus, the evolution of the topic “MapReduce” could be different in the context of cloud computing than in the context of data security, since the papers, authors, or venues covering “MapReduce” will have made at least slightly different contributions in one context versus the other. It should be noted that this does not imply that the history of a topic such as cloud computing or MapReduce cannot be ascertained, but that the contribution edge added to a heterogeneous graph can be used to generate a TET containing the history of a topic from the viewpoint of the context of interest to the user. In the above example, the evolution of MapReduce would be included only in the context of its contribution to cloud computing. Conceptually this is similar to the research by Small, Boyack, and Klavans (2014) where the emergence of topics can be viewed from different levels of granularity.
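The tree structure described in this example can be sketched as a small recursive data structure. This is an illustrative sketch only: the `CONTRIBUTIONS` table stands in for the scores that restricted meta-paths would produce, and its entries, the `top_k` and `depth` parameters, and the `TopicNode` class are all hypothetical. The point it shows is that the children of a topic are looked up under the root query (the context), so the same topic can expand differently in different trees.

```python
from dataclasses import dataclass, field


@dataclass
class TopicNode:
    topic: str
    score: float = 1.0
    children: list = field(default_factory=list)


# Hypothetical table: (context, topic) -> [(contributing topic, score), ...]
CONTRIBUTIONS = {
    ("cloud computing", "cloud computing"): [
        ("mapreduce", 0.7), ("virtualization", 0.5), ("grid computing", 0.2),
    ],
    ("cloud computing", "mapreduce"): [("distributed file systems", 0.6)],
    ("data security", "mapreduce"): [("access control", 0.4)],
}


def build_tet(context, topic=None, score=1.0, depth=2, top_k=2, seen=None):
    """Recursively expand a topic using contributions restricted to `context`."""
    topic = context if topic is None else topic
    seen = (seen or set()) | {topic}  # topics on the path from the root
    node = TopicNode(topic, score)
    if depth > 0:
        ranked = sorted(CONTRIBUTIONS.get((context, topic), []),
                        key=lambda c: c[1], reverse=True)
        candidates = [c for c in ranked if c[0] not in seen][:top_k]
        for child, child_score in candidates:
            node.children.append(
                build_tet(context, child, child_score, depth - 1, top_k, seen)
            )
    return node


tree = build_tet("cloud computing")
```

Calling `build_tet("data security", "mapreduce")` on the same table expands “mapreduce” through a different set of contributing topics, which mirrors the context dependence described above.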

Section snippets

Related work

In this paper we argue for the usefulness of restricted meta-paths over heterogeneous bibliographic networks both for finding highly relevant information and for developing an understanding of how a scientific topic has evolved. Such bibliographic networks are one example of heterogeneous information networks (HINs) that capture the semantics of a real-world network (Sun & Han, 2012). Research on mining HINs has explored clustering, classification, and relationship prediction, but the

Data and methods

In this study, we investigate the topic evolution tree (TET) generation problem. Given a topic k* of interest to a researcher (in this case, a user query), where k* is one of the K topics covered by papers in a scientific papers repository, a number of topics covered in the repository can contribute to k*. Each of those topics can in turn be contributed to by other topics. Based on this kind of relationship, we can generate a tree structure for a query on the evolution of topic k*, where k* is used as

Results

To validate the method proposed in this study, we constructed separate heterogeneous graphs for biomedical research (PMC) and computer science (ACM). The numbers of vertices and edges in each graph are listed in Table 1, as discussed in Section 3.1. The direct relationships, such as the authors of each paper, were extracted from the bibliographic metadata, and the PageRank calculations used to generate the contribution edges were performed using version 2.0.1 of the JUNG graph library.
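For readers unfamiliar with PageRank, the following is a minimal power-iteration sketch in plain Python. It is not the JUNG API and not the authors' code; the damping factor of 0.85, the fixed iteration count, the uniform handling of dangling nodes, and the toy author-citation graph are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    graph maps each node to the list of nodes it cites. Nodes without
    outgoing edges spread their rank uniformly over all nodes (assumed).
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank mass held by nodes that cite nobody.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v, targets in graph.items():
            for t in targets:
                new[t] += damping * rank[v] / len(targets)
        rank = new
    return rank


# Hypothetical author-cites-author graph restricted to one topic.
author_citations = {"a": ["c"], "b": ["c"], "c": []}
ranks = pagerank(author_citations)
top_author = max(ranks, key=ranks.get)
```

Under these assumptions the score of each author could serve as the weight of a “contributed-by-author” edge for the topic; here author `c`, who is cited by both others, receives the highest rank.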

Discussion and conclusions

Recent research has applied the mining of relationships in heterogeneous graphs such as bibliographic networks to purposes such as path similarity, identifying emerging research topics, and citation recommendation. In this paper we examined mining heterogeneous graphs from different scientific domains to map the evolution of topics as a TET. Prior research has focused on heterogeneous networks composed of objects and relationships extracted directly from bibliographic metadata, and while we

Authors’ contributions

Conceived and designed the analysis, collected the data, contributed data or analysis tools, performed the analysis and wrote the paper: Scott Jensen.

Conceived and designed the analysis, collected the data, contributed data or analysis tools, performed the analysis and wrote the paper: Xiaozhong Liu.

Collected the data and contributed data or analysis tools: Yingying Yu.

Wrote the paper: Staša Milojevic.

Acknowledgements

The authors wish to thank the National Center for Biotechnology Information for making bibliographical metadata on biomedical literature available through PubMed and also wish to thank the Association for Computing Machinery for making bibliographic metadata on computer science literature in the ACM digital library available for this project.

References (72)

H. Small et al. (2014). Identifying emerging topics in science and technology. Research Policy.

A.G. Sutcliffe et al. (2000). Evaluating the effectiveness of visual user interfaces for information retrieval. International Journal of Human-Computer Studies.

N.J. Van Eck et al. (2014). CitNetExplorer: a new software tool for analyzing and visualizing citation networks. Journal of Informetrics.

C.S. Wagner et al. (2011). Approaches to understanding and measuring interdisciplinary scientific research: a review of the literature. Journal of Informetrics.

S. Barile et al. (2012). Structure and dynamics of a T-shaped knowledge: from individuals to cooperating communities of practice. Service Science.

M.J. Bates (1998). Indexing and access for digital libraries and the Internet: human, database, and domain factors. Journal of the American Society for Information Science.

J. Beck et al. (2013). NLM DTD to NISO JATS Z39.96-2012. The NCBI Handbook [Internet].

D.M. Blei et al. (2006). Dynamic topic models. Proceedings of the 23rd international conference on machine learning, ACM.

D.M. Blei et al. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research.

L. Bolelli et al. (2009). Finding topic trends in digital libraries. Proceedings of the 9th ACM/IEEE-CS joint conference on digital libraries (JCDL '09), ACM.

L. Bornmann et al. (2015). Growth rates of modern science: a bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology.

C. Chen (2006). CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science and Technology.

W.A. Cheung et al. (2012). Quantitative biomedical annotation using medical subject heading overrepresentation profiles (meshops). BMC Bioinformatics.

N. Donofrio et al. (2010). Research-driven medical education and practice: a case for T-shaped professionals. IBM working document.

J.A. Evans et al. (2011). Metaknowledge. Science.

H.J. Falk-Krzesinski et al. (2010). Advancing the science of team science. Clinical and Translational Science.

R. Fidel (1994). User-centered indexing. Journal of the American Society for Information Science.

Garfield, E. (2001). From bibliographic coupling to co-citation analysis via algorithmic historio-bibliography....

E. Garfield et al. (2003). Why do we need algorithmic historiography. Journal of the American Society for Information Science and Technology.

E. Garfield et al. (1964). The use of citation data in writing the history of science.

W. Glänzel (2015). Bibliometrics-aided retrieval: where information retrieval meets scientometrics. Scientometrics.

T.L. Griffiths et al. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences.

A. Halevy et al. (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems.

S.P. Harter (1992). Psychological relevance and information science. Journal of the American Society for Information Science.

Q. He et al. (2009). Detecting topic evolution in scientific literature: how can citations help? Proceedings of the 18th ACM conference on Information and knowledge management (CIKM '09), ACM.

J. Hendler (2008). Avoiding Another AI Winter. IEEE Intelligent Systems.

B. Hjorland (1992). The concept of subject in information science. Journal of Documentation.

Z. Jiang et al. (2015). Chronological Citation Recommendation with Information-Need Shifting. Proceedings of the 24th ACM international conference on information and knowledge management (CIKM '15), ACM.

Y. Jo et al. (2007). Detecting research topics via the correlation between graphs and texts. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '07), ACM.

Jones, B. F. (2005). The burden of knowledge and the 'death of the renaissance man': is innovation getting harder? NBER...

Jones, B. F. (2010). As science evolves, how can science policy? NBER Working Paper 16002....

R. Koopman et al. (2015). Ariadne’s thread—interactive navigation in a world of networked information. Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA '15), ACM.

J.B. Lee et al. (2012). Link prediction in a modified heterogeneous bibliographic network. Proceedings of the international conference on advances in social networks analysis and mining (ASONAM), IEEE.

X. Liu et al. (2004). Cluster-based retrieval using language models. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '04), ACM.

Liu, X., Yu, Y., Guo, C., Sun, Y., & Gao, L. (2014a). Full-text based context-rich heterogeneous network mining...

X. Liu et al. (2014). Meta-path-based ranking with pseudo relevance feedback on heterogeneous graph for citation recommendation. Proceedings of the 23rd ACM international conference on conference on information and knowledge management (CIKM '14), ACM.

Cited by (25)

Developing a four-entities reinforced rank model to evaluate the topic influence in academic networks
2023, Journal of Informetrics
Several studies have reported on metrics for measuring the influence of scientific topics from different perspectives; however, current ranking methods ignore the reinforcing effect of other academic entities on topic influence. In this paper, we developed an effective topic ranking model, 4EFRRank, by modeling the influence transfer mechanism among all academic entities in a complex academic network using a four-layer network design that incorporates the strengthening effect of multiple entities on topic influence. The PageRank algorithm is utilized to calculate the initial influence of topics, papers, authors, and journals in a homogeneous network, whereas the HITS algorithm is utilized to express the mutual reinforcement between topics, papers, authors, and journals in a heterogeneous network, iteratively calculating the final topic influence value. Based on a specific interdisciplinary domain, social media data, we applied the 4ERRank model to the 19,527 topics included in the criteria. The experimental results demonstrate that the 4ERRank model can successfully synthesize the performance of classic co-word metrics and effectively reflect high citation topics. This study enriches the methodology for assessing topic impact and contributes to the development of future topic-based retrieval and prediction tasks.
Hotness prediction of scientific topics based on a bibliographic knowledge graph
2022, Information Processing and Management
Citation Excerpt :
A bibliographic network is composed of various objects and links. Through this heterogeneous bibliographic network, we can use a network analysis to identify articles, authors, and venues that contribute to the advances of research topics, generate topic evolution trees, and identify relevant experts for given topics (Caschili, 2014; Jensen, 2016; Neshati et al., 2014). Although gradually evolving from a homogeneous network to a heterogeneous network, the context of a topic is still insufficiently represented in the complex science system.
As a part of innovation in forecasting, scientific topic hotness prediction plays an essential role in dynamic scientific topic assessment and domain knowledge transformation modeling. To improve the topic hotness prediction performance, we propose an innovative model to estimate the co-evolution of scientific topic and bibliographic entities, which leverages a novel dynamic Bibliographic Knowledge Graph (BKG). Then, one can predict the topic hotness by using various kinds of topological entity information, i.e., TopicRank, PaperRank, AuthorRank, and VenueRank, along with pre-trained node embedding, i.e., node2vec embedding, and different pooling techniques. To validate the proposed method, we constructed a new BKG by using 4.5 million PubMed Central publications plus MeSH (Medical Subject Heading) thesaurus and witnessed the essential prediction improvement with extensive experiment outcomes over 10 years observations.
Extracting a core structure from heterogeneous information network using h-subnet and meta-path strength
2021, Journal of Informetrics
Citation Excerpt :
Jiang et al. (2016) and Yu et al. (2017) investigated ranking by constructing heterogeneous bibliographic networks, considering multiple links among papers, authors, and journals. Jensen et al. (2016) developed a method for constructing topic revolution trees containing scholarly entities, which represented a major contribution to topics based on heterogeneous bibliographic networks. Additionally, Wu (2019) proposed a network framework to represent relationships in scientific data and utilized network analysis to handle scientometric questions.
Based on the analytical methodology of homogeneous networks, we present a novel method to extract a core structure from a heterogeneous network. By extending two forms of meta-paths to represent the relationships between attribute edges, we propose the meta-path strength as a measure of the link strength of attribute edges in a heterogeneous information network. Inspired by the h-subnet method for weighted complex networks, we identify important attribute edges based on the h-cutoff of meta-path strengths. Additionally, important base edges can be filtered according to the base nodes on the retained attribute edges. Therefore, a heterogeneous h-subnet can be obtained by combining important attribute edges and base edges. Two bibliographic information networks are used to evaluate the proposed method empirically, and the results indicate that the extracted heterogeneous h-subnets contain less than 1% of the nodes and edges of the original networks and can cover different features of at least one of several other core structures.
Understanding hierarchical structural evolution in a scientific discipline: A case study of artificial intelligence
2020, Journal of Informetrics
Citation Excerpt :
Topic-Rose-Tree is proposed to construct the topic taxonomy by a visual framework (Dou et al., 2013). Jensen, Liu, Yu, and Milojevic (2016) use the unrestricted meta-path to construct and analyze hierarchical structures. Tu et al. (2018) propose a NMF-based method to decompose documents from top to bottom to build the hierarchy.
Detecting what type of knowledge constitutes a discipline, tracking how the knowledge changes, and understanding why the changes are triggered are the key issues in analyzing scientific development from a macro perspective, which is usually analyzed by the topic of evolution. However, traditional methods assume that the disciplinary structure is flat with only one-layer topics, rather than a tree-like structure with hierarchical topics, which leads to the inability of existing methods to effectively examine the details of the evolution, such as the interactions between different research directions. In this paper, we take artificial intelligence (AI) as a case in which we study its hierarchical structural evolution, more precisely inspecting disciplinary development, by analyzing 65,887 AI-related research papers published during a 10-year period from 2009 to 2018. From a hierarchical topic model that can construct a topic-tree with multi-layer organizations, we design a visual analysis model for the topic-tree to systematically and visually investigate how knowledge transfers along the topic-tree and how the topic-tree changes over time. Moreover, some assistant indicators are employed to help in the exploration of the complicated structural evolution. Then, we discover the latent relationship between the sub-structures within a topic as well as the triggering reason for the knowledge migration. Based on these results, we conclude that different topics have different development patterns and that the recent artificial intelligence revolution stems from the interactions among the different topics.
Topic-linked innovation paths in science and technology
2020, Journal of Informetrics
Citation Excerpt :
The multi-relationship fusion provides a more effective analysis basis for innovation evolution research, through combining multiple relationships among different entities based on different attributes of scientific papers into a new relationship (Xu et al., 2017). Jensen, Liu, Yu, and Milojevic (2016) correlated many attributes such as documents, topic words, authors, and citations, using the meta path method, presented the relatedness and similarity of different bibliometric entities, and initially applied it to the exploration of topic evolution. They also applied this method to topic evolution exploration.
In the modern world, science and technology jointly determine the evolutionary path of scientific innovation, with an increasingly close relationship between them. Therefore, it is important to study the identification method of the innovation path, based on the linkage of topics in science and technology. This study focuses on connected topics utilizing bibliometric analysis, thereby exploring the identification method for innovation paths based on the linkage of scientific and technological topics. The internal mechanism of knowledge dissemination and the relationship between science and technology are revealed and described in detail by measuring the linkage of knowledge units. For practical bibliometric analyses, research papers and patent literature were used to characterize scientific research and technological research to reveal the innovation path for the interaction of science and technology quantitatively, automatically, and visually. Experimental study shows that analysis of the topic-linked path of science and technology, along with the integration of multi-relationships, can effectively identify important science- and technology-related topics in a field in the evolution process, and help grasp the key points of basic research and applied research.
Application of graph theory in the library domain—Building a faceted framework based on a literature review
2022, Journal of Librarianship and Information Science
