Constructing and evaluating automated literature review systems

Scientometrics

Abstract

Automated literature reviews have the potential to accelerate knowledge synthesis and provide new insights. However, a lack of labeled ground-truth data has made it difficult to develop and evaluate these methods. We propose a framework that uses the reference lists from existing review papers as labeled data, which can then be used to train supervised classifiers, allowing for experimentation and testing of models and features at a large scale. We demonstrate our framework by training classifiers using different combinations of citation- and text-based features on 500 review papers. We use the R-Precision scores for the task of reconstructing the review papers’ reference lists as a way to evaluate and compare methods. We also extend our method, generating a novel set of articles relevant to the fields of misinformation studies and science communication. We find that our method can identify many of the most relevant papers for a literature review from a large set of candidate papers, and that our framework allows for development and testing of models and features to incrementally improve the results. The models we build are able to identify relevant papers even when starting with a very small set of seed papers. We also find that the methods can be adapted to identify previously undiscovered articles that may be relevant to a given topic.
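As a rough illustration of the evaluation measure named above, the sketch below computes R-Precision for a single review paper: if R references are to be recovered, the score is the fraction of the top R ranked candidates that belong to that set. The function name, argument names, and toy identifiers are hypothetical and are not taken from the authors' code.

```python
from typing import Iterable, Sequence


def r_precision(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """Fraction of the top-R ranked candidates that are relevant,
    where R is the number of relevant (target) items."""
    relevant = set(relevant_ids)
    r = len(relevant)
    if r == 0:
        raise ValueError("R-Precision is undefined when there are no relevant items")
    top_r = ranked_ids[:r]
    hits = sum(1 for paper_id in top_r if paper_id in relevant)
    return hits / r


# Toy example with made-up identifiers: three target references,
# two of which appear among the top three ranked candidates.
ranked = ["p1", "p7", "p3", "p9", "p4"]
targets = {"p1", "p3", "p4"}
score = r_precision(ranked, targets)
print(round(score, 3))  # 0.667
```

Because the cutoff equals the number of targets, precision and recall coincide at that point, which makes the score comparable across review papers with reference lists of different lengths.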

Notes

  1. For the clustering, we used the cleaned version of the Web of Science network as described in the “Data” section. We used the network after cleaning for citations, but before removing papers with other missing metadata. This version of the network had 73,725,142 nodes and 1,164,650,021 edges.

  2. Since every node is in exactly one cluster (even if the cluster is only one node), and the leaves of the hierarchy tree represent the nodes themselves, the minimum depth in the hierarchy is 2. In this case, the first level is the cluster the node belongs to, and the second level is the node.

  3. We divide the standard measure of distance between nodes in a tree by the sum of the nodes’ depths. This is because, in the case of hierarchical Infomap clustering, the total depth varies throughout the tree, and the actual depth of the nodes is arbitrary when describing the distance between the nodes. For example, a pair of nodes in the same bottom-level cluster at depth 5 are no closer together than a pair of nodes in the same bottom-level cluster at depth 2.

  4. Machine learning experiments were conducted using scikit-learn version 0.20.3 running on Python 3.6.9.

  5. Ideally, for each review paper we would remove the review paper itself and all nodes and links dated after its publication year, and then rank and cluster the resulting network. However, performing a separate clustering for each review paper would be computationally infeasible, so we performed ranking and clustering only once. Any bias introduced by this should be small, as the clustering method we use considers the overall flow of information across multiple pathways, which makes it robust to the removal of individual nodes and links in large networks.

  6. We chose to report the best-performing model for each experiment, rather than restricting to a single classifier type. This decision did not have a large effect on the results. We chose to be flexible in which classifier to use because there are differences among the different review articles. We will continue to explore the nature of these differences in future work.

  7. The actual feature used was the absolute difference between a paper’s publication year and the mean publication year of the seed papers.

  8. We used the spaCy library (version 2.2.3) with a pretrained English language model (core_web_lg version 2.2.5).

  9. The models that had both network and title embedding features, but not publication year (“Cluster, PageRank, Embeddings”), performed worse in general than models with embeddings alone, with scores tending to be between 0.5 and 0.7. The reason for this is unclear.

  10. Since the same random seeds (1, 2, 3, 4, 5) were used each time, the smaller seed sets are always subsets of the larger ones. For example, for a given review article and a given random seed, the 100 seed papers identified are all included in the set of 150; the 50 seed papers are all included in both the set of 100 and the set of 150; and so on.

  11. See Data and Methods at http://www.misinformationresearch.org for details.
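Two of the notes above describe small computations that are easier to follow in code: the depth-normalized tree distance (note 3) and the nesting of seed sets under a fixed random seed (note 10). The following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes each node's position in the cluster hierarchy is encoded as a tuple of indices from the top-level cluster down to the node, and it obtains nested seed sets by shuffling once and taking prefixes, which is only one way to get that property. All names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical encoding: indices from the top-level cluster down to the node,
# so a node in a flat cluster has a path of length 2 (see note 2).
Path = Tuple[int, ...]


def normalized_tree_distance(a: Path, b: Path) -> float:
    """Tree distance between two leaves divided by the sum of their depths."""
    if not a or not b:
        raise ValueError("paths must be non-empty")
    # Depth of the lowest common ancestor = length of the shared prefix.
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    total_depth = len(a) + len(b)
    # Standard tree distance: edges from each leaf up to the common ancestor.
    distance = total_depth - 2 * shared
    return distance / total_depth


def nested_seed_sets(reference_ids: Sequence[str],
                     sizes: Sequence[int],
                     seed: int) -> List[List[str]]:
    """Shuffle once with a fixed seed and take prefixes of each size,
    so every smaller seed set is contained in every larger one."""
    shuffled = sorted(reference_ids)  # sort so the result ignores input order
    random.Random(seed).shuffle(shuffled)
    return [shuffled[:k] for k in sizes]


# Nodes that share no cluster are at the maximum distance regardless of depth.
print(normalized_tree_distance((1, 4), (2, 9)))
print(normalized_tree_distance((1, 2, 3, 4, 5), (6, 7, 8, 9, 10)))

refs = [f"ref{i}" for i in range(200)]
seeds_50, seeds_100 = nested_seed_sets(refs, [50, 100], seed=1)
print(set(seeds_50) <= set(seeds_100))
```

Dividing by the summed depths bounds the measure between 0 (the same node) and 1 (no shared cluster), so pairs located in deep and shallow parts of the hierarchy are placed on a common scale.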

References

  • Albarqouni, L., Doust, J., & Glasziou, P. (2017). Patient preferences for cardiovascular preventive medication: A systematic review. Heart, 103(20), 1578–1586. https://doi.org/10.1136/heartjnl-2017-311244.

  • Ammar, W., Groeneveld, D., Bhagavatula, C., Beltagy, I., Crawford, M., Downey, D., et al. (2018). Construction of the literature graph in Semantic Scholar. In Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: Human language technologies, Volume 3 (Industry Papers) (pp. 84–91). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-3011.

  • Bae, S. H., Halperin, D., West, J., Rosvall, M., & Howe, B. (2013). Scalable flow-based community detection for large-scale network analysis. In 2013 IEEE 13th international conference on data mining workshops (pp. 303–310). https://doi.org/10.1109/ICDMW.2013.138.

  • Bastian, H., Glasziou, P., & Chalmers, I. (2010). Seventy-five trials and eleven systematic reviews a day: How will we ever keep up? PLOS Medicine, 7(9), e1000326. https://doi.org/10.1371/journal.pmed.1000326.

  • Beel, J., Gipp, B., Langer, S., & Breitinger, C. (2016). Research-paper recommender systems: A literature survey. International Journal on Digital Libraries, 17(4), 305–338. https://doi.org/10.1007/s00799-015-0156-0.

  • Belter, C. W. (2016). Citation analysis as a literature search method for systematic reviews. Journal of the Association for Information Science and Technology, 67(11), 2766–2777. https://doi.org/10.1002/asi.23605.

  • Chen, T. T. (2012). The development and empirical study of a literature review aiding system. Scientometrics, 92(1), 105–116. https://doi.org/10.1007/s11192-012-0728-3.

  • Cormack, G. V., & Grossman, M. R. (2014). Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In Proceedings of the 37th international ACM SIGIR conference on research & development in information retrieval, SIGIR ’14 (pp. 153–162). New York: ACM. https://doi.org/10.1145/2600428.2609601.

  • Djidjev, H. N., Pantziou, G. E., & Zaroliagis, C. D. (1991). Computing shortest paths and distances in planar graphs. In J. L. Albert, B. Monien, & M. R. Artalejo (Eds.), Automata, languages and programming (pp. 327–338). Berlin: Springer.

  • Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3–5), 75–174.

  • Greenhalgh, T., & Peacock, R. (2005). Effectiveness and efficiency of search methods in systematic reviews of complex evidence: Audit of primary sources. BMJ, 331(7524), 1064–1065. https://doi.org/10.1136/bmj.38636.593461.68.

  • Gupta, S., & Varma, V. (2017). Scientific article recommendation by using distributed representations of text and graph. In Proceedings of the 26th international conference on world wide web companion, WWW ’17 Companion (pp. 1267–1268). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland. https://doi.org/10.1145/3041021.3053062.

  • Horsley, T., Dingwall, O., & Sampson, M. (2011). Checking reference lists to find additional studies for systematic reviews. Cochrane Database of Systematic Reviews. https://doi.org/10.1002/14651858.MR000026.pub2.

  • Janssens, A. C. J. W., & Gwinn, M. (2015). Novel citation-based search method for scientific literature: Application to meta-analyses. BMC Medical Research Methodology, 15(1), 84. https://doi.org/10.1186/s12874-015-0077-z.

  • Jha, R., Abu-Jbara, A., & Radev, D. (2013). A system for summarizing scientific topics starting from keywords. In Proceedings of the 51st annual meeting of the association for computational linguistics (volume 2: short papers) (pp. 572–577).

  • Kanakia, A., Shen, Z., Eide, D., & Wang, K. (2019). A scalable hybrid research paper recommender system for Microsoft Academic. In The World Wide Web conference, WWW ’19 (pp. 2893–2899). Association for Computing Machinery. https://doi.org/10.1145/3308558.3313700.

  • Kong, X., Mao, M., Wang, W., Liu, J., & Xu, B. (2018). VOPRec: Vector representation learning of papers with text information and structural identity for recommendation. IEEE Transactions on Emerging Topics in Computing. https://doi.org/10.1109/TETC.2018.2830698.

  • Larsen, K. R., Hovorka, D., Dennis, A., & West, J. (2019). Understanding the elephant: The discourse approach to boundary identification and corpus construction for theory review articles. Journal of the Association for Information Systems, 20, 7. https://doi.org/10.17705/1jais.00556.

  • Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval (1st ed.). New York: Cambridge University Press.

  • Miwa, M., Thomas, J., O’Mara-Eves, A., & Ananiadou, S. (2014). Reducing systematic review workload through certainty-based screening. Journal of Biomedical Informatics, 51, 242–253. https://doi.org/10.1016/j.jbi.2014.06.005.

  • Murphy, K. P. (2010). Machine Learning: A Probabilistic Perspective. Cambridge: MIT Press.

  • National Academies of Sciences, Engineering, and Medicine. (2017). Communicating science effectively: A research agenda. National Academies Press.

  • O’Mara-Eves, A., Thomas, J., McNaught, J., Miwa, M., & Ananiadou, S. (2015). Using text mining for study identification in systematic reviews: A systematic review of current approaches. Systematic Reviews, 4(1), 1–22. https://doi.org/10.1186/2046-4053-4-5.

  • Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab. http://ilpubs.stanford.edu:8090/422.

  • Portenoy, J., & West, J. D. (2019). Supervised learning for automated literature review. BIRNDL, 2019, 9.

  • Robinson, K. A., Dunn, A. G., Tsafnat, G., & Glasziou, P. (2014). Citation networks of related trials are often disconnected: Implications for bidirectional citation searches. Journal of Clinical Epidemiology, 67(7), 793–799. https://doi.org/10.1016/j.jclinepi.2013.11.015.

  • Ronzano, F., & Saggion, H. (2015). Dr. Inventor framework: Extracting structured information from scientific publications. In International conference on discovery science (pp. 209–220). Springer.

  • Rosvall, M., & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4), 1118–1123.

  • Silva, F. N., Amancio, D. R., Bardosova, M., Costa, L. D. F., & Oliveira, O. N. (2016). Using network science and text analytics to produce surveys in a scientific topic. Journal of Informetrics, 10(2), 487–502. https://doi.org/10.1016/j.joi.2016.03.008.

  • Tsafnat, G., Dunn, A., Glasziou, P., & Coiera, E. (2013). The automation of systematic reviews: Would lead to best currently available evidence at the push of a button. BMJ, 346(7891), 8–8.

  • Wallace, B. C., Trikalinos, T. A., Lau, J., Brodley, C., & Schmid, C. H. (2010). Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinformatics, 11(1), 55. https://doi.org/10.1186/1471-2105-11-55.

  • Webster, J., & Watson, R. T. (2002). Analyzing the past to prepare for the future: Writing a literature review. MIS Quarterly, 26(2), xiii–xxiii.

  • Williams, K., Wu, J., Choudhury, S. R., Khabsa, M., & Giles, C. L. (2014). Scholarly big data information extraction and integration in the CiteSeerx digital library. In 2014 IEEE 30th international conference on data engineering workshops (pp. 68–73). IEEE. https://doi.org/10.1109/ICDEW.2014.6818305.

  • Yu, Z., Kraft, N. A., & Menzies, T. (2018). Finding better active learners for faster literature reviews. Empirical Software Engineering, 23(6), 3161–3186. https://doi.org/10.1007/s10664-017-9587-0.

  • Yu, Z., & Menzies, T. (2019). FAST2: An intelligent assistant for finding relevant papers. Expert Systems with Applications, 120, 57–71. https://doi.org/10.1016/j.eswa.2018.11.021.

  • Zitt, M. (2015). Meso-level retrieval: IR-bibliometrics interplay and hybrid citation-words methods in scientific fields delineation. Scientometrics, 102(3), 2223–2245. https://doi.org/10.1007/s11192-014-1482-5.

Acknowledgements

We thank Dr. Chirag Shah for helpful conversations around evaluation measures, and Clarivate Analytics for the use of the Web of Science data. We also thank three anonymous reviewers for constructive feedback. This work was facilitated through the use of advanced computational, storage, and networking infrastructure provided by the Hyak supercomputer system and funded by the STF at the University of Washington.

Author information

Corresponding author

Correspondence to Jason Portenoy.

Appendix

Example of autoreview results

Below is a sample of results (random samples of true positives, false positives, true negatives, and false negatives) from the autoreview classifier using the references from Fortunato (2010), a review on community detection in graphs, with a random seed of 5. The “Rank” is the position of the candidate paper when candidates are sorted in descending order of classifier score. The false positives, while not in the original reference list, still seem to be relevant to the topic (e.g., “Overlapping Community Search for Social Networks”). The false negatives tend to be ranked higher than the true negatives, suggesting that the assigned score does tend to predict relevant documents, even when they fall below the cutoff.

True Positives

| Rank | Title | Year |
| --- | --- | --- |
| 10 | Role Models For Complex Networks | 2007 |
| 21 | Adaptive Clustering Algorithm For Community Detection In Complex Networks | 2008 |
| 24 | Random Field Ising Model And Community Structure In Complex Networks | 2006 |
| 50 | Bayesian Approach To Network Modularity | 2008 |
| 58 | The Effect Of Size Heterogeneity On Community Identification In Complex Networks | 2006 |
| 76 | Loops And Multiple Edges In Modularity Maximization Of Networks | 2010 |
| 97 | Synchronization Interfaces And Overlapping Communities In Complex Networks | 2008 |
| 118 | The Analysis And Dissimilarity Comparison Of Community Structure | 2006 |
| 119 | Searching For Communities In Bipartite Networks | 2008 |
| 208 | Epidemic Spreading In Scale-Free Networks | 2001 |

False Positives

| Rank | Title | Year |
| --- | --- | --- |
| 72 | Modularity From Fluctuations In Random Graphs And Complex Networks | 2004 |
| 94 | Clustering Coefficient And Community Structure Of Bipartite Networks | 2008 |
| 98 | Detecting Overlapping Community Structures In Networks | 2009 |
| 106 | Size Reduction Of Complex Networks Preserving Modularity | 2007 |
| 129 | Extracting Weights From Edge Directions To Find Communities In Directed Networks | 2010 |
| 146 | Identifying The Role That Animals Play In Their Social Networks | 2004 |
| 150 | Seeding The Kernels In Graphs: Toward Multi-Resolution Community Analysis | 2009 |
| 159 | Overlapping Community Search For Social Networks | 2010 |
| 162 | Modularity Clustering Is Force-Directed Layout | 2009 |
| 185 | Cartography Of Complex Networks: Modules And Universal Roles | 2005 |

True Negatives

| Rank | Title | Year |
| --- | --- | --- |
| 2967 | Graph Models Of Complex Information-Sources | 1979 |
| 120959 | Parallel Distributed Network Characteristics Of The Dsct | 1992 |
| 322251 | Hidden Semantic Concept Discovery In Region Based Image Retrieval | 2004 |
| 327308 | A Multilevel Matrix Decomposition Algorithm For Analyzing Scattering From Large Structures | 1996 |
| 394850 | Multiple-Model Approach To Finite Memory Adaptive Filtering | 1992 |
| 749175 | Statistical Computer-Aided Design For Microwave Circuits | 1996 |
| 943999 | Segmental Anhidrosis In The Spinal Dermatomes In Sjogrens Syndrome-Associated Neuropathy | 1993 |
| 1121787 | Rheological And Dielectrical Characterization Of Melt Mixed Polycarbonate-Multiwalled Carbon Nanotube Composites | 2004 |
| 1177851 | Explaining The Rate Spread On Corporate Bonds | 2001 |
| 1256866 | The Cyanobacterial Cell Division Factor Ftn6 Contains An N-Terminal Dnad-Like Domain | 2009 |

False Negatives

| Rank | Title | Year |
| --- | --- | --- |
| 259 | Heterogeneity In Oscillator Networks: Are Smaller Worlds Easier To Synchronize? | 2003 |
| 324 | Assessing The Relevance Of Node Features For Network Structure | 2009 |
| 385 | The Use Of Edge-Betweenness Clustering To Investigate Biological Function In Protein Interaction Networks | 2005 |
| 6605 | A Measure Of Betweenness Centrality Based On Random Walks | 2005 |
| 19863 | On Decomposition Of Networks In Minimally Interconnected Subnetworks | 1969 |
| 59900 | Objective Criteria For Evaluation Of Clustering Methods | 1971 |
| 139178 | Optimization With Extremal Dynamics | 2001 |
| 250583 | The Tie Effect On Information Dissemination: The Spread Of A Commercial Rumor In Hong Kong | 2002 |
| 281952 | Compartments Revealed In Food-Web Structure | 2003 |
| 1203248 | Dynamic Asset Trees And Portfolio Analysis | 2002 |
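The four groups above follow from comparing each candidate's rank with a cutoff and checking whether the candidate is in the review's reference list. The sketch below shows one way to assign those labels. The cutoff value is an assumption here, since the appendix does not state it, and the function name and toy identifiers are hypothetical.

```python
from typing import Dict, Iterable, Sequence


def label_outcomes(ranked_ids: Sequence[str],
                   reference_ids: Iterable[str],
                   cutoff: int) -> Dict[str, str]:
    """Label each ranked candidate as TP, FP, TN, or FN.

    A candidate is predicted relevant when its 1-based rank is at or above
    the cutoff position; it is actually relevant when it appears in the
    review's reference list.
    """
    references = set(reference_ids)
    labels: Dict[str, str] = {}
    for rank, paper_id in enumerate(ranked_ids, start=1):
        predicted = rank <= cutoff
        actual = paper_id in references
        if predicted and actual:
            labels[paper_id] = "TP"
        elif predicted:
            labels[paper_id] = "FP"
        elif actual:
            labels[paper_id] = "FN"
        else:
            labels[paper_id] = "TN"
    return labels


# Toy example with made-up identifiers, already sorted by descending score.
ranked = ["p1", "p2", "p3", "p4"]
references = {"p1", "p4"}
labels = label_outcomes(ranked, references, cutoff=2)
print(labels)
```

Setting the cutoff to the number of target references would make the share of true positives above the cutoff equal to the R-Precision score.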

Cite this article

Portenoy, J., West, J.D. Constructing and evaluating automated literature review systems. Scientometrics 125, 3233–3251 (2020). https://doi.org/10.1007/s11192-020-03490-w

  • DOI: https://doi.org/10.1007/s11192-020-03490-w
