Abstract
Automated literature reviews have the potential to accelerate knowledge synthesis and provide new insights. However, a lack of labeled ground-truth data has made it difficult to develop and evaluate these methods. We propose a framework that uses the reference lists from existing review papers as labeled data, which can then be used to train supervised classifiers, allowing for experimentation and testing of models and features at a large scale. We demonstrate our framework by training classifiers using different combinations of citation- and text-based features on 500 review papers. We use the R-Precision scores for the task of reconstructing the review papers’ reference lists as a way to evaluate and compare methods. We also extend our method, generating a novel set of articles relevant to the fields of misinformation studies and science communication. We find that our method can identify many of the most relevant papers for a literature review from a large set of candidate papers, and that our framework allows for development and testing of models and features to incrementally improve the results. The models we build are able to identify relevant papers even when starting with a very small set of seed papers. We also find that the methods can be adapted to identify previously undiscovered articles that may be relevant to a given topic.
Notes
For the clustering, we used the cleaned version of the Web of Science network as described in the “Data” section. We used the network after cleaning the citations, but before removing papers with other missing metadata. This version of the network had 73,725,142 nodes and 1,164,650,021 edges.
Since every node is in exactly one cluster (even if the cluster is only one node), and the leaves of the hierarchy tree represent the nodes themselves, the minimum depth in the hierarchy is 2. In this case, the first level is the cluster the node belongs to, and the second level is the node.
We divide the standard measure of distance between nodes in a tree by the sum of the nodes’ depths. This is because, with hierarchical Infomap clustering, the total depth varies across the tree, so the absolute depths of a pair of nodes are not meaningful when describing the distance between them. For example, a pair of nodes in the same bottom-level cluster at level 5 of the hierarchy are no closer together than a pair of nodes in the same bottom-level cluster at level 2.
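To make this concrete, here is a minimal sketch of one plausible implementation, assuming each node’s position in the hierarchy is represented as a root-to-leaf tuple of cluster IDs (a representation of our own choosing, not the paper’s):

```python
# One plausible reading of the normalized tree distance described in this
# note; the representation (root-to-leaf tuples of cluster IDs) is an
# assumption for illustration, not the authors' exact code.

def tree_distance(path_u, path_v):
    """Standard tree distance: steps from u up to the lowest common
    ancestor (LCA), plus steps from the LCA back down to v."""
    lca_depth = 0  # length of the shared prefix = depth of the LCA
    for a, b in zip(path_u, path_v):
        if a != b:
            break
        lca_depth += 1
    return (len(path_u) - lca_depth) + (len(path_v) - lca_depth)

def normalized_tree_distance(path_u, path_v):
    """Divide by the sum of the two nodes' depths, so the distance is
    relative to how deep the nodes sit; absolute depth varies across
    the tree and is not meaningful on its own."""
    return tree_distance(path_u, path_v) / (len(path_u) + len(path_v))
```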
Machine learning experiments were conducted using scikit-learn version 0.20.3 running on Python 3.6.9.
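As a rough illustration of this setup (not the authors’ exact pipeline; the features and labels below are placeholders), a classifier is trained to distinguish reference-list papers from other candidates, and its scores are used to rank the candidates:

```python
# Minimal sketch of the supervised ranking setup; placeholder data, not
# the authors' exact features or hyperparameters.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)                        # e.g., cluster distance, PageRank, embedding similarity
y = (rng.rand(1000) < 0.05).astype(int)      # 1 if the candidate is in the review's reference list

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Rank candidates by the classifier's score, descending; the "Rank"
# column in the Appendix tables is a position in such an ordering.
scores = clf.predict_proba(X)[:, 1]
ranking = np.argsort(-scores)
```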
Ideally, for each review paper we would remove all nodes and links from after the year of that review, as well as the review paper itself, and cluster the resulting network. However, performing a separate clustering for each review paper would be computationally infeasible, so we performed the ranking and clustering only once. Any bias introduced by this should nevertheless be small, as the clustering method we use considers the overall flow of information across multiple pathways, which makes it robust to the removal of individual nodes and links in large networks.
We chose to report the best-performing model for each experiment rather than restricting ourselves to a single classifier type; this decision did not have a large effect on the results. We allowed the classifier type to vary because the review articles themselves differ, and we will continue to explore the nature of these differences in future work.
The actual feature used was the absolute difference between a paper’s publication year and the mean publication year of the seed papers.
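For illustration, the feature can be computed directly (variable names are hypothetical):

```python
import numpy as np

seed_years = np.array([2001, 2004, 2006])        # publication years of the seed papers (hypothetical)
candidate_years = np.array([1999, 2005, 2012])   # publication years of candidate papers (hypothetical)

# Absolute difference from the mean seed publication year.
year_feature = np.abs(candidate_years - seed_years.mean())
```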
We used the spaCy library (version 2.2.3) with a pretrained English language model (en_core_web_lg version 2.2.5).
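A minimal sketch of embedding a title with this model, assuming the en_core_web_lg package is installed (the authors’ exact usage may differ):

```python
import spacy

nlp = spacy.load("en_core_web_lg")
title = "Maps of random walks on complex networks reveal community structure"
vec = nlp(title).vector  # 300-dimensional vector: the average of the token vectors
```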
The models that had both network and title embedding features, but not publication year (“Cluster, PageRank, Embeddings”), performed worse in general than models with embeddings alone, with scores tending to be between 0.5 and 0.7. The reason for this is unclear.
Since the same random seeds (1, 2, 3, 4, 5) were used each time, the smaller seed sets are always subsets of the larger ones. For example, for a given review article and a given random seed, the 100 seed papers identified are all included in the set of 150; the set of 50 seed papers is included in both the set of 100 and the set of 150; and so on.
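One simple way to obtain this nesting property, sketched under our own assumptions rather than taken from the paper, is to shuffle the reference list once per random seed and take prefixes of increasing length:

```python
import random

def nested_seed_sets(references, sizes=(50, 100, 150), seed=1):
    """Shuffle once with the given seed and take prefixes, so each
    smaller seed set is a subset of every larger one."""
    rng = random.Random(seed)
    shuffled = list(references)  # copy; leave the input unchanged
    rng.shuffle(shuffled)
    return {k: shuffled[:k] for k in sizes}
```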
See Data and Methods at http://www.misinformationresearch.org for details.
References
Albarqouni, L., Doust, J., & Glasziou, P. (2017). Patient preferences for cardiovascular preventive medication: A systematic review. Heart, 103(20), 1578–1586. https://doi.org/10.1136/heartjnl-2017-311244.
Ammar, W., Groeneveld, D., Bhagavatula, C., Beltagy, I., Crawford, M., Downey, D., et al. (2018). Construction of the literature graph in Semantic Scholar. In Proceedings of the 2018 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies, volume 3 (industry papers) (pp. 84–91). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-3011.
Bae, S. H., Halperin, D., West, J., Rosvall, M., & Howe, B. (2013). Scalable flow-based community detection for large-scale network analysis. In 2013 IEEE 13th international conference on data mining workshops (pp. 303–310). https://doi.org/10.1109/ICDMW.2013.138.
Bastian, H., Glasziou, P., & Chalmers, I. (2010). Seventy-five trials and eleven systematic reviews a day: How will we ever keep up? PLOS Medicine, 7(9), e1000326. https://doi.org/10.1371/journal.pmed.1000326.
Beel, J., Gipp, B., Langer, S., & Breitinger, C. (2016). Research-paper recommender systems: A literature survey. International Journal on Digital Libraries, 17(4), 305–338. https://doi.org/10.1007/s00799-015-0156-0.
Belter, C. W. (2016). Citation analysis as a literature search method for systematic reviews. Journal of the Association for Information Science and Technology, 67(11), 2766–2777. https://doi.org/10.1002/asi.23605.
Chen, T. T. (2012). The development and empirical study of a literature review aiding system. Scientometrics, 92(1), 105–116. https://doi.org/10.1007/s11192-012-0728-3.
Cormack, G. V., & Grossman, M. R. (2014). Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In Proceedings of the 37th international ACM SIGIR conference on research & development in information retrieval, SIGIR ’14 (pp. 153–162). New York, NY, USA: ACM. https://doi.org/10.1145/2600428.2609601.
Djidjev, H. N., Pantziou, G. E., & Zaroliagis, C. D. (1991). Computing shortest paths and distances in planar graphs. In J. L. Albert, B. Monien, & M. R. Artalejo (Eds.), Automata, languages and programming (pp. 327–338). Berlin: Springer.
Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3–5), 75–174.
Greenhalgh, T., & Peacock, R. (2005). Effectiveness and efficiency of search methods in systematic reviews of complex evidence: Audit of primary sources. BMJ, 331(7524), 1064–1065. https://doi.org/10.1136/bmj.38636.593461.68.
Gupta, S., & Varma, V. (2017). Scientific article recommendation by using distributed representations of text and graph. In Proceedings of the 26th international conference on World Wide Web companion, WWW ’17 Companion (pp. 1267–1268). Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee. https://doi.org/10.1145/3041021.3053062.
Horsley, T., Dingwall, O., & Sampson, M. (2011). Checking reference lists to find additional studies for systematic reviews. Cochrane Database of Systematic Reviews. https://doi.org/10.1002/14651858.MR000026.pub2.
Janssens, A. C. J. W., & Gwinn, M. (2015). Novel citation-based search method for scientific literature: Application to meta-analyses. BMC Medical Research Methodology, 15(1), 84. https://doi.org/10.1186/s12874-015-0077-z.
Jha, R., Abu-Jbara, A., & Radev, D. (2013). A system for summarizing scientific topics starting from keywords. In Proceedings of the 51st annual meeting of the Association for Computational Linguistics (volume 2: short papers) (pp. 572–577).
Kanakia, A., Shen, Z., Eide, D., & Wang, K. (2019). A scalable hybrid research paper recommender system for Microsoft Academic. In The World Wide Web conference, WWW ’19 (pp. 2893–2899). Association for Computing Machinery. https://doi.org/10.1145/3308558.3313700.
Kong, X., Mao, M., Wang, W., Liu, J., & Xu, B. (2018). VOPRec: Vector representation learning of papers with text information and structural identity for recommendation. IEEE Transactions on Emerging Topics in Computing. https://doi.org/10.1109/TETC.2018.2830698.
Larsen, K. R., Hovorka, D., Dennis, A., & West, J. (2019). Understanding the elephant: The discourse approach to boundary identification and corpus construction for theory review articles. Journal of the Association for Information Systems, 20(7). https://doi.org/10.17705/1jais.00556.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval (1st ed.). New York: Cambridge University Press.
Miwa, M., Thomas, J., O’Mara-Eves, A., & Ananiadou, S. (2014). Reducing systematic review workload through certainty-based screening. Journal of Biomedical Informatics, 51, 242–253. https://doi.org/10.1016/j.jbi.2014.06.005.
Murphy, K. P. (2010). Machine Learning: A Probabilistic Perspective. Cambridge: MIT Press.
National Academies of Sciences, Engineering, and Medicine. (2017). Communicating science effectively: A research agenda. Washington, DC: National Academies Press.
O’Mara-Eves, A., Thomas, J., McNaught, J., Miwa, M., & Ananiadou, S. (2015). Using text mining for study identification in systematic reviews: A systematic review of current approaches. Systematic Reviews, 4(1), 1–22. https://doi.org/10.1186/2046-4053-4-5.
Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab. http://ilpubs.stanford.edu:8090/422.
Portenoy, J., & West, J. D. (2019). Supervised learning for automated literature review. BIRNDL, 2019, 9.
Robinson, K. A., Dunn, A. G., Tsafnat, G., & Glasziou, P. (2014). Citation networks of related trials are often disconnected: Implications for bidirectional citation searches. Journal of Clinical Epidemiology, 67(7), 793–799. https://doi.org/10.1016/j.jclinepi.2013.11.015.
Ronzano, F., & Saggion, H. (2015). Dr. Inventor framework: Extracting structured information from scientific publications. In International conference on discovery science (pp. 209–220). Springer.
Rosvall, M., & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4), 1118–1123.
Silva, F. N., Amancio, D. R., Bardosova, M., Costa, L. D. F., & Oliveira, O. N. (2016). Using network science and text analytics to produce surveys in a scientific topic. Journal of Informetrics, 10(2), 487–502. https://doi.org/10.1016/j.joi.2016.03.008.
Tsafnat, G., Dunn, A., Glasziou, P., & Coiera, E. (2013). The automation of systematic reviews: Would lead to best currently available evidence at the push of a button. BMJ, 346(7891), 8.
Wallace, B. C., Trikalinos, T. A., Lau, J., Brodley, C., & Schmid, C. H. (2010). Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinformatics, 11(1), 55. https://doi.org/10.1186/1471-2105-11-55.
Webster, J., & Watson, R. T. (2002). Analyzing the past to prepare for the future: Writing a literature review. MIS Quarterly, 26(2), xiii–xxiii.
Williams, K., Wu, J., Choudhury, S. R., Khabsa, M., & Giles, C. L. (2014). Scholarly big data information extraction and integration in the CiteSeerX digital library. In 2014 IEEE 30th international conference on data engineering workshops (pp. 68–73). IEEE. https://doi.org/10.1109/ICDEW.2014.6818305.
Yu, Z., Kraft, N. A., & Menzies, T. (2018). Finding better active learners for faster literature reviews. Empirical Software Engineering, 23(6), 3161–3186. https://doi.org/10.1007/s10664-017-9587-0.
Yu, Z., & Menzies, T. (2019). FAST2: An intelligent assistant for finding relevant papers. Expert Systems with Applications, 120, 57–71. https://doi.org/10.1016/j.eswa.2018.11.021.
Zitt, M. (2015). Meso-level retrieval: IR-bibliometrics interplay and hybrid citation-words methods in scientific fields delineation. Scientometrics, 102(3), 2223–2245. https://doi.org/10.1007/s11192-014-1482-5.
Acknowledgements
We thank Dr. Chirag Shah for helpful conversations around evaluation measures, and Clarivate Analytics for the use of the Web of Science data. We also thank three anonymous reviewers for constructive feedback. This work was facilitated through the use of advanced computational, storage, and networking infrastructure provided by the Hyak supercomputer system and funded by the STF at the University of Washington.
Appendix
Example of autoreview results
Below is a sample of results (random samples of true positives, false positives, true negatives, and false negatives) from the autoreview classifier using the references from Fortunato (2010), a review of community detection in graphs, with a random seed of 5. The “Rank” column gives the position of the candidate paper when candidates are ordered by the classifier’s score, descending. The false positives, while not in the original reference list, still appear relevant to the topic (e.g., “Overlapping Community Search for Social Networks”). The true negatives tend to have lower scores than the false negatives, suggesting that the assigned score tracks relevance even below the cutoff.
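For reference, a minimal sketch of the R-Precision measure used to evaluate these rankings (a helper of our own, with assumed argument names): with R papers in the review’s reference list, R-Precision is the fraction of the top R ranked candidates that appear in that list, so the cutoff mentioned above falls at rank R.

```python
def r_precision(ranked_ids, true_ids):
    """Fraction of the top-R ranked candidates (R = number of true
    references) that appear in the review's reference list."""
    relevant = set(true_ids)
    top_r = ranked_ids[:len(relevant)]
    return sum(1 for pid in top_r if pid in relevant) / len(relevant)
```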
True Positives

| Rank | Title | Year |
| --- | --- | --- |
| 10 | Role Models For Complex Networks | 2007 |
| 21 | Adaptive Clustering Algorithm For Community Detection In Complex Networks | 2008 |
| 24 | Random Field Ising Model And Community Structure In Complex Networks | 2006 |
| 50 | Bayesian Approach To Network Modularity | 2008 |
| 58 | The Effect Of Size Heterogeneity On Community Identification In Complex Networks | 2006 |
| 76 | Loops And Multiple Edges In Modularity Maximization Of Networks | 2010 |
| 97 | Synchronization Interfaces And Overlapping Communities In Complex Networks | 2008 |
| 118 | The Analysis And Dissimilarity Comparison Of Community Structure | 2006 |
| 119 | Searching For Communities In Bipartite Networks | 2008 |
| 208 | Epidemic Spreading In Scale-Free Networks | 2001 |
False Positives

| Rank | Title | Year |
| --- | --- | --- |
| 72 | Modularity From Fluctuations In Random Graphs And Complex Networks | 2004 |
| 94 | Clustering Coefficient And Community Structure Of Bipartite Networks | 2008 |
| 98 | Detecting Overlapping Community Structures In Networks | 2009 |
| 106 | Size Reduction Of Complex Networks Preserving Modularity | 2007 |
| 129 | Extracting Weights From Edge Directions To Find Communities In Directed Networks | 2010 |
| 146 | Identifying The Role That Animals Play In Their Social Networks | 2004 |
| 150 | Seeding The Kernels In Graphs: Toward Multi-Resolution Community Analysis | 2009 |
| 159 | Overlapping Community Search For Social Networks | 2010 |
| 162 | Modularity Clustering Is Force-Directed Layout | 2009 |
| 185 | Cartography Of Complex Networks: Modules And Universal Roles | 2005 |
True Negatives

| Rank | Title | Year |
| --- | --- | --- |
| 2967 | Graph Models Of Complex Information-Sources | 1979 |
| 120959 | Parallel Distributed Network Characteristics Of The Dsct | 1992 |
| 322251 | Hidden Semantic Concept Discovery In Region Based Image Retrieval | 2004 |
| 327308 | A Multilevel Matrix Decomposition Algorithm For Analyzing Scattering From Large Structures | 1996 |
| 394850 | Multiple-Model Approach To Finite Memory Adaptive Filtering | 1992 |
| 749175 | Statistical Computer-Aided Design For Microwave Circuits | 1996 |
| 943999 | Segmental Anhidrosis In The Spinal Dermatomes In Sjogrens Syndrome-Associated Neuropathy | 1993 |
| 1121787 | Rheological And Dielectrical Characterization Of Melt Mixed Polycarbonate-Multiwalled Carbon Nanotube Composites | 2004 |
| 1177851 | Explaining The Rate Spread On Corporate Bonds | 2001 |
| 1256866 | The Cyanobacterial Cell Division Factor Ftn6 Contains An N-Terminal Dnad-Like Domain | 2009 |
False Negatives

| Rank | Title | Year |
| --- | --- | --- |
| 259 | Heterogeneity In Oscillator Networks: Are Smaller Worlds Easier To Synchronize? | 2003 |
| 324 | Assessing The Relevance Of Node Features For Network Structure | 2009 |
| 385 | The Use Of Edge-Betweenness Clustering To Investigate Biological Function In Protein Interaction Networks | 2005 |
| 6605 | A Measure Of Betweenness Centrality Based On Random Walks | 2005 |
| 19863 | On Decomposition Of Networks In Minimally Interconnected Subnetworks | 1969 |
| 59900 | Objective Criteria For Evaluation Of Clustering Methods | 1971 |
| 139178 | Optimization With Extremal Dynamics | 2001 |
| 250583 | The Tie Effect On Information Dissemination: The Spread Of A Commercial Rumor In Hong Kong | 2002 |
| 281952 | Compartments Revealed In Food-Web Structure | 2003 |
| 1203248 | Dynamic Asset Trees And Portfolio Analysis | 2002 |