Abstract
Digital libraries of scholarly publications are now common and are an invaluable source of information for students, researchers, and practitioners. However, many digital libraries expose only article metadata, such as the title, author names, publication date, and abstract, for free; access to the full text requires payment. Because journal subscription charges are sometimes prohibitive, many important publications remain out of reach of researchers, especially in developing countries. Although open access (OA) publication addresses this issue, many research papers are still not available for free reading or download. In this paper, we present a novel approach to alleviate this problem: a technique to retrieve open access surrogates of a scholarly article when the article itself is not freely available in a digital library. Surrogates are articles that are semantically close to the original article, are written by the same author(s), and give valuable insights into the paper being searched for; they address the same or a very similar problem using the same or very similar techniques. Our focus on approximate matches of scholarly articles distinguishes our application from many academic search engines. We run our tool on a large corpus of computer science papers and compare the results with human judgment. Experimental results show that the tool can indeed identify relevant OA surrogates of access-restricted papers.
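The surrogate idea described above can be illustrated with a minimal sketch. It is not the tool evaluated in the paper: it assumes that "semantically close" can be approximated by TF-IDF cosine similarity over short text fields, and that author identity is an exact string match. The record fields (`title`, `authors`, `text`, `open_access`), the function names, and the sample records are all hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, an assumption of this sketch, so no term gets weight zero.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_surrogates(target, candidates, top_k=3):
    """Rank open access candidates that share at least one author with target."""
    pool = [
        c for c in candidates
        if c["open_access"] and set(c["authors"]) & set(target["authors"])
    ]
    if not pool:
        return []
    docs = [tokenize(target["text"])] + [tokenize(c["text"]) for c in pool]
    vectors = tfidf_vectors(docs)
    scored = [(cosine(vectors[0], vec), c["title"])
              for vec, c in zip(vectors[1:], pool)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical records, invented for this example only.
target = {
    "title": "Paywalled journal version",
    "authors": ["A. Author", "B. Author"],
    "text": "cryptographic engine protecting private keys against cold boot attacks",
    "open_access": False,
}
candidates = [
    {"title": "Open conference version",
     "authors": ["A. Author", "B. Author"],
     "text": "computing with private keys without ram to resist cold boot attacks",
     "open_access": True},
    {"title": "Unrelated paper by same author",
     "authors": ["A. Author"],
     "text": "ranking aggregation for collaborative filtering",
     "open_access": True},
    {"title": "Close topic, different authors",
     "authors": ["C. Other"],
     "text": "cryptographic engine against cold boot attacks",
     "open_access": True},
    {"title": "Closed paper by same authors",
     "authors": ["A. Author", "B. Author"],
     "text": "private keys and cold boot attacks",
     "open_access": False},
]

ranked = rank_surrogates(target, candidates)
for score, title in ranked:
    print(f"{score:.3f}  {title}")
```

In this sketch, the author filter and the open access filter are applied before any similarity is computed, so a topically close paper by other authors, or a closed paper by the same authors, is never returned. A realistic system would also need author name disambiguation and a richer similarity measure than exact-token TF-IDF.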
Notes
The OA links were manually verified on May 26, 2019.
Acknowledgements
This work is supported by the National Digital Library of India Project, sponsored by the Ministry of Human Resource Development (Grant No. F.No.16-7/2017-TEL), Government of India, at IIT Kharagpur. We thank Soumya Banerjee and Gopal Agarwal of the Department of Information Technology, Jadavpur University, for their assistance in preparing the dataset used in this work.
Appendix
This “Appendix” displays tables containing data and experimental results reported in the main text.
Cite this article
Sanyal, D.K., Bhowmick, P.K., Das, P.P. et al. Enhancing access to scholarly publications with surrogate resources. Scientometrics 121, 1129–1164 (2019). https://doi.org/10.1007/s11192-019-03227-4
DOI: https://doi.org/10.1007/s11192-019-03227-4