Enhancing access to scholarly publications with surrogate resources

Scientometrics

Abstract

Digital libraries of scholarly publications are common today and are an invaluable source of information for students, researchers, and practitioners. However, many digital libraries expose only article metadata, such as the title, author names, publication date, and abstract, for free; access to the full text requires paying a toll. Given that journal subscription charges are sometimes prohibitive, many important publications remain beyond the reach of researchers, especially in developing countries. While open access (OA) publication solves this issue, the hard reality is that many research papers are not currently available for free reading or download. In this paper, we present a novel approach to alleviate this problem: a technique to retrieve open access surrogates of a scholarly article when the article itself is not freely available in a digital library. Surrogates are articles that are semantically close to the original article, are written by the same author(s), and give valuable insights into the paper being searched for; they address the same or a very similar problem using the same or very similar techniques. Our focus on approximate matches of scholarly articles distinguishes our application from many academic search engines. We run our tool on a large corpus of computer science papers and compare the results with human judgment. Experimental results show that it can indeed identify relevant OA surrogates of access-restricted papers.
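
The idea of a surrogate described above can be illustrated with a minimal sketch. It is not the authors' implementation: the record fields (`title`, `abstract`, `authors`, `open_access`), the bag-of-words cosine similarity, the "at least one shared author" rule, and the `t_min` cutoff are all assumptions made for illustration. The `t_min` name is loosely modelled on the TMIN threshold mentioned in the appendix table captions, whose exact definition is not given here.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_surrogates(target, candidates, t_min=0.2):
    """Rank open access candidates that share an author with `target`.

    Assumption: similarity is plain bag-of-words cosine over title and
    abstract, and candidates scoring below `t_min` are discarded.
    """
    target_vec = Counter(tokens(target["title"] + " " + target["abstract"]))
    target_authors = set(target["authors"])
    scored = []
    for paper in candidates:
        if not paper["open_access"]:
            continue  # a surrogate must be freely readable
        if not target_authors & set(paper["authors"]):
            continue  # hypothetical rule: require at least one shared author
        vec = Counter(tokens(paper["title"] + " " + paper["abstract"]))
        score = cosine(target_vec, vec)
        if score >= t_min:
            scored.append((score, paper["title"]))
    # Highest similarity first
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Toy, made-up records used only to exercise the function.
    paywalled = {
        "title": "Fast graph search",
        "abstract": "We index graphs for fast search",
        "authors": ["A. Author"],
    }
    pool = [
        {"title": "Fast graph search preprint",
         "abstract": "We index graphs for fast search",
         "authors": ["A. Author", "B. Other"], "open_access": True},
        {"title": "Cooking with rice", "abstract": "Recipes",
         "authors": ["A. Author"], "open_access": True},
    ]
    for score, title in find_surrogates(paywalled, pool):
        print(f"{score:.2f}  {title}")
```

A production system would likely replace the bag-of-words similarity with stronger text representations and would need author name disambiguation; the sketch only shows how the three conditions (open access, shared authorship, semantic closeness) combine into a ranked list.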

Notes

  1. https://dl.acm.org.

  2. https://ieeexplore.ieee.org/Xplore/home.jsp.

  3. https://www.sciencedirect.com/.

  4. https://link.springer.com/.

  5. https://www.ndl.gov.in.

  6. https://www.inflibnet.ac.in/ess/index.php.

  7. https://www.research4life.org/.

  8. https://sci-hub.se/.

  9. https://arxiv.org/.

  10. https://www.researchgate.net/.

  11. https://www.academia.edu/.

  12. https://scholar.google.com.

  13. http://dev.www.isocdev.org/sites/default/files/07_1_1.pdf.

  14. https://academic.microsoft.com/.

  15. https://www.semanticscholar.org/.

  16. https://citeseerx.ist.psu.edu/.

  17. https://openaccessbutton.org/.

  18. https://openaccessbutton.org/about.

  19. https://unpaywall.org/.

  20. https://blog.acolyer.org/.

  21. https://dblp.uni-trier.de/.

  22. https://scikit-learn.org/.

  23. https://radimrehurek.com/gensim/.

  24. https://github.com/zalandoresearch/flair.

  25. The OA links were manually verified on May 26, 2019.

  26. https://github.com/soumyaxyz/illumine/.

Acknowledgements

This work is supported by the National Digital Library of India Project, sponsored by the Ministry of Human Resource Development (Grant No. F.No.16-7/2017-TEL), Government of India, at IIT Kharagpur. We thank Soumya Banerjee and Gopal Agarwal of the Department of Information Technology, Jadavpur University, for their assistance in preparing the dataset used in this work.

Author information

Corresponding author

Correspondence to Debarshi Kumar Sanyal.

Appendix

This “Appendix” displays tables containing data and experimental results reported in the main text.

Table 1 Papers and their top-1 OA surrogates in the dataset
Table 2 The top-1 OA surrogate that is dropped when TMIN is changed from 0.2 to 0.3
Table 3 The top-1 OA surrogates that are dropped when TMIN is changed from 0.3 to 0.4

About this article

Cite this article

Sanyal, D.K., Bhowmick, P.K., Das, P.P. et al. Enhancing access to scholarly publications with surrogate resources. Scientometrics 121, 1129–1164 (2019). https://doi.org/10.1007/s11192-019-03227-4

  • DOI: https://doi.org/10.1007/s11192-019-03227-4
