Enhancing access to scholarly publications with surrogate resources

Sanyal, Debarshi Kumar; Bhowmick, Plaban Kumar; Das, Partha Pratim; Chattopadhyay, Samiran; Santosh, T. Y. S. S.

doi:10.1007/s11192-019-03227-4

Published: 23 September 2019

Scientometrics, volume 121, pages 1129–1164 (2019)

Debarshi Kumar Sanyal ORCID: orcid.org/0000-0001-8723-5002¹,
Plaban Kumar Bhowmick²,
Partha Pratim Das³,
Samiran Chattopadhyay⁴ &
T. Y. S. S. Santosh³

Abstract

Digital libraries containing scholarly publications are common today. They are an invaluable source of information for students, researchers, and practitioners. However, many digital libraries expose only the article metadata, such as the title, author names, publication date, and abstract, for free; access to the full text requires paying a toll. Given that journal subscription charges are sometimes prohibitive, many important publications remain beyond the reach of researchers, especially in developing countries. While open access (OA) publication solves this issue, the hard reality is that many research papers are not currently available for free reading or download. In this paper, we present a novel approach to alleviate this problem: a technique to retrieve open access surrogates of a scholarly article when the article itself is not freely available in a digital library. Surrogates are articles that are semantically close to the original article, are written by the same author(s), and give valuable insights into the paper being searched for; they address the same or a very similar problem using the same or very similar techniques. Our focus on approximate matches of scholarly articles distinguishes our application from many academic search engines. We run it on a large corpus of computer science papers and compare the results with human judgment. Experimental results show that our tool can indeed identify relevant OA surrogates of access-restricted papers.
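The idea of a surrogate described in the abstract (an open access paper sharing at least one author with the paywalled paper and lying close to it in content) can be illustrated with a small sketch. This is not the authors' implementation: the TF-IDF cosine similarity, the record fields (`title`, `authors`, `text`, `is_oa`), the function names, and the `t_min` cutoff are all assumptions chosen for illustration, with `t_min` only loosely modelled on the TMIN threshold named in the appendix tables.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF keeps terms that occur in every document non-zero.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_surrogates(target, candidates, t_min=0.2):
    """Rank OA candidates that share an author with `target`.

    Hypothetical helper: keeps candidates whose similarity to the target
    is at least `t_min` and returns (score, title) pairs, best first.
    """
    pool = [
        c for c in candidates
        if c["is_oa"] and set(c["authors"]) & set(target["authors"])
    ]
    if not pool:
        return []
    vecs = tfidf_vectors([target["text"]] + [c["text"] for c in pool])
    scored = [(cosine(vecs[0], v), c["title"]) for v, c in zip(vecs[1:], pool)]
    return sorted((s for s in scored if s[0] >= t_min), reverse=True)


# Toy records, invented for this example.
paper = {
    "title": "Journal version",
    "authors": ["A", "B"],
    "text": "cold boot attack resistant cryptographic engine using cpu cache",
    "is_oa": False,
}
candidates = [
    {"title": "Preprint", "authors": ["A"], "is_oa": True,
     "text": "computing with private keys without ram cold boot attack cryptographic engine"},
    {"title": "Unrelated", "authors": ["A"], "is_oa": True,
     "text": "ranking indian engineering institutes research performance"},
    {"title": "Paywalled twin", "authors": ["A"], "is_oa": False,
     "text": paper["text"]},
    {"title": "Other group", "authors": ["C"], "is_oa": True,
     "text": paper["text"]},
]

for score, title in find_surrogates(paper, candidates):
    print(f"{score:.3f}  {title}")
```

Filtering on authorship and OA status before scoring keeps the similarity computation restricted to papers that could qualify as surrogates at all; raising `t_min` trades recall for precision, which is the kind of effect a threshold sweep would expose.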

Acknowledgements

This work is supported by the National Digital Library of India Project, sponsored by the Ministry of Human Resource Development, Government of India (Grant No. F.No.16-7/2017-TEL), at IIT Kharagpur. We thank Soumya Banerjee and Gopal Agarwal of the Department of Information Technology, Jadavpur University, for their assistance in preparing the dataset used in this work.

Author information

Authors and Affiliations

National Digital Library of India, Indian Institute of Technology Kharagpur, Kharagpur, 721302, India
Debarshi Kumar Sanyal
Center for Educational Technology, Indian Institute of Technology Kharagpur, Kharagpur, 721302, India
Plaban Kumar Bhowmick
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, 721302, India
Partha Pratim Das & T. Y. S. S. Santosh
Department of Information Technology, Jadavpur University, Kolkata, 700098, India
Samiran Chattopadhyay

Corresponding author

Correspondence to Debarshi Kumar Sanyal.

Appendix

This “Appendix” displays tables containing data and experimental results reported in the main text.

Table 1 Papers and their top-1 OA surrogates in the dataset

Table 2 The top-1 OA surrogate that is dropped when TMIN is changed from 0.2 to 0.3

Table 3 The top-1 OA surrogates that are dropped when TMIN is changed from 0.3 to 0.4

About this article

Cite this article

Sanyal, D.K., Bhowmick, P.K., Das, P.P. et al. Enhancing access to scholarly publications with surrogate resources. Scientometrics 121, 1129–1164 (2019). https://doi.org/10.1007/s11192-019-03227-4

Received: 06 June 2019
Published: 23 September 2019
Issue Date: November 2019
DOI: https://doi.org/10.1007/s11192-019-03227-4
