An in-text citation classification predictive model for a scholarly search system

Scientometrics

Abstract

We argue that citations in scholarly documents do not all perform the same function or carry equal importance. To address this, we worked with a corpus of over 21k citations from the Association for Computational Linguistics, from which a random sample of 465 citations was annotated by experts as either important or unimportant. We trained three machine-learning models on these annotated citations: Random Forest (RF), Support Vector Machine (SVM), and Decision Tree (DT). For the classification task, the models employed 15 novel features of three types: contextual, quantitative, and qualitative. The RF model outperformed the comparative model by 9.52%, achieving a 92% area under the precision-recall curve. We also present a prototype of a scientific publication search system that uses the RF prediction model for feature engineering. The prototype was applied to a dataset of 4138 full-text articles indexed by PLOS ONE, containing 31,839 unique references. The empirical evaluation shows that the proposed search system improves the visibility of a given scientific document by indexing it not only under its own index terms but also under terms from the cited works that are predicted to be important. Overall, this yields improved search results for user queries.
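The indexing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_index` and `search`, the toy documents, and the `is_important` callback (a stand-in for the trained RF classifier) are all assumptions made for illustration. The sketch only shows how a document could become retrievable through terms borrowed from the cited works flagged as important.

```python
from collections import defaultdict


def build_index(doc_terms, citations, is_important):
    """Build an inverted index (term -> set of doc ids).

    Each document is indexed under its own terms plus the terms of every
    cited work for which is_important(citing_id, cited_id) is true.
    is_important is a hypothetical stand-in for a trained classifier.
    """
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        expanded = set(terms)
        for cited_id in citations.get(doc_id, []):
            if is_important(doc_id, cited_id):
                expanded |= set(doc_terms.get(cited_id, ()))
        for term in expanded:
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that match every term in the query."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Toy data (hypothetical): document A cites B and C, but only the
# citation to B is predicted to be important.
doc_terms = {
    "A": ["citation", "classification"],
    "B": ["random", "forest"],
    "C": ["topic", "model"],
}
citations = {"A": ["B", "C"]}
important_pairs = {("A", "B")}

index = build_index(
    doc_terms, citations, lambda citing, cited: (citing, cited) in important_pairs
)

print(sorted(search(index, "random forest")))  # ['A', 'B']
print(sorted(search(index, "topic model")))    # ['C']
```

In this toy setup, document A is returned for "random forest" because it inherits the terms of B, while the unimportant citation to C adds nothing to A's index entry.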

Notes

  1. https://scikit-learn.org/stable/modules/cross_validation.html#stratified-k-fold.

Acknowledgements

This project was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, under grant No. RG-14-611-40. The authors, therefore, gratefully acknowledge DSR technical and financial support.

Author information

Corresponding author

Correspondence to Saeed-Ul Hassan.

About this article

Cite this article

Aljohani, N.R., Fayoumi, A. & Hassan, SU. An in-text citation classification predictive model for a scholarly search system. Scientometrics 126, 5509–5529 (2021). https://doi.org/10.1007/s11192-021-03986-z

  • DOI: https://doi.org/10.1007/s11192-021-03986-z
