Abstract
Scientists must navigate the deluge of information contained in the articles published in their research domain. Tools such as citation indexes link papers but do not indicate which passage of the cited paper is being referred to. In this study, we report our early attempts to design a framework for finding the sentences in a given article that are cited, a task we call citation linkage. We first discuss how we built a corpus annotated by domain experts. Then, using datasets consisting of all possible pairs of a citing sentence and a candidate sentence, where the annotators marked each pair either as not cited or as cited with a confidence rating from 1 (lowest) to 5 (highest), we built regression models whose outputs are used to predict the degree of similarity for any pair of sentences in a target paper. Even though the Pearson correlation coefficient between the predicted and expected values is low (0.2759 with a linear regression model), we show that the citation linkage goal can be achieved. When we use the learned models to rank the sentences in a target article by predicted score, 18 of the 22 papers have at least one relevant sentence ranked in the top k positions (k being the number of relevant sentences per paper), and 10 papers (45%) have a Normalized Discounted Cumulative Gain (NDCG) score greater than 71% and a Precision greater than 44%. Over all the papers, the mean NDCG is 47% and the Mean Average Precision is 29%.
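The evaluation described above can be illustrated with a minimal sketch. It assumes that each candidate sentence carries an annotator rating (0 for not cited, 1 to 5 for cited) and a predicted score from some regression model. The function names, the toy data, the log2 discount in the NDCG, and the use of precision at k are all illustrative assumptions; the abstract does not state the exact formulations used in the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_by_score(predicted):
    """Indices of candidate sentences, highest predicted score first."""
    return sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount (assumed form)."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))

def ndcg_at_k(predicted, ratings, k):
    """DCG of the top-k ranked sentences divided by the best achievable DCG."""
    ranked_gains = [ratings[i] for i in rank_by_score(predicted)[:k]]
    ideal = dcg(sorted(ratings, reverse=True)[:k])
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

def precision_at_k(predicted, ratings, k):
    """Fraction of the top-k ranked sentences that the annotators marked as cited."""
    top = rank_by_score(predicted)[:k]
    return sum(1 for i in top if ratings[i] > 0) / k

# Hypothetical target paper with six candidate sentences (toy data, not from the study).
ratings = [0, 5, 0, 3, 0, 0]                   # annotator ratings, 0 = not cited
predicted = [0.1, 0.9, 0.4, 0.2, 0.05, 0.3]    # scores from a hypothetical regression model
k = sum(1 for r in ratings if r > 0)           # k = number of relevant sentences

print("Pearson r:", pearson(predicted, ratings))
print("NDCG@k:", ndcg_at_k(predicted, ratings, k))
print("Precision@k:", precision_at_k(predicted, ratings, k))
```

In this toy case only one of the two relevant sentences reaches the top k positions, so the paper would still count as having at least one relevant sentence in the top k even though its precision is only one half.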
Acknowledgements
Support for this work was provided through a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant to Robert E. Mercer.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Houngbo, H., Mercer, R.E. (2017). Investigating Citation Linkage with Machine Learning. In: Mouhoub, M., Langlais, P. (eds) Advances in Artificial Intelligence. Canadian AI 2017. Lecture Notes in Computer Science, vol 10233. Springer, Cham. https://doi.org/10.1007/978-3-319-57351-9_10
DOI: https://doi.org/10.1007/978-3-319-57351-9_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57350-2
Online ISBN: 978-3-319-57351-9
eBook Packages: Computer Science, Computer Science (R0)