Investigating Citation Linkage with Machine Learning

Houngbo, Hospice; Mercer, Robert E.

doi:10.1007/978-3-319-57351-9_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10233)

Included in the conference series: Canadian Conference on Artificial Intelligence

Abstract

Scientists face the challenge of navigating the deluge of information contained in the articles published in their research domain. Tools such as citation indexes link papers but do not indicate which passage of the cited paper is being cited. In this study, we report our early attempts to design a framework for finding the sentences that are cited in a given article, a task we call citation linkage. We first discuss how we built a corpus annotated by domain experts. Then, using datasets consisting of all possible pairs of a citing sentence and a candidate sentence, in which the annotators deemed some candidates not to be cited and rated the others as cited with confidence ratings from 1 to 5 (lowest to highest), we built regression models whose outputs are used to predict the degree of similarity for any pair of sentences in a target paper. Even though the Pearson correlation coefficient between the predicted values and the expected values is low (0.2759 with a linear regression model), we show that the citation linkage goal can be achieved. When we use the learning models to rank the predicted scores for sentences in a target article, 18 papers out of 22 have at least one sentence ranked in the top k positions (k being the number of relevant sentences per paper), and 10 papers (45%) have Normalized Discounted Cumulative Gain (NDCG) scores greater than 71% and Precision greater than 44%. The mean average NDCG is 47% and the Mean Average Precision is 29% over all the papers.
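The ranking evaluation described in the abstract can be illustrated with a minimal sketch. It assumes the common DCG formulation with graded gains and a log2 rank discount, and it computes precision over the top k positions with k set to the number of relevant sentences, as the abstract states. The exact NDCG variant used in the paper is not given here, and the function names and toy numbers below are hypothetical, not taken from the paper.

```python
import math
from typing import Optional, Sequence


def _ranking(scores: Sequence[float]) -> list:
    """Indices of candidate sentences, sorted by predicted score (highest first)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount (assumed variant)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(scores: Sequence[float], ratings: Sequence[float]) -> float:
    """NDCG of the ranking induced by `scores`, using annotator ratings as gains."""
    ranked_gains = [ratings[i] for i in _ranking(scores)]
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


def precision_at_k(scores: Sequence[float], ratings: Sequence[float],
                   k: Optional[int] = None) -> float:
    """Fraction of the top-k ranked sentences that are relevant (rating > 0).

    When k is None, it defaults to the number of relevant sentences.
    """
    relevant = [r > 0 for r in ratings]
    if k is None:
        k = sum(relevant)
    if k == 0:
        return 0.0
    return sum(relevant[i] for i in _ranking(scores)[:k]) / k


# Toy data: predicted scores for four candidate sentences, with annotator
# confidence ratings (0 = not cited, 1 to 5 = cited).
predicted = [0.9, 0.1, 0.5, 0.3]
ratings = [0, 5, 3, 0]
print(precision_at_k(predicted, ratings))  # 0.5: one of the top two is relevant
print(ndcg(predicted, ratings))
```

Averaging these per-paper values over all target papers would give corpus-level figures of the kind reported in the abstract.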

Acknowledgements

Support for this work was provided through a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant to Robert E. Mercer.

Author information

Authors and Affiliations

Department of Computer Science, The University of Western Ontario, London, Ontario, Canada
Hospice Houngbo & Robert E. Mercer

Corresponding author

Correspondence to Hospice Houngbo.

Editor information

Editors and Affiliations

University of Regina, Regina, Saskatchewan, Canada
Malek Mouhoub
University of Montreal, Montreal, Québec, Canada
Philippe Langlais

About this paper

Cite this paper

Houngbo, H., Mercer, R.E. (2017). Investigating Citation Linkage with Machine Learning. In: Mouhoub, M., Langlais, P. (eds) Advances in Artificial Intelligence. Canadian AI 2017. Lecture Notes in Computer Science, vol 10233. Springer, Cham. https://doi.org/10.1007/978-3-319-57351-9_10

DOI: https://doi.org/10.1007/978-3-319-57351-9_10
Published: 11 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57350-2
Online ISBN: 978-3-319-57351-9
eBook Packages: Computer Science, Computer Science (R0)