Investigating Citation Linkage with Machine Learning

  • Conference paper
Advances in Artificial Intelligence (Canadian AI 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10233)

Abstract

Scientists must navigate a deluge of information in the articles published in their research domain. Tools such as citation indexes link papers but do not indicate which passage of the cited paper is being referred to. In this study, we report our early attempts to design a framework for finding the sentences in a given article that are cited, a task we call citation linkage. We first discuss how we built a corpus annotated by domain experts. The datasets consist of all possible pairs of a citing sentence and a candidate sentence; the annotators deemed some candidates not to be cited and marked the others as cited with a confidence rating of 1 to 5 (lowest to highest). With these datasets, we built regression models whose outputs are used to predict the degree of similarity for any pair of sentences in a target paper. Even though the Pearson correlation coefficient between the predicted and expected values is low (0.2759 with a linear regression model), we show that the citation linkage goal can be achieved. When we use the learned models to rank the sentences in a target article by predicted score, 18 of the 22 papers have at least one relevant sentence ranked in the top k positions (k being the number of relevant sentences per paper), and 10 papers (45%) have Normalized Discounted Cumulative Gain (NDCG) scores greater than 71% and Precision greater than 44%. Over all the papers, the mean NDCG is 47% and the Mean Average Precision is 29%.
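
The evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the linear gain, the log2 discount, and the reading of Precision as precision at k are all assumptions made for illustration, and the abstract does not state which NDCG variant was used. The sketch takes the predicted scores for the candidate sentences of one paper together with the annotators' ratings (0 for "not cited", 1 to 5 otherwise), and computes the Pearson correlation, NDCG over the full ranking, and precision in the top k positions, with k set to the number of relevant sentences.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between predicted and expected values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def rank_by_score(predicted):
    """Indices of candidate sentences, highest predicted score first."""
    return sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)


def dcg(gains):
    """Discounted cumulative gain (assumed form: linear gain, log2 discount)."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))


def ndcg(predicted, ratings):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    ranked_gains = [ratings[i] for i in rank_by_score(predicted)]
    ideal_dcg = dcg(sorted(ratings, reverse=True))
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def precision_at_k(predicted, ratings):
    """Fraction of the top k ranked sentences that are relevant,
    with k equal to the number of relevant sentences (rating > 0)."""
    k = sum(1 for r in ratings if r > 0)
    if k == 0:
        return 0.0
    top_k = rank_by_score(predicted)[:k]
    return sum(1 for i in top_k if ratings[i] > 0) / k


# Hypothetical toy data for one target paper (not taken from the corpus).
predicted_scores = [0.9, 0.1, 0.5, 0.3]
annotator_ratings = [0, 5, 3, 0]

print("Pearson r:", pearson(predicted_scores, annotator_ratings))
print("NDCG:", ndcg(predicted_scores, annotator_ratings))
print("Precision at k:", precision_at_k(predicted_scores, annotator_ratings))
```

On this toy data the correlation is negative, yet one of the two relevant sentences still lands in the top k positions, which mirrors the abstract's point that a ranking can be useful even when the correlation with the expected values is weak.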

Acknowledgements

Support for this work was provided through a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant to Robert E. Mercer.

Author information

Corresponding author

Correspondence to Hospice Houngbo.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Houngbo, H., Mercer, R.E. (2017). Investigating Citation Linkage with Machine Learning. In: Mouhoub, M., Langlais, P. (eds) Advances in Artificial Intelligence. Canadian AI 2017. Lecture Notes in Computer Science, vol 10233. Springer, Cham. https://doi.org/10.1007/978-3-319-57351-9_10

  • DOI: https://doi.org/10.1007/978-3-319-57351-9_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-57350-2

  • Online ISBN: 978-3-319-57351-9

  • eBook Packages: Computer Science, Computer Science (R0)
