
Assessing plausibility of scientific claims to support high-quality content in digital collections

Published in: International Journal on Digital Libraries

Abstract

This paper presents a formalization and extension of a novel approach to support high-quality content in digital libraries. Building on the concept of plausibility used in the cognitive sciences, we aim to judge the plausibility of new scientific papers in light of prior knowledge. In particular, our work proposes a novel assessment of scientific papers to qualitatively support the work of reviewers. To do this, our approach focuses on the key aspect of scientific papers: claims. Claims are sentences found in empirical scientific papers that state statistical associations between entities and correspond to the core contributions of the papers. Such claims appear, for instance, in medicine, chemistry, and biology, where the consumption of a drug, substance, or product causes an effect on some other entity, such as a disease or another drug or substance. To operationalize the notion of plausibility, we promote claims as first-class citizens for scientific digital libraries and exploit state-of-the-art neural embedding representations of text and topic models. As a proof of concept of the potential usefulness of this notion of plausibility, we report extensive experiments on scientific papers from the PubMed digital library.
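The core idea in the abstract — scoring a new claim by how well it fits prior knowledge expressed as text — can be illustrated with a toy sketch. This is a minimal bag-of-words stand-in, not the paper's actual method (which uses neural embeddings and topic models); the function names and example claims below are hypothetical, chosen only to show the shape of such a similarity-based plausibility score.

```python
from collections import Counter
from math import sqrt

def vectorize(sentence, vocabulary):
    # Toy bag-of-words vector; the paper uses neural embeddings instead.
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocabulary]

def cosine(u, v):
    # Standard cosine similarity, with a guard for zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def plausibility_score(new_claim, prior_claims):
    # Score a new claim by its maximum similarity to prior-knowledge claims.
    vocab = sorted({w for s in prior_claims + [new_claim] for w in s.lower().split()})
    new_vec = vectorize(new_claim, vocab)
    return max(cosine(new_vec, vectorize(c, vocab)) for c in prior_claims)

# Hypothetical prior-knowledge claims and a new claim to assess.
prior = [
    "smoking increases the risk of lung cancer",
    "salt intake increases gastric cancer risk",
]
claim = "salt consumption raises the risk of gastric cancer"
score = plausibility_score(claim, prior)
assert 0.0 <= score <= 1.0
```

A claim echoing prior knowledge scores high, while an unrelated sentence scores low; the paper's contribution is replacing this crude lexical overlap with learned semantic representations.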




Notes

  1. PubMed comprises more than 28 million citations for biomedical literature from MEDLINE, life science journals, and online books.

  2. More information about UMLS is available at https://www.nlm.nih.gov/research/umls/.



Author information


Corresponding author

Correspondence to José María González Pinto.


About this article


Cite this article

González Pinto, J.M., Balke, WT. Assessing plausibility of scientific claims to support high-quality content in digital collections. Int J Digit Libr 21, 47–60 (2020). https://doi.org/10.1007/s00799-018-0256-8
