Abstract
Biological data and knowledge bases increasingly rely on Semantic Web technologies and knowledge graphs for data integration, retrieval, and federated queries. We propose a solution for automatically semantifying biological assays. Our solution contrasts two formulations of automated semantification, labeling and clustering, which lie at opposite ends of the method-complexity spectrum. Modeling the problem according to the characteristics of the data, we find that the clustering solution significantly outperforms a state-of-the-art deep neural network labeling approach. This novel contribution rests on two findings: 1) a learning objective closely modeled after the data outperforms an alternative approach with sophisticated semantic modeling; 2) automatic semantification of biological assays achieves a high F1 score of nearly 83%, which, to our knowledge, is the first reported standardized evaluation of the task and offers a strong benchmark model.
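To make the clustering formulation concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not name the features or the clustering algorithm, so the sketch assumes TF-IDF vectors and a k-means variant that compares unit-length vectors by cosine similarity. The assay texts, the label strings, and all function names are hypothetical. A new assay is assigned to its nearest cluster and inherits the labels shared by every member of that cluster.

```python
import math
from collections import Counter

def normalise(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def fit_idf(docs):
    """Smoothed inverse document frequency for every term in docs."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.lower().split()))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def vectorise(text, idf):
    """Unit-length TF-IDF vector; terms unseen during fitting are ignored."""
    counts = Counter(t for t in text.lower().split() if t in idf)
    return normalise({t: c * idf[t] for t, c in counts.items()})

def cosine(a, b):
    """Dot product of two sparse vectors (cosine if both are unit length)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def mean_vector(vectors):
    """Re-normalised mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return normalise({t: w / len(vectors) for t, w in total.items()})

def kmeans(vectors, k, iterations=20):
    """k-means on unit vectors with deterministic farthest-first seeding."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    assignment = []
    for _ in range(iterations):
        new = [max(range(k), key=lambda j: cosine(v, centroids[j]))
               for v in vectors]
        if new == assignment:  # converged
            break
        assignment = new
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centroids[j] = mean_vector(members)
    return centroids, assignment

def semantify(text, idf, centroids, assignment, labels):
    """Return the labels common to all members of the nearest cluster."""
    vec = vectorise(text, idf)
    cluster = max(range(len(centroids)), key=lambda j: cosine(vec, centroids[j]))
    member_labels = [labels[i] for i, a in enumerate(assignment) if a == cluster]
    return set.intersection(*member_labels) if member_labels else set()

# Hypothetical toy data: four assay descriptions with illustrative labels.
assays = [
    "luciferase reporter assay measuring inhibition in HEK293 cells",
    "luciferase reporter assay measuring activation in HeLa cells",
    "fluorescence polarization binding assay with purified kinase protein",
    "fluorescence polarization binding assay with purified protease protein",
]
labels = [
    {"format: cell-based", "detection: luminescence"},
    {"format: cell-based", "detection: luminescence"},
    {"format: biochemical", "detection: fluorescence polarization"},
    {"format: biochemical", "detection: fluorescence polarization"},
]

idf = fit_idf(assays)
vectors = [vectorise(a, idf) for a in assays]
centroids, assignment = kmeans(vectors, k=2)
predicted = semantify("luciferase reporter assay in HepG2 cells",
                      idf, centroids, assignment, labels)
print(assignment, sorted(predicted))
```

In this toy setting the new assay lands in the cluster of luciferase assays and receives their shared labels. Taking the intersection of member labels is one possible design choice; a majority vote over cluster members would be a softer alternative.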
Supported by TIB Leibniz Information Centre for Science and Technology, the EU H2020 ERC project ScienceGRAPH (GA ID: 819536) and the ITN PERICO (GA ID: 812968).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Anteghini, M., D’Souza, J., dos Santos, V.A.P.M., Auer, S. (2022). Easy Semantification of Bioassays. In: Bandini, S., Gasparini, F., Mascardi, V., Palmonari, M., Vizzari, G. (eds) AIxIA 2021 – Advances in Artificial Intelligence. AIxIA 2021. Lecture Notes in Computer Science, vol 13196. Springer, Cham. https://doi.org/10.1007/978-3-031-08421-8_14
DOI: https://doi.org/10.1007/978-3-031-08421-8_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08420-1
Online ISBN: 978-3-031-08421-8
eBook Packages: Computer Science, Computer Science (R0)