Dissertation

Recognition of functional relationships between biomedical concepts in the scientific literature using text mining and machine learning


Bibliographic details
Year of publication: 2023
DOI: 10.6094/UNIFR/237508
URN: urn:nbn:de:bsz:25-freidok-2375089
Language: English
Natural sciences / Chemistry and allied sciences
Abstract
  • English
A tremendous amount of electronic research data is freely available online as open-access published literature, and this body of literature is growing rapidly. This huge, unstructured data contains a wealth of valuable information that is hidden and difficult to access; for example, scientists may struggle to identify specific articles of interest. Artificial intelligence-based text mining and machine learning approaches are used to process and analyze such large amounts of data and to identify and extract relevant information. Relevant information can be concepts as well as relationships between those concepts that answer questions of interest. Identifying biomedical concepts (e.g. compounds, proteins, diseases) and the functional relationships between them is one of the important domains in text mining and forms a key component of life science research. In drug discovery, knowledge of how small molecules associate with proteins plays a fundamental role in understanding how drugs or metabolites can affect cells, tissues, and human metabolism. This dissertation focuses on the automated identification of functional compound-protein relationships in the biomedical and life sciences literature using text mining and machine learning techniques. A new benchmark dataset of 2,613 sentences was created, containing 5,562 small-molecule and protein pairs that had previously been annotated with the help of text mining tools. The pairs were subsequently classified manually as functional or non-functional. Three machine learning approaches, the shallow linguistic kernel (SL), the all-paths graph kernel (APG), and BioBERT, were evaluated for classifying these relationships between small molecules and proteins. Furthermore, the benefit of the presence of interaction verbs in sentences that contain functionally related compound-protein pairs was evaluated.
On the benchmark dataset, the BioBERT approach achieved the best performance, with an F1-score of 86.0%, a precision of 85.2%, and a recall of 86.8%. The trained model was then applied to all titles and abstracts of the articles stored in the PubMed database. The results were processed and included in a new web server for literature research (CPRiL). The data allows novel query options, such as the calculation of the shortest relation path between any two biomolecules. Currently, CPRiL contains ~2.5 million unique functionally related compound-protein pairs, with ~460,000 unique names and synonyms of small molecules and ~90,000 unique proteins.
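
The two quantitative ideas in the abstract can be illustrated with a minimal sketch. It is not taken from the dissertation or from CPRiL: the function names, the breadth-first search, and the toy pairs (compound_A, protein_X, and so on) are hypothetical and serve only to show how an F1-score follows from precision and recall, and how a shortest relation path could be found once functional pairs are treated as edges of an undirected graph.

```python
from collections import deque


def f1_score(precision, recall):
    """Return the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def shortest_relation_path(pairs, start, goal):
    """Breadth-first search over an undirected graph built from pairs.

    Each (compound, protein) pair is treated as one edge. This is an
    illustrative assumption, not the method used by CPRiL.
    """
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sorted iteration keeps the result deterministic.
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Hypothetical toy data; these are placeholder names, not real entities.
pairs = [
    ("compound_A", "protein_X"),
    ("compound_B", "protein_X"),
    ("compound_B", "protein_Y"),
]

# Precision and recall as reported in the abstract.
print(round(100 * f1_score(0.852, 0.868), 1))  # 86.0
print(shortest_relation_path(pairs, "compound_A", "protein_Y"))
```

With the reported precision and recall, the harmonic mean rounds to the stated F1-score of 86.0%. On the toy graph, the path from compound_A to protein_Y runs through protein_X and compound_B.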


Files
License: Creative Commons CC BY-NC-ND 4.0 (Attribution, NonCommercial, NoDerivatives)
DFG
This contribution is freely accessible with the consent of the rights holder (publisher) under a (DFG-funded) Alliance or National License.
  • AmmarQaseem_Dissertation.pdf SHA256 checksum: aff01fdd83b7b53c74ed15008ce7a6b56ff66d379b00ee1ed3dcd36c52b92ee1
    Download (10.33 MB)

  • Description of the research data

    Examination details
    Faculty: Faculty of Chemistry and Pharmacy
    Supervisor: Günther, Stefan
    Reviewer: Backofen, Rolf
    Second reviewer: Bechthold, Andreas
    Date of examination: 7 June 2023