Dissertation

Recognition of functional relationships between biomedical concepts in the scientific literature using text mining and machine learning
Bibliographic details
DOI:
10.6094/UNIFR/237508
URN:
urn:nbn:de:bsz:25-freidok-2375089
Language:
English
Natural sciences / Chemistry and allied sciences
Year of publication: 2023
Abstract (English)
A tremendous and rapidly growing amount of electronic research data is freely available online as open-source published literature. This huge body of unstructured data contains a wealth of valuable information that is hidden and difficult to access; for example, scientists may find it difficult to identify specific articles of interest. Artificial intelligence-based text mining and machine learning approaches are being used to process and analyze such large amounts of data in order to identify and extract relevant information. Relevant information can be concepts as well as relationships between those concepts that answer questions of interest. Identifying biomedical concepts (e.g. compounds, proteins, diseases) and the functional relationships between them is one of the important tasks in text mining and forms a key component of life science research. In the drug discovery field, knowledge of how small molecules associate with proteins plays a fundamental role in understanding how drugs or metabolites can affect cells, tissues, and human metabolism.
This dissertation focuses on the automated identification of functional compound-protein relationships in biomedical and life sciences literature using text mining and machine learning techniques. A new benchmark dataset of 2,613 sentences was created, consisting of 5,562 small molecule and protein pairs that had previously been annotated with the help of text mining tools. The pairs were subsequently classified manually as functional or non-functional. Three machine learning approaches, the shallow linguistic kernel (SL), the all-paths graph kernel (APG), and BioBERT, were evaluated for classifying these relationships between small molecules and proteins. Furthermore, the benefit of the presence of interaction verbs in sentences that include functionally related compound-protein pairs was evaluated.
On the benchmark dataset, the BioBERT machine learning approach achieved the best performance, with an F1-score of 86.0%, a precision of 85.2%, and a recall of 86.8%. Moreover, the trained model was applied to all titles and abstracts of the articles stored in the PubMed database. The results were processed and included in a new web server for literature research (CPRiL). The data allow novel query options, such as the calculation of the shortest relation path between any two biomolecules. Currently, CPRiL contains ~2.5 million unique functionally related compound-protein pairs, with ~460,000 unique names and synonyms of small molecules and ~90,000 unique proteins.
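As a quick consistency check on the figures reported above, the F1-score can be recomputed as the harmonic mean of precision and recall. The following is a minimal sketch, not code from the dissertation; the function name `f1_score` is chosen here for illustration, and it assumes the three reported figures describe the same evaluation (if they were averaged over several evaluation runs, exact agreement is not guaranteed).

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for BioBERT.
precision = 0.852
recall = 0.868

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.1%}")  # prints "F1 = 86.0%"
```

The recomputed value agrees with the reported F1-score of 86.0% to one decimal place.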
Files
AmmarQaseem_Dissertation.pdf
SHA256 checksum: aff01fdd83b7b53c74ed15008ce7a6b56ff66d379b00ee1ed3dcd36c52b92ee1
(10.33 MB)
Examination details
Faculty:
Faculty of Chemistry and Pharmacy
Supervisor:
Günther, Stefan
Reviewer:
Backofen, Rolf
Second reviewer:
Bechthold, Andreas
Date of examination: 7 June 2023