Abstract

New discoveries in art conservation techniques and research are most commonly published as traditional full-text publications. Corpora of such publications typically support searching only via metadata (e.g., title, authors, or keywords) and full text. As a result, it is difficult to discover valuable information about the chemical processes, experimental results, or preservation treatments associated with the conservation of paintings from a specific genre. This article addresses this problem by focusing on the extraction of structured data, compliant with a pre-defined ontology, from a distributed corpus of publications about painting conservation. Our extraction method involves a unique combination of named entity recognition (using gazetteer-based and machine learning-based methods) followed by relationship extraction (using rule-based and machine learning-based methods). The resulting structured data are stored in a Resource Description Framework (RDF) triple store, and a Web-based graphical user interface enables SPARQL querying, retrieval, and display of the search results. The results from applying our techniques to a corpus of publications on art conservation indicate that our approach achieves higher precision and recall in extracting named entities and relations from publications than existing alternative approaches.
