Vesper: Visualising species archives

doi:10.1016/j.ecoinf.2014.08.004

Ecological Informatics

Volume 24, November 2014, Pages 132-147

https://doi.org/10.1016/j.ecoinf.2014.08.004

Under a Creative Commons license

open access

Highlights

• Darwin Core based species archives are visualised to reveal data quality issues.
• The visualisations can be used by data producers at a pre-publication stage.
• Dumping grounds for both taxonomic and geographic data are revealed.

Abstract

Vesper (Visual Exploration of SPEcies-referenced Repositories) is a tool that visualises Darwin Core Archive (DwC-A) datasets, and aims to reduce the time and effort biologists spend ascertaining the quality of the data they generate or use. Currently, DwC-A quality checking is limited to tabular reports of data ‘existence’ and of compliance with DwC-A format guidelines, produced by the online DwC-A validator and reader. Whilst these tools thoroughly examine the presence of data and the correctness of data structure against the DwC-A schema, they give no insight into the underlying quality of the data itself.

Built on top of the D3 JavaScript library, Vesper analyses and displays DwC-A datasets in three fundamental dimensions (taxonomic, geographic and temporal), with a visualisation dedicated to each. By viewing a dataset's composition in these dimensions, a data consumer can judge whether it is suitable for the tasks or analyses they have in mind, whilst a data provider can identify where a dataset they have constructed falls short in data quality, for example where it contains obviously incorrect data such as the classic longitude inversion that places North American specimens in China. A further visualisation of the taxonomic dimension can reveal the subtaxa distribution of reference taxonomies, whilst a simple table shows whether certain data types are present in each record, giving an overall data ‘existence’ profile for the dataset. Selections made within one visualisation are linked to the other visualisations of that dataset, so users can discover whether data quality issues are restricted to identifiable sub-portions of the dataset.
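The longitude inversion mentioned above lends itself to a simple automated check. The following is a minimal sketch, not Vesper's own implementation: it assumes records are plain dictionaries keyed by the Darwin Core terms `decimalLatitude` and `decimalLongitude`, and that the expected region is supplied as a hypothetical bounding box (`NORTH_AMERICA`, with rough illustrative bounds). A record is flagged when it lies outside the box but would fall inside it if the sign of its longitude were flipped.

```python
# Hypothetical bounding box (min_lon, min_lat, max_lon, max_lat) roughly
# covering North America; the bounds are illustrative, not authoritative.
NORTH_AMERICA = (-170.0, 5.0, -50.0, 85.0)


def in_box(lon, lat, box):
    """Return True if the point lies inside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def suspected_longitude_inversions(records, box=NORTH_AMERICA):
    """Return records outside the box that fall inside it once longitude is negated."""
    flagged = []
    for rec in records:
        try:
            lon = float(rec["decimalLongitude"])
            lat = float(rec["decimalLatitude"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric coordinates cannot be tested
        if not in_box(lon, lat, box) and in_box(-lon, lat, box):
            flagged.append(rec)
    return flagged


records = [
    # Plausible North American record (Colorado).
    {"id": "a", "decimalLatitude": "40.0", "decimalLongitude": "-105.0"},
    # Same point with the longitude sign flipped: it lands in China.
    {"id": "b", "decimalLatitude": "40.0", "decimalLongitude": "105.0"},
    # London: outside the box whichever sign the longitude takes.
    {"id": "c", "decimalLatitude": "51.5", "decimalLongitude": "-0.1"},
]
flagged_ids = [r["id"] for r in suspected_longitude_inversions(records)]
```

A check like this only suggests candidates for review. A visual display remains useful because it also exposes errors that no single rule anticipates.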

Vesper can handle datasets of a million entities client-side within a browser through judicious data filtering, because many of the data types within individual records are not needed to judge the geographic, temporal or taxonomic distribution and extent of a dataset. Thus, many of the more verbose fields in the file can simply be passed over during an initial data decompression stage. Furthermore, it can provide limited name and structure matching of a dataset against DwC-A packaged reference taxonomies to indicate data quality relative to sources outside the archive. A selection of annotated example scenarios shows how Vesper can reveal data quality issues in Darwin Core Archives.
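The field-filtering idea can be illustrated with a short sketch. It is written in Python rather than the browser-side JavaScript that Vesper uses, and it makes several assumptions that are not taken from the paper: the core data file is named `occurrence.txt`, it is tab-delimited with a header row, and the hypothetical `WANTED` tuple lists the only Darwin Core terms needed. A real archive declares its file names, delimiters and column mapping in `meta.xml`, which this sketch does not read. Rows are streamed out of the zip file and the unwanted columns are discarded as each row is parsed, so the verbose fields never accumulate in memory.

```python
import csv
import io
import zipfile

# Assumed subset of Darwin Core terms, sufficient for taxonomic,
# geographic and temporal summaries.
WANTED = ("scientificName", "decimalLatitude", "decimalLongitude", "eventDate")


def read_selected_fields(archive, core_name="occurrence.txt", wanted=WANTED):
    """Stream the core file from a zip archive, yielding only the wanted columns."""
    with zipfile.ZipFile(archive) as zf:
        with zf.open(core_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            header = next(reader)
            # Positions of the wanted terms that are actually present.
            keep = [(i, name) for i, name in enumerate(header) if name in wanted]
            for row in reader:
                yield {name: row[i] for i, name in keep if i < len(row)}


# Build a tiny in-memory archive so the sketch can run without external files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "occurrence.txt",
        "id\tscientificName\tdecimalLatitude\tdecimalLongitude\teventDate\toccurrenceRemarks\n"
        "1\tPuma concolor\t40.0\t-105.0\t1998-06-01\tLong free-text note that is never kept\n",
    )
buffer.seek(0)
rows = list(read_selected_fields(buffer))
```

Each yielded dictionary holds only the four selected terms; the `id` and `occurrenceRemarks` columns are dropped during reading.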

Keywords

Information visualisation

Data quality

Darwin Core Archive

Open source

Biodiversity