Elsevier

Ecological Informatics

Volume 28, July 2015, Pages 19-28

WaterML R package for managing ecological experiment data on a CUAHSI HydroServer

https://doi.org/10.1016/j.ecoinf.2015.05.002

Highlights

  • WaterML is a new R package for analysis of online experimental data from CUAHSI HydroServers inside the R environment.

  • We created an API for JSON based data upload from R to the PHP/Linux based HydroServer Lite software using an R script.

  • We tested the WaterML R package using real time data from wirelessly connected data loggers in an ecological experiment site.

  • The WaterML R package serves as a model for connecting scripting environments with online sensor data management systems.

Abstract

We present the design and development of a new WaterML R package that provides access to the Consortium of Universities for the Advancement of Hydrologic Science (CUAHSI) Hydrologic Information System (HIS) HydroServer as a means for storing and managing data. The new WaterML R package is presented in terms of its functional requirements and design, with the express goal of providing support for four core web methods defined by the HydroServer WaterOneFlow web services specification. The system is tested in the context of data collected as part of a large ecological manipulation experiment. The resulting system allows research scientists to use a familiar statistical computation environment, R, together with the open source HydroServer software (for data archival and sharing). We also developed a new HydroServer data upload web service to facilitate data upload to a PHP version of HydroServer called HydroServer Lite directly from the WaterML R package presented here. Using the WaterML R package, the user can retrieve and analyze data from HydroServers of multiple organizations that are listed in the CUAHSI Water Data Center catalog and the Global Earth Observation System of Systems data catalog of the World Meteorological Organization. The new HydroServer Lite data upload API simplifies the upload of data to HydroServer Lite directly from the R environment.

Introduction

Essential to the successful execution of any data intensive research study is the management of data and data collection in a formal, organized and reproducible manner. Ecological data collection and storage have evolved in many instances into a large, complex task that demands automation so that researchers can accurately acquire, manage, analyze, and verify their data over the long term (Michener and Jones, 2012). Small independent ecology lab groups, and the scientists who lead them, must focus not only on their own unique scientific questions and procedures but also on learning data processing techniques and database development so that their work is not thwarted in the long term by poor storage or access (Conner et al., 2013a). Those who are performing an experiment and those who hope to understand the data coming from the experiment may not have the financial means necessary to develop their own data management system. Shared data hosting websites using international standards (WaterML) and open-source software solutions for data management (HydroServer for Windows or HydroServer Lite for Linux), archiving (ODM database), and publication (WaterOneFlow web service) can be an effective way for independent researchers to remain competitive in a future world of data deluge in scientific research. The Consortium of Universities for the Advancement of Hydrologic Science (CUAHSI) provides support for scientists and independent ecology lab groups and helps them to manage and organize their experimental data using open-source technology.

A particular problem faced by ecological researchers is integrating a data management system with a computational analysis environment such as Matlab, Stata, or R. A common feature of these computational analysis environments is that they provide capabilities for exploratory analysis (plots, graphs), and statistical inference (hypothesis testing). Typically data analysis steps are all recorded in a script, making the steps reproducible (Gentleman and Lang, 2007). A system that links computational analysis software with standards-based cloud data management would allow researchers to automate the retrieval of raw sensor data or previously processed data directly from the data management system into their analytical environment. The system would also allow researchers to post data and analysis results back to the system for both archival and sharing purposes.

A number of existing tools have been constructed that meet various parts of the overarching goals stated above. For example, two R packages for retrieving water quantity and quality data from the USGS National Water Information System (NWIS) have recently been introduced in the Comprehensive R Archive Network (CRAN) package repository: the “dataRetrieval” and “waterData” packages (De Cicco and Hirsch, 2013, Ryberg and Vecchia, 2012). These R packages provide useful data download functions that could support ecological research that relies on U.S. national water information. However, they are not intended for data upload or for managing data associated with laboratory research.

For laboratory research, it is possible to use database drivers that link the R statistical software package to major relational database platforms. The “RObsDat” package (Reusser, 2014) is one such driver that is specifically designed for connecting to any environmental observations database compliant with the Observations Data Model (ODM) schema using the Structured Query Language (SQL) mechanism. Other more general-purpose examples of packages that link R to a relational database using SQL are “RMySQL” (James and DebRoy, 2012) and “RSQLite” (James and Falcon, 2011).

The drivers noted above require a direct SQL connection to an associated ODM database using an IP address and port number. The problem with a direct SQL connection using IP address and port number is that institutional firewalls block the necessary ports in most cases, making the connection only possible inside of the institution's local network. In the common case of multi-institutional collaborations, such firewalls can restrict direct database access—hence another approach is required. Also, not all HydroServer instances use the ODM database schema. A solution to these problems is to abstract the physical database by only exposing a layer of web services (also called web application programming interface or web API). The web API usually uses the Hyper Text Transfer Protocol (HTTP) to pass information between a client tool and a database using JavaScript Object Notation (JSON) or Extensible Markup Language (XML) encoded text. Such a web service solves the firewall issue and allows access to the data across institutions, though compared to the expressive power of SQL, the web service typically only enables a limited set of pre-defined queries. If well defined (i.e. as in the case of the CUAHSI HIS web services), this limited subset of queries can readily satisfy the requirements of most database management use cases.
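The request side of such a web API can be illustrated with a minimal base R sketch. The base URL, the method name `values`, and the parameter names `site`, `variable`, and `start` are hypothetical placeholders chosen for illustration; they are not the actual WaterOneFlow or HydroServer Lite interface. The point is only that the client sends a pre-defined query over HTTP rather than opening an SQL connection on a database port.

```r
# Sketch: build an HTTP GET query for a hypothetical web API endpoint.
# The base URL, method name, and parameter names are placeholders and do
# not describe the real WaterOneFlow or HydroServer Lite interface.
build_request_url <- function(base_url, method, params) {
  # Percent-encode each value so characters such as ":" or spaces are safe.
  encoded <- vapply(params, function(value) {
    utils::URLencode(as.character(value), reserved = TRUE)
  }, character(1))
  # Join name=value pairs with "&" to form the query string.
  query <- paste0(names(params), "=", encoded, collapse = "&")
  paste0(base_url, "/", method, "?", query)
}

url <- build_request_url(
  "http://example.org/api",
  "values",
  list(site = "lab:plot 1", variable = "temperature", start = "2014-06-01")
)
print(url)
```

Because the request travels over ordinary HTTP, it is not affected by a firewall rule that blocks database ports, which is the property the paragraph above relies on.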

Within the environmental sciences, the most widely used standards for communicating with a database using web services include the Sensor Observation Service (SOS) and the WaterOneFlow web service (Ames et al., 2012, Tarboton et al., 2009, Valentine et al., 2007). SOS has received widespread adoption in ocean and marine sciences. In hydrology, the WaterOneFlow service is widely used, with around 100 public database servers registered worldwide at http://HISCentral.cuahsi.org/ including ecological research labs (Conner et al., 2013b, Whitenack, 2010). The “sos4r” package (Nüst et al., 2011) facilitates connecting to the SOS web service from R.

Another tool, HydroR, is a plugin for the open source HydroDesktop software that can analyze data retrieved via WaterML (Horsburgh and Reeder, 2014). This tool requires installation of separate software, HydroDesktop (Ames et al., 2012), to perform the search, discovery, and download of data before it can be analyzed in R. HydroDesktop and HydroR require the Windows operating system, which can be a disadvantage for users of other operating systems. No software tools presently exist to push analytical results from R directly to the CUAHSI HIS via web services.

The “RObsDat” package overcomes several of these challenges since it is cross-platform and can both read and write data in an ODM formatted database using an SQL database connection. The key limitation of the “RObsDat” package is that it is not suitable for situations where multi-institution access to a single HydroServer is required and where institutional firewalls block direct connections to SQL databases.

It is useful to note that out of the 98 data management systems registered on the HIS Central catalog, none provides direct back-end access to the entire database but all provide a method to query their database through the WaterML web services API. In short, although WaterML is a widely used international standard, there is currently no easy method of accessing the WaterML web service from the R environment.

The remainder of this paper presents the design, development, and testing of a new WaterML R package that addresses these problems by supporting download of data directly from any HydroServer instance and upload of data to a special version of HydroServer Lite using R and a web service interface. The data values from multiple sites or variables are retrieved as an R “data frame” that can be directly used in R. Because it is integrated directly into the R statistical software, our package can be installed and used on any operating system with an internet connection. To test the usability of the package, we present a case study using exploratory and statistical analysis of continuous observation data from a large scale ecological manipulation experiment.
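As a rough sketch of the multi-site retrieval described above, the helper below gathers values for several sites into a single data frame. It assumes the package's `GetValues` function accepts `siteCode`, `variableCode`, `startDate`, and `endDate` arguments; that signature should be checked against the package documentation. The helper name `collect_values`, the `fetch` argument, the stand-in fetcher, and its column names are all hypothetical and exist only so the example runs without a network connection.

```r
# Sketch: collect values for several sites into one data frame.
# `fetch` defaults to WaterML::GetValues (assumed signature). R evaluates the
# default lazily, so the package is only needed when no replacement is given.
collect_values <- function(server, site_codes, variable_code,
                           start_date, end_date,
                           fetch = WaterML::GetValues) {
  per_site <- lapply(site_codes, function(site) {
    values <- fetch(server, siteCode = site, variableCode = variable_code,
                    startDate = start_date, endDate = end_date)
    if (is.null(values) || nrow(values) == 0) return(NULL)
    values$site <- site  # remember which site each row came from
    values
  })
  # Drop sites that returned nothing, then stack the rest row-wise.
  per_site <- Filter(Negate(is.null), per_site)
  if (length(per_site) == 0) return(NULL)
  do.call(rbind, per_site)
}

# Offline stand-in for the web service call. The values and column names
# are placeholders, not real measurements or the package's actual output.
fake_fetch <- function(server, siteCode, variableCode, startDate, endDate) {
  data.frame(
    time = as.POSIXct(c("2014-06-01 00:00:00", "2014-06-01 01:00:00"),
                      tz = "UTC"),
    DataValue = c(1.5, 2.5)
  )
}

combined <- collect_values("http://example.org/placeholder", c("A", "B"),
                           "V", "2014-06-01", "2014-06-02",
                           fetch = fake_fetch)
print(combined)
```

With the package installed and a real WaterOneFlow service address, omitting the `fetch` argument would route the same loop through `WaterML::GetValues`.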

The work presented here is a significant extension of previous work that presented the PHP based HydroServer Lite software tool as a system for managing ecological experiment data (Conner et al., 2013a). The first issue addressed by our work is simplifying access to HydroServer data through the WaterOneFlow web service from within the R environment. Using our WaterML R package, any HydroServer that implements the WaterOneFlow web service can be accessed from R. The second issue addressed by our work is enabling the upload of analytical results from R to the HydroServer Lite through a web service application program interface (API). While the data editing capabilities currently only work with instances of the HydroServer Lite software, the work presented here serves as a model for future efforts for creating connections between analytical software tools and distributed, web services based data management systems. Also, the HydroServer Lite data upload capabilities are expected to be added to the general HydroServer software stack in the near future, which will immediately enable use of this WaterML R package for both data download and upload on any CUAHSI HydroServer.

Section snippets

Software design and development

In our design we chose an open source solution using the MySQL relational database and HydroServer Lite software. The HydroServer Lite software installation package and source code are available on the website: http://hydroserverlite.codeplex.com. It can be installed and hosted on any web server or shared webhosting account that supports PHP (version 5 or higher) and MySQL. We hosted the database and web server on the shared webhosting site http://worldwater.byu.edu, which provides web space

Software design and development results

The WaterML R package has been published on CRAN (http://cran.r-project.org/web/packages/WaterML/). Before publication on CRAN, the code had to pass the automatic check of each function, a check of the documentation of all function parameters, a check of the code examples in the documentation, and a review by the CRAN team. The WaterML R package can be installed from R using the command:

install.packages("WaterML")

The WaterML R package also can be found in the “Install

Discussion and conclusions

Ecological data management combined with data analysis presents a number of challenges for practicing researchers. In particular, there is a need for tools that link commonly used desktop analytical software tools with distributed cloud based data sharing networks. This paper presents a new WaterML R package that crosses the divide between local computation and network based data sharing. The new package, described here, has been successfully deployed in the context of a case study for a large

Acknowledgments

We gratefully acknowledge the assistance of the Bureau of Land Management Salt Lake City Field Office who conducted NEPA and carried out the experimental burn treatments. The experimental project was funded by USDA NIFA grant: 2010-38415-21908 and the wireless sensor network was generously provided by Decagon Devices, Inc. Additional funding was provided by the Charles Redd Center for Western Studies. The World Water Project at BYU was funded by a Mentored Environment Grant from the BYU Office
