Environmental chemistry through intelligent atmospheric data analysis

https://doi.org/10.1016/j.envsoft.2009.12.001

Abstract

Here we present a new open-source software package designed to facilitate the analysis of atmospheric data, with emphasis on data mining applied to single-particle mass spectrometry data from aerosol particles. The software package, Enchilada (Environmental Chemistry through Intelligent Atmospheric Data Analysis), is designed to seamlessly handle large datasets, to allow for temporal aggregation of data from many instruments, and to integrate techniques such as clustering (K-means, K-medians, and Art-2a), labeling of peaks in mass spectra, and temporal correlations of multiple datasets from multiple instrument types. The software, which continues to be developed and improved, provides users with a single package to integrate data from multiple mass spectrometer systems (ATOFMS, PALMS, SPASS, Q-AMS) as well as any time-based data stream. This paper describes the software in detail and gives examples of the analysis methods incorporated into it.

Introduction

Mass spectrometry is the name given to a collection of analytical techniques that measure the mass-to-charge ratio (m/z) of gas-phase ions generated from a wide variety of material types (air, biological tissue, minerals, etc.). These ions can be formed from any phase, using a diverse array of methods such as electron beams, voltage gradients, lasers, chemical reactions, and others. The analysis of the ions is done by separating or filtering them using electromagnetic fields and there are multiple methods that are commonly used, all of which are termed “mass spectrometry”. For measurements of complex systems made up of many components, mass spectrometry experiments are routinely coupled in an on-line manner to the output of a chemical separation device, for example a gas or liquid chromatograph, which delivers the constituent pure components to the mass spectrometer, separated in time. This simplifies the resulting mass spectrum. In cases where this pre-separation is not possible, each chemical compound in the sample can generate anywhere from zero to tens of distinct ions (from the intact molecule or from fragments or rearrangements of it, or formed via gas-phase reactions) which are overlaid in the resulting mass spectrum. Thus, the details of mass spectrometry experiments and their results can be complex.

There has been an explosion in use of mass spectrometers in recent decades (Grayson, 2005), with systems becoming smaller and, in some cases, portable. As ionization sources have been discovered or applied in new ways, a wider array of samples can be converted into gas-phase ions for analysis. And as the need for detailed information about chemical composition has emerged as a critical scientific challenge in many areas, mass spectrometers are attractive as they provide near-universal detection. Particularly in the realm of analysis of biological molecules (Dass, 2000) and in the analysis of environmental samples (Richardson, 2006), mass spectrometry has emerged as a technique of choice. Often, the analysis is done in situ rather than in the laboratory, as more field-deployable mass spectrometer systems are developed (Suess and Prather, 1999, Badman and Cooks, 2000). Mass spectrometers are routinely used to analyze air and water in situ and in real-time. Solid samples, such as soil (organic and mineral components) and food products can also be analyzed in situ, leading to the potential to detect contamination where it exists without traditional sample workup. These advances in instrument design have also been accompanied by increases in sensitivity, making analysis of very small samples possible.

In the past decade, mass spectrometry has been increasingly used for the analysis of airborne aerosol particles, which are solids or liquids and have sizes from a few nanometers to tens of micrometers in diameter (Suess and Prather, 1999, Nash et al., 2006, Canagaratna et al., 2007, Hinz and Spengler, 2007, Murphy, 2007). The specific details of these instruments vary. Many systems are transportable and can be operated in the field or on aircraft for in situ measurements of atmospherically relevant aerosol particles. These could be particles emitted directly from a particle source that is important to air pollution; on-line, real-time and in situ analysis of particles in the ambient atmosphere; or particles formed through chemistry occurring under atmospherically relevant conditions in laboratory settings. The analysis of these particles is typically done by introducing particle-laden air directly into the mass spectrometer and sampling the particles directly from the gas stream. The particles are ionized and a mass spectrum is obtained. Common instruments do this in one of two ways: by analyzing individual particles, such that each mass spectrum obtained represents the overlaid chemical components of a single aerosol particle; or by analyzing a small ensemble of particles, such that the mass spectrum obtained represents the overlaid chemical components of the small ensemble of particles, sampled in a short period of time. In each of these cases the instrumental requirements are strict, as the instrument must be sensitive enough to generate usable data from a very small amount of material (a single spherical particle with a diameter of 1.0 μm and a material density of 1.0 g/cm³ contains ∼0.5 pg of material). Analysis of the resulting data is complicated by the fact that atmospheric particles typically contain multiple compounds, and therefore the resulting mass spectra are generated from mixtures of small amounts of these components.
Some instruments generate and/or analyze only one polarity of ions for mass analysis, but numerous instrumental setups are designed to detect both positive and negative ions from individual particles, leading to an even more complex dataset (Suess and Prather, 1999, Nash et al., 2006, Hinz and Spengler, 2007, Murphy, 2007). An example of the mass spectrometric data acquired from a single particle using a commercial single-particle mass spectrometer is presented in Fig. 1 and shows numerous peaks representing atomic and molecular species present in the particle which are observed as either positive or negative ions. Spectra such as those shown in Fig. 1 can be detected at rates up to tens-per-second, depending on the specific instrument configuration and particle concentration. Significant quantities of data (GB/day) can be generated, and because of particle-to-particle variability and temporal variability in the origins of particles that reach the instrument, data should not be averaged as it is acquired. This leads to a significant data analysis challenge.
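
The ∼0.5 pg figure quoted earlier for a 1.0 μm particle of unit density can be checked with a short calculation. The sketch below is illustrative only: the function name `particle_mass_pg` is a hypothetical helper, not part of Enchilada, and it assumes a perfectly spherical particle of uniform density.

```python
import math


def particle_mass_pg(diameter_um: float, density_g_cm3: float) -> float:
    """Mass in picograms of a spherical particle (hypothetical helper)."""
    diameter_cm = diameter_um * 1e-4                 # 1 um = 1e-4 cm
    volume_cm3 = math.pi / 6.0 * diameter_cm ** 3    # sphere volume, (pi/6) d^3
    mass_g = density_g_cm3 * volume_cm3
    return mass_g * 1e12                             # 1 g = 1e12 pg


mass = particle_mass_pg(diameter_um=1.0, density_g_cm3=1.0)
print(f"{mass:.2f} pg")  # prints 0.52 pg
```

Because the mass scales with the cube of the diameter, a 0.1 μm particle of the same density carries roughly a thousand times less material, which is one reason the sensitivity requirements are so strict.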

It is important to realize that, within the realm of mass spectrometers devoted to the on-line, real-time analysis of atmospheric aerosol particles, there are dozens of instrument designs, only two types of which are currently commercially available. Many mass spectrometers have been developed for studies of atmospherically relevant aerosol particles within individual research groups. As there is no standard data format shared by these various instruments, even for those operating on the same fundamental measurement principle, most of the groups involved in instrument development have also developed their own data analysis software. Few of these analysis tools have been reported in the literature, and most are designed specifically to work with the data format of the instrument for which they were designed.

There are many goals that need to be achieved for the analysis of mass spectrometric data from aerosol particles, differing depending on the experiment. Thus, the different analysis tools that have been created contain different features, but in general, tools for use with aerosol mass spectrometry data fall into categories that include: grouping mass spectra by querying databases or by clustering; generating temporal profiles of various particle characteristics (e.g. particle counts matching a query, peak areas for specific ions, etc.); determining traditional aerosol metrics such as mass or number concentration and size distributions for a time period or as a function of time; generating calibrations based on peak area or particle number, compared to other quantitative measurements of particle mass or number concentration; visualization tools; and other tools such as comparison to co-located data and exporting data in forms useful for further analysis.
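
One of the categories above, a temporal profile of particle counts matching a query, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of how any of the cited tools work: the record layout (a `time` field plus a `peaks` mapping from m/z to peak area), the helper names `matches` and `temporal_profile`, and the sample values are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical single-particle records: acquisition time and peak areas by m/z
# (negative keys stand for negative-ion peaks).
particles = [
    {"time": datetime(2009, 1, 1, 0, 5), "peaks": {39: 1200.0, -62: 300.0}},
    {"time": datetime(2009, 1, 1, 0, 40), "peaks": {23: 800.0}},
    {"time": datetime(2009, 1, 1, 1, 10), "peaks": {39: 500.0}},
    {"time": datetime(2009, 1, 1, 1, 55), "peaks": {39: 50.0}},
]


def matches(particle, mz, min_area):
    """Query: does the particle have a peak at `mz` with at least `min_area`?"""
    return particle["peaks"].get(mz, 0.0) >= min_area


def temporal_profile(timestamps, bin_minutes=60):
    """Count timestamps per fixed-width bin, keyed by the start of each bin."""
    width = timedelta(minutes=bin_minutes)
    epoch = datetime(1970, 1, 1)
    counts = Counter()
    for t in timestamps:
        bin_index = (t - epoch) // width      # whole bins elapsed since the epoch
        counts[epoch + bin_index * width] += 1
    return dict(sorted(counts.items()))


selected = [p["time"] for p in particles if matches(p, mz=39, min_area=100.0)]
profile = temporal_profile(selected, bin_minutes=60)
for bin_start, count in profile.items():
    print(bin_start.isoformat(), count)
```

Keying each bin by its start time makes it straightforward to line the profile up against other time-based data streams that have been aggregated to the same bin width.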

In the realm of single-particle mass spectrometry, where the initial data analysis goal is often to simplify a data set by grouping particle spectra by similarity, there are a number of important approaches that have been implemented by various research groups. Table 1 illustrates a variety of types of analysis which can be carried out, references to papers which utilize these techniques for the analysis of data from mass spectrometry of aerosol particles, and references to the original source of the technique. We break down the types of analysis that users are carrying out into the following categories: a) Partitioning-based clustering methods to divide data into clusters. Conceptually, items within a cluster are more similar to each other than to items in other clusters. Mathematically, the precise explanation of this idea varies from technique to technique; b) Hierarchical clustering methods which divide data into a hierarchy of clusters. At the lowest level, each item belongs to its own cluster; at the highest level, all items belong to the same cluster. As one moves up the hierarchy, the number of clusters decreases. An item thus belongs to an entire family of clusters; c) Discriminant analysis methods, which operate on data with an assigned classification, and aim to accurately predict that classification based on other features associated with that data; d) Factor-based methods which find underlying patterns within varied matrix problems based on the data of interest. They typically transform the data into other coordinate systems for purposes of finding underlying structure in the data or for reducing the number of features in the data; and e) Other data processing techniques, which include the standard analysis procedures that most users of single-particle mass spectrometers carry out on their data, and which require no special description, such as matching spectra to libraries, averaging spectra, and carrying out database queries.
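
To make category (a) concrete, the following is a minimal sketch of partitioning-based clustering using plain K-means on normalized spectra. It is not Enchilada's implementation. It assumes each spectrum has already been binned into a fixed-length vector of peak areas, it seeds the centers naively with the first `k` spectra, and the function names and sample data are hypothetical.

```python
import math


def normalize(vector):
    """Scale a spectrum to unit Euclidean length so total signal does not dominate."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(spectra, k, max_iterations=20):
    """Plain K-means; returns (labels, centers). Seeding is deliberately naive."""
    data = [normalize(s) for s in spectra]
    centers = [list(c) for c in data[:k]]     # assumption: first k spectra as seeds
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each spectrum joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in data
        ]
        if new_labels == labels:              # assignments stable: converged
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(data, labels) if label == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical peak-area vectors over four m/z bins.
spectra = [
    [10.0, 0.0, 1.0, 0.0],
    [0.0, 8.0, 0.0, 1.0],
    [9.0, 1.0, 0.0, 0.0],
    [1.0, 9.0, 0.0, 0.0],
]
labels, centers = kmeans(spectra, k=2)
print(labels)  # prints [0, 1, 0, 1]
```

K-medians follows the same loop but replaces the mean in the update step with a per-bin median, while hierarchical methods (category b) instead merge or split clusters step by step to build the family of nested clusters described above.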

While a number of software packages that perform data mining and/or data analysis for environmental purposes have been created (Kanevski et al., 2004, Gibert et al., 2006, Stadler et al., 2006, Wong et al., 2007), some of which specifically target air quality measurements, such as those by Li and Shue (2004) and by Mazzoni et al. (2007), these packages are not designed to directly include the complex data obtained from mass spectrometry experiments. There have been efforts to compare some of the analysis approaches that are commonly used for the analysis of single-particle mass spectrometry data (rather than the analysis tools themselves) (Hinz et al., 2006), focusing on the broader types of techniques. Only two freely available software packages have been developed for the analysis of single-particle mass-spectral data: YAADA (Allen, 2008) and the package entitled Enchilada (Environmental Chemistry through Intelligent Atmospheric Data Analysis), which we introduce here.

Section snippets

Framework for Enchilada

Enchilada was developed, first and foremost, to be generally useful with data from multiple instruments (i.e., varied data formats), and to be freely available. Additionally, Enchilada has been developed with an eye to integrating new and existing data mining tools with other tools which assist in the analysis of atmospheric datasets. Enchilada has the capability of handling the large mass-spectral datasets that are possible with mass spectrometric measurements of aerosol particles, seamlessly

Enchilada software description

Enchilada, which is written in Java, using the canonical distribution by Sun Microsystems (2004), has the capability of importing data in a variety of formats. Single-particle mass-spectral data from the commercial ATOFMS instrument (TSI, Inc.) can be imported and calibrated from a raw binary format designed by the instrument vendor. The layout of this file is fairly complex, but well documented (TSI Incorporated, 2004). Additional single-particle mass spectrometry data from the NOAA PALMS

Data analysis methods

Enchilada provides users with some of the commonly used analysis methods, but it does not include all types of analysis that have been shown to be effective with atmospheric mass-spectral data to date (see Table 1). To this end, Enchilada is an open-source application, licensed under the Mozilla Public License. This allows the community to examine and modify Enchilada to serve local needs. We are continuing to develop new features and analysis methods, as well. In addition, data analyzed within

Software implementation and optimization

Enchilada is built on top of Microsoft's (2005) SQL Server database system, which offers very fast access to very large databases. Using SQL Server provided us with dramatically more capability than using Microsoft Access or simple data files, which some other solutions have used. We experimented with MySQL (Sun Microsystems, 2009) as well, and discovered that its query evaluation strategies were not as powerful as those provided by SQL Server. The power provided by SQL Server was not enough on

Summary and future outlook

In its current form, Enchilada offers users of multiple instruments the capability of handling their datasets, of visualizing mass spectral and temporal data, and of carrying out a variety of clustering methods on mass-spectral data. Enchilada can operate on millions of mass spectra in a reasonable amount of time on standard hardware, and provides users with the opportunity to engage meaningfully with large and complex datasets extremely promptly. The opportunity to combine data from multiple

Acknowledgements

In addition to the authors of this paper, there were others who contributed to the vision and implementation of Enchilada that we wish to thank. Efforts by Raghu Ramakrishnan, Bee-Chung Chen, Zheng Huang, Ilari Shafer, Greg Cipriano, and various individuals at the Wisconsin State Laboratory of Hygiene are much appreciated. We also thank Dan Ling and his staff at the MI DEQ air quality division and the Dearborn Public Schools for assistance with data acquisition in Dearborn, MI. Work on

References (69)

  • D.G. Nash et al. Aerosol mass spectrometry: an introductory review. International Journal of Mass Spectrometry (2006)
  • T.P. Rebotier et al. Aerosol time-of-flight mass spectrometry data analysis: a benchmark of clustering algorithms. Analytica Chimica Acta (2007)
  • D.C. Snyder et al. Estimating the contribution of point sources to atmospheric metals using single-particle mass spectrometry. Atmospheric Environment (2009)
  • X.-H. Song et al. Source apportionment of gasoline and diesel by multivariate calibration based on single particle mass spectral data. Analytica Chimica Acta (2001)
  • M. Stadler et al. Web-based tools for data analysis and quality assurance on a life-history trait database of plants of Northwest Europe. Environmental Modelling and Software (2006)
  • P.V. Tan et al. Chemically-assigned classification of aerosol mass spectra. Journal of the American Society for Mass Spectrometry (2002)
  • I.W. Wong et al. Species at risk: data and knowledge management within the WILDSPACE (TM) decision support system. Environmental Modelling and Software (2007)
  • A. Zelenyuk et al. SpectraMiner, an interactive data mining and visualization software for single particle mass spectroscopy: a laboratory test case. International Journal of Mass Spectrometry (2006)
  • A. Zelenyuk et al. ClusterSculptor: software for expert-steered classification of single particle mass spectra. International Journal of Mass Spectrometry (2008)
  • W. Zhao et al. Comparison of two cluster analysis methods using single particle mass spectra. Atmospheric Environment (2008)
  • W. Zhao et al. Predicting bulk ambient aerosol compositions from ATOFMS data with ART-2a and multivariate analysis. Analytica Chimica Acta (2005)
  • J.O. Allen. YAADA: Software Toolkit to Analyze Single-particle Mass Spectral Data (2008)
  • B.J. Anderson et al. Adapting K-Medians to Generate Normalized Cluster Centers. Sixth SIAM International Conference on Data Mining (2006)
  • B.J. Anderson et al. User-friendly Clustering for Atmospheric Data Analysis (2005)
  • E.R. Badman et al. Miniature mass analyzers. Journal of Mass Spectrometry (2000)
  • J.C. Bezdek. Source Reference: Pattern Recognition with Fuzzy Objective Function Algorithms (1981)
  • P.S. Bradley et al. Clustering via concave minimization
  • M.R. Canagaratna et al. Chemical and microphysical characterization of ambient aerosols with the aerodyne aerosol mass spectrometer. Mass Spectrometry Reviews (2007)
  • L. Chen et al. Cost-based Labeling of Groups of Mass Spectra. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (2004)
  • V. Cherkassky et al. Learning from Data – Concepts, Theory and Methods (1998)
  • C. Dass. Principles and Practice of Biological Mass Spectrometry (2000)
  • W.H.E. Day et al. Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification (1984)
  • R.O. Duda et al. Pattern Classification (2000)
  • Enchilada