Environmental chemistry through intelligent atmospheric data analysis
Introduction
Mass spectrometry is the name given to a collection of analytical techniques that measure the mass-to-charge ratio (m/z) of gas-phase ions generated from a wide variety of material types (air, biological tissue, minerals, etc.). These ions can be formed from material in any phase, using a diverse array of methods such as electron beams, voltage gradients, lasers, and chemical reactions. The ions are analyzed by separating or filtering them with electromagnetic fields; several methods are in common use, all of which are termed “mass spectrometry”. For measurements of complex systems made up of many components, mass spectrometry experiments are routinely coupled on-line to the output of a chemical separation device, for example a gas or liquid chromatograph, which delivers the constituent pure components to the mass spectrometer separated in time. This simplifies the resulting mass spectrum. In cases where this pre-separation is not possible, each chemical compound in the sample can generate anywhere from zero to tens of distinct ions (from the intact molecule, from fragments or rearrangements of it, or via gas-phase reactions), which are overlaid in the resulting mass spectrum. Thus, the details of mass spectrometry experiments and their results can be complex.
There has been an explosion in the use of mass spectrometers in recent decades (Grayson, 2005), with systems becoming smaller and, in some cases, portable. As ionization sources have been discovered or applied in new ways, a wider array of samples can be converted into gas-phase ions for analysis. As the need for detailed information about chemical composition has emerged as a critical scientific challenge in many areas, mass spectrometers have become attractive because they provide near-universal detection. Particularly in the analysis of biological molecules (Dass, 2000) and of environmental samples (Richardson, 2006), mass spectrometry has emerged as a technique of choice. Often, the analysis is done in situ rather than in the laboratory, as more field-deployable mass spectrometer systems are developed (Suess and Prather, 1999; Badman and Cooks, 2000). Mass spectrometers are routinely used to analyze air and water in situ and in real time. Solid samples, such as soil (organic and mineral components) and food products, can also be analyzed in situ, which makes it possible to detect contamination where it exists without traditional sample workup. These advances in instrument design have been accompanied by increases in sensitivity, making analysis of very small samples possible.
In the past decade, mass spectrometry has been increasingly used for the analysis of airborne aerosol particles, which are solids or liquids with diameters from a few nanometers to tens of micrometers (Suess and Prather, 1999; Nash et al., 2006; Canagaratna et al., 2007; Hinz and Spengler, 2007; Murphy, 2007). The specific details of these instruments vary. Many systems are transportable and can be operated in the field or on aircraft for in situ measurements of atmospherically relevant aerosol particles. These could be particles emitted directly from a source that is important to air pollution; particles in the ambient atmosphere, analyzed on-line, in real time and in situ; or particles formed through chemistry occurring under atmospherically relevant conditions in laboratory settings. The analysis is typically done by introducing particle-laden air directly into the mass spectrometer and sampling the particles from the gas stream. The particles are ionized and a mass spectrum is obtained. Common instruments do this in one of two ways: by analyzing individual particles, such that each mass spectrum represents the overlaid chemical components of a single aerosol particle; or by analyzing a small ensemble of particles sampled in a short period of time, such that the mass spectrum represents the overlaid chemical components of that ensemble. In each case the instrumental requirements are strict, as the instrument must be sensitive enough to generate usable data from a very small amount of material (a single spherical particle with a diameter of 1.0 μm and a material density of 1.0 g/cm³ contains ∼0.5 pg of material). Analysis of the resulting data is complicated by the fact that atmospheric particles typically contain multiple compounds, so the resulting mass spectra are generated from mixtures of small amounts of these components.
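The ∼0.5 pg figure quoted above follows from the volume of a sphere, V = (π/6)d³, multiplied by the material density. The short sketch below reproduces that arithmetic; the function name `particle_mass_pg` is chosen here purely for illustration and does not come from any instrument software.

```python
import math


def particle_mass_pg(diameter_um: float, density_g_cm3: float) -> float:
    """Mass, in picograms, of a spherical particle (illustrative helper)."""
    diameter_cm = diameter_um * 1e-4                # 1 um = 1e-4 cm
    volume_cm3 = math.pi / 6.0 * diameter_cm ** 3   # sphere volume from diameter
    mass_g = density_g_cm3 * volume_cm3
    return mass_g * 1e12                            # 1 g = 1e12 pg


# The case quoted in the text: d = 1.0 um, density = 1.0 g/cm^3
print(f"{particle_mass_pg(1.0, 1.0):.2f} pg")  # prints "0.52 pg"
```

Because the mass scales with the cube of the diameter, a 0.1 μm particle of the same density carries one thousandth of this mass, which gives a sense of why sensitivity is such a strict requirement.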
Some instruments generate and/or analyze ions of only one polarity, but numerous instrumental setups are designed to detect both positive and negative ions from individual particles, leading to an even more complex dataset (Suess and Prather, 1999; Nash et al., 2006; Hinz and Spengler, 2007; Murphy, 2007). An example of the mass spectrometric data acquired from a single particle using a commercial single-particle mass spectrometer is presented in Fig. 1. It shows numerous peaks representing atomic and molecular species present in the particle, which are observed as either positive or negative ions. Spectra such as those shown in Fig. 1 can be detected at rates of up to tens per second, depending on the specific instrument configuration and particle concentration. Significant quantities of data (GB/day) can be generated, and because of particle-to-particle variability and temporal variability in the origins of particles that reach the instrument, data should not be averaged as they are acquired. This leads to a significant data analysis challenge.
It is important to realize that, within the realm of mass spectrometers devoted to the on-line, real-time analysis of atmospheric aerosol particles, there are dozens of instrument designs, only two types of which are currently commercially available. Many mass spectrometers have been developed for studies of atmospherically relevant aerosol particles within individual research groups. As there is no standard data format shared by these various instruments, even for those operating on the same fundamental measurement principle, most of the groups involved in instrument development have also developed their own data analysis software. Few of these analysis tools have been reported in the literature, and most are designed specifically to work with the data format of the instrument for which they were designed.
The goals of analyzing mass spectrometric data from aerosol particles are many, and they differ from experiment to experiment. The analysis tools that have been created therefore contain different features, but in general, tools for use with aerosol mass spectrometry data fall into categories that include: grouping mass spectra by querying databases or by clustering; generating temporal profiles of various particle characteristics (e.g. particle counts matching a query, peak areas for specific ions, etc.); determining traditional aerosol metrics such as mass or number concentration and size distributions for a time period or as a function of time; generating calibrations based on peak area or particle number, compared to other quantitative measurements of particle mass or number concentration; visualization tools; and other tools such as comparison to co-located data and exporting data in forms useful for further analysis.
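As an illustration of one of these categories, the sketch below builds a temporal profile of particle counts matching a simple query (here, "the spectrum has a peak at a given m/z"). The record layout, the function name `hourly_counts`, and the sample records are all made up for this example; they do not reflect the data format or query interface of any particular instrument or software package.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: (detection time, set of m/z values at which a peak was found)
particles = [
    (datetime(2009, 1, 1, 0, 5), {23, 39}),
    (datetime(2009, 1, 1, 0, 40), {12, 36}),
    (datetime(2009, 1, 1, 1, 15), {39, 56}),
    (datetime(2009, 1, 1, 1, 50), {39}),
]


def hourly_counts(records, required_mz):
    """Count, per hour, the particles whose spectrum has a peak at required_mz."""
    counts = Counter()
    for when, peaks in records:
        if required_mz in peaks:
            # Truncate the timestamp to the start of its hour to form the time bin
            counts[when.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(sorted(counts.items()))


profile = hourly_counts(particles, 39)
for hour, n in profile.items():
    print(hour.strftime("%Y-%m-%d %H:%M"), n)
```

Counting per time bin, rather than averaging spectra as they arrive, keeps the individual particle records intact so that the same data can be re-queried with a different criterion later.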
In the realm of single-particle mass spectrometry, where the initial data analysis goal is often to simplify a data set by grouping particle spectra by similarity, a number of important approaches have been implemented by various research groups. Table 1 lists the types of analysis that can be carried out, with references to papers that use these techniques to analyze data from mass spectrometry of aerosol particles and references to the original source of each technique. We divide the types of analysis that users carry out into the following categories: a) partitioning-based clustering methods, which divide data into clusters. Conceptually, items within a cluster are more similar to each other than to items in other clusters; mathematically, the precise formulation of this idea varies from technique to technique; b) hierarchical clustering methods, which divide data into a hierarchy of clusters. At the lowest level, each item belongs to its own cluster; at the highest level, all items belong to the same cluster. As one moves up the hierarchy, the number of clusters decreases. An item thus belongs to an entire family of clusters; c) discriminant analysis methods, which operate on data with an assigned classification and aim to predict that classification accurately from other features of the data; d) factor-based methods, which find underlying patterns within varied matrix problems based on the data of interest. They typically transform the data into other coordinate systems in order to find underlying structure in the data or to reduce the number of features in the data; and e) other data processing techniques, which include the standard analysis procedures that most users of single-particle mass spectrometers carry out on their data and which require no special description, such as matching spectra to libraries, averaging spectra, and carrying out database queries.
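To make category a) concrete, the following is a minimal sketch of one partitioning-based method, k-means, applied to spectra represented as vectors of peak areas. It is an illustration under stated assumptions rather than the implementation used by any of the packages discussed here: the toy spectra are invented, spectra are scaled to unit length before comparison, distance is squared Euclidean, and the initial cluster centers are simply the first k spectra.

```python
import math


def normalize(v):
    """Scale a spectrum to unit length so that total signal does not dominate."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(spectra, k, iterations=20):
    """Assign each spectrum to one of k clusters (naive start: first k spectra)."""
    data = [normalize(s) for s in spectra]
    centers = [list(p) for p in data[:k]]
    labels = [0] * len(data)
    for _ in range(iterations):
        # Assignment step: each spectrum joins its nearest center
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in data]
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy spectra: peak areas in four made-up m/z bins
spectra = [
    [10, 1, 0, 0],
    [0, 0, 8, 2],
    [9, 2, 0, 1],
    [1, 0, 9, 3],
]
labels, _ = kmeans(spectra, k=2)
print(labels)  # prints [0, 1, 0, 1]
```

Each spectrum ends up in exactly one cluster, which is what distinguishes this family from the hierarchical methods of category b), where an item belongs to a nested family of clusters.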
While a number of software packages that perform data mining and/or data analysis for environmental purposes have been created (Kanevski et al., 2004; Gibert et al., 2006; Stadler et al., 2006; Wong et al., 2007), some of which specifically target air quality measurements, such as those by Li and Shue (2004) and by Mazzoni et al. (2007), these packages are not designed to handle directly the complex data obtained from mass spectrometry experiments. There have been efforts to compare some of the analysis approaches that are commonly used for single-particle mass spectrometry data (rather than the analysis tools themselves) (Hinz et al., 2006), focusing on the broader types of techniques. Only two freely available software packages have been developed for the analysis of single-particle mass-spectral data: YAADA (Allen, 2008) and the package entitled Enchilada (Environmental Chemistry through Intelligent Atmospheric Data Analysis), which we introduce here.
Framework for Enchilada
Enchilada was developed, first and foremost, to be generally useful with data from multiple instruments (i.e., varied data formats), and to be freely available. Additionally, Enchilada has been developed with an eye to integrating new and existing data mining tools with other tools which assist in the analysis of atmospheric datasets. Enchilada has the capability of handling the large mass-spectral datasets that are possible with mass spectrometric measurements of aerosol particles, seamlessly
Enchilada software description
Enchilada, which is written in Java, using the canonical distribution by Sun Microsystems (2004), can import data in a variety of formats. Single-particle mass-spectral data from the commercial ATOFMS instrument (TSI, Inc.) can be imported and calibrated from a raw binary format designed by the instrument vendor. The layout of this file is fairly complex, but well documented (TSI Incorporated, 2004). Additional single-particle mass spectrometry data from the NOAA PALMS
Data analysis methods
Enchilada provides users with some of the commonly used analysis methods, but it does not include all types of analysis that have been shown to be effective with atmospheric mass-spectral data to date (see Table 1). To this end, Enchilada is an open-source application, licensed under the Mozilla Public License. This allows the community to examine and modify Enchilada to serve local needs. We are also continuing to develop new features and analysis methods. In addition, data analyzed within
Software implementation and optimization
Enchilada is built on top of Microsoft's (2005) SQL Server database system, which offers very fast access to very large databases. Using SQL Server provided us with dramatically more capability than using Microsoft Access or simple data files, which some other solutions have used. We experimented with MySQL (Sun Microsystems, 2009) as well, and discovered that its query evaluation strategies were not as powerful as those provided by SQL Server. The power provided by SQL Server was not enough on
Summary and future outlook
In its current form, Enchilada offers users of multiple instruments the capability of handling their datasets, of visualizing mass-spectral and temporal data, and of carrying out a variety of clustering methods on mass-spectral data. Enchilada can operate on millions of mass spectra in a reasonable amount of time on standard hardware, and lets users engage promptly and meaningfully with large and complex datasets. The opportunity to combine data from multiple
Acknowledgements
In addition to the authors of this paper, others contributed to the vision and implementation of Enchilada, and we wish to thank them. Efforts by Raghu Ramakrishnan, Bee-Chung Chen, Zheng Huang, Ilari Shafer, Greg Cipriano, and various individuals at the Wisconsin State Laboratory of Hygiene are much appreciated. We also thank Dan Ling and his staff at the MI DEQ air quality division and the Dearborn Public Schools for assistance with data acquisition in Dearborn, MI. Work on
References (69)
- et al. ART 2-A: an adaptive resonance algorithm for rapid category learning and recognition. Neural Networks (1991)
- et al. Source apportionment of 1 h semi-continuous data during the 2005 Study of Organic Aerosols in Riverside (SOAR) using positive matrix factorization. Atmospheric Environment (2008)
- et al. GESCONDA: an intelligent data analysis system for knowledge discovery and management in environmental databases. Environmental Modelling and Software (2006)
- et al. Characterisation of single particles from in-port ship emissions. Atmospheric Environment (2009)
- et al. Chemical classes of atmospheric aerosol particles at a rural site in Central Europe during winter. Journal of Aerosol Science (2002)
- et al. Data processing in on-line laser mass spectrometry of inorganic, organic, or biological airborne particles. Journal of the American Society for Mass Spectrometry (1999)
- et al. Comparative parallel characterization of particle populations with two mass spectrometric systems LAMPAS 2 and SPASS. International Journal of Mass Spectrometry (2006)
- et al. Environmental data mining and modeling based on machine learning algorithms and geostatistics. Environmental Modelling and Software (2004)
- et al. Data mining to aid policy making in air pollution management. Expert Systems with Applications (2004)
- et al. A data-mining approach to associating MISR smoke plume heights with MODIS fire measurements. Remote Sensing of Environment (2007)