Using neural networks to detect patterns in inter-specific data: An example from net-spinning caddisflies (Trichoptera: Annulipalpia)
Introduction
Over the past 20+ years, many statistical techniques have been developed to compare inter-specific data (i.e., phylogenetic comparative methods = PCMs). As currently employed, PCMs are extensions of linear statistics well known to ecologists and evolutionary biologists (e.g., ANOVA, linear regression) that are used to compare data among closely related species without violating the assumptions of the standard statistical techniques (Bell, 1989, Freckleton et al., 2002, Harvey and Pagel, 1991, Cheverud et al., 1985, Felsenstein, 1985, Garland et al., 1992, Grafen, 1989, Huelsenbeck et al., 2000, Lynch, 1991, Martins et al., 2002, Martins and Hansen, 1996). For example, ANOVA and regression require that each data point represent an independent observation. Direct comparisons among species using either method violate this assumption because species are evolutionarily related and therefore are not independent observations (i.e., pseudo-replication). PCMs overcome the problem of non-independence by using phylogenetic information to create a new set of biological data based on the original data. The new, ‘phylogenetically-corrected’ data set is independent of phylogeny and can be used in the context of ANOVA and regression without violating the assumption of independence (Felsenstein, 1985, Grafen, 1989, Martins et al., 2002).
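To make the idea of a 'phylogenetically-corrected' data set concrete, the following is a minimal sketch of one widely used PCM, independent contrasts (Felsenstein, 1985), assuming a Brownian-motion model of trait change. The tree encoding, the function name `independent_contrasts`, the branch lengths, and the trait values are all hypothetical and chosen only for illustration; they are not data from this study.

```python
import math

def independent_contrasts(node):
    """Return (ancestral_estimate, extra_branch_length, contrasts).

    A tip is a plain float (the trait value of one species).
    An internal node is a tuple: (left, left_length, right, right_length).
    """
    if not isinstance(node, tuple):
        # Tip: the observed value, no extra branch length, no contrasts yet.
        return float(node), 0.0, []

    left, left_len, right, right_len = node
    x_l, extra_l, c_l = independent_contrasts(left)
    x_r, extra_r, c_r = independent_contrasts(right)

    # Branches leading to estimated (internal) values are lengthened to
    # account for the uncertainty of the estimate.
    v_l = left_len + extra_l
    v_r = right_len + extra_r

    # Standardized contrast: difference scaled by its expected std. deviation.
    contrast = (x_l - x_r) / math.sqrt(v_l + v_r)

    # Ancestral value: average weighted by the inverse of branch length.
    x_anc = (x_l / v_l + x_r / v_r) / (1.0 / v_l + 1.0 / v_r)
    extra = v_l * v_r / (v_l + v_r)
    return x_anc, extra, c_l + c_r + [contrast]

# Hypothetical three-species tree ((A:1, B:1):1, C:2) with trait values
# A = 2.0, B = 4.0, C = 9.0.
tree = ((2.0, 1.0, 4.0, 1.0), 1.0, 9.0, 2.0)
root_value, _, contrasts = independent_contrasts(tree)
print(contrasts)
```

Three species yield two contrasts. Under the stated model these contrasts, rather than the raw species values, are treated as the independent observations in a subsequent regression or ANOVA.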
PCMs are the best tool for comparing biological data among different species and have, accordingly, become the standard statistical approach. However, PCMs have limitations. In this paper, we introduce neural networks as an alternative to PCMs when statistical or biological assumptions cannot be met (see below), or as a complementary analysis to supplement the results of a PCM analysis. The assumptions underlying PCMs often cannot be met with the biological data at hand, and we suggest that neural networks provide a second-best option under these conditions. Two problems arise when using PCMs. First, most PCMs require at least partial, a priori information about phylogenetic relationships. For many taxa, phylogenetic information is inadequate or unavailable. In these cases, PCMs can provide a distorted picture of inter-specific variation or cannot be applied at all. Second, PCMs make non-trivial assumptions that are rarely addressed (Martins, 2000). For example, assumptions are made about the type of evolutionary change being modelled (Felsenstein, 1985), the time since divergence between phylogenetic groups (Felsenstein, 1985, Grafen, 1989), and the underlying relationships among traits (Quader et al., 2004). These assumptions concern fundamental biological processes and therefore strongly influence the interpretation of PCM results; nevertheless, they are rarely tested or discussed, and they are at the centre of the current controversy over the use of PCMs (Martins, 2000). Below, we introduce and employ neural networks for pattern recognition, visualization, and simplification of multidimensional morphological data when PCMs cannot be used, i.e., when phylogenies are inaccurate or missing or when assumptions are violated.
Our goal is two-fold: 1) to provide comparative biologists with another potential tool for analyzing large, complex data sets; and 2) to stimulate research on the use of neural networks in evolutionary biology.
Neural networks (NNs) are made up of interconnected processing elements (i.e., “neurons”), which respond in parallel to a set of input signals (i.e., data). NNs have been used to explore biological systems mainly in two ways: 1) as models of biological nervous systems; and 2) as data analytic methods (Sarle, 1994). We suggest here that NNs can be a useful tool for pattern recognition and visualization of multivariate inter-specific morphological data. First, neural networks are highly robust to underlying data distributions and make no assumptions about the independence of data points or relationships among variables (Bishop, 1995). Second, in addition to describing inter-specific patterns of phenotypic variation, hidden structure within the data, such as phylogenetic affinities or gender classification, emerges in the output. In this way, NNs use continuous variables to predict categorical membership (e.g., taxon, sex). To this end, NNs are widely used as categorization tools (Bishop, 1995, Haykin, 1999) and could be used by biologists to provide guidance in placing individuals or species within appropriate taxonomic groups. Finally, NNs are similar to other multivariate techniques such as principal component analysis (Sarle, 1994) and can be used to collapse complex multivariate data sets into two-dimensional space. The advantage is that complex phenotypic data (e.g., morphological geometry) can be collapsed and scaled so that taxa can be directly compared. To date, NNs have received limited application in biological categorization and pattern recognition (although see Dopazo and Carazo, 1997). As far as we are aware, no other studies have used NNs to examine continuous morphological data among closely related species.
We used Kohonen self-organizing maps (a type of NN) to study patterns of multivariate sexual dimorphism in 40 species of adult caddisflies whose aquatic larvae spin silken nets to capture food particles in streams and rivers of North America (Trichoptera: suborder Annulipalpia). We chose this group of caddisflies because qualitative studies suggest that different species might show alternative patterns of adult morphological variation between the sexes (i.e., sexual dimorphism) (Betten, 1934, Ross, 1944), signifying that diverse mechanisms might underlie this variation. These reports indicate that adult traits involved in dispersal, mating, and oviposition (Deutsch, 1985, Gullefors and Petersson, 1993, Petersson, 1989, Petersson, 1990, Petersson and Solem, 1987) are sexually dimorphic in many species (wings, antennae, tibiae, eyes: Betten, 1934, Deutsch, 1985, Malicky, 1977, Ross, 1944). Although information on larval caddisfly ecology is extensive (Wiggins, 1996), few data are available on adult ecology, behaviour, and reproduction (Halat and Resh, 1997, Plague, 1999). We have previously quantified patterns of body size dimorphism in these species (Jannot and Kerans, 2003). Our goals are to: 1) demonstrate the use of NNs in biological pattern recognition and categorization by 2) quantifying patterns of multivariate sexual dimorphism, and thus 3) provide a basis for generating hypotheses about the ecology and evolution of sexual dimorphism in North American caddisflies.
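For readers unfamiliar with Kohonen self-organizing maps, the following is a minimal sketch of the general technique using only the Python standard library. It is not the implementation used in this study: the 4 x 4 map size, the linear decay of the learning rate and neighbourhood radius, the function names, and the two synthetic groups of three-variable "individuals" are all assumptions made for illustration.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the map coordinate whose weight vector is closest to x."""
    return min(weights, key=lambda u: sum((w - xi) ** 2
                                          for w, xi in zip(weights[u], x)))

def train_som(data, rows=4, cols=4, epochs=200, lr0=0.5, seed=0):
    """Train a small rectangular Kohonen map; returns {(row, col): weights}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One randomly initialised weight vector per map unit.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                # learning rate decays
        sigma = sigma0 * (1.0 - frac) + 0.5    # neighbourhood radius shrinks
        for x in rng.sample(data, len(data)):  # random presentation order
            bmu = best_matching_unit(weights, x)
            for unit, w in weights.items():
                d2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                # Pull the unit toward the input, scaled by map distance.
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights

# Synthetic, well-separated groups standing in for two categories
# (e.g., sexes or taxa); values are already scaled to roughly [0, 1].
data_rng = random.Random(1)
group_a = [[0.2 + data_rng.uniform(-0.05, 0.05) for _ in range(3)]
           for _ in range(20)]
group_b = [[0.8 + data_rng.uniform(-0.05, 0.05) for _ in range(3)]
           for _ in range(20)]
data = group_a + group_b

som = train_som(data, seed=0)
units_a = {best_matching_unit(som, x) for x in group_a}
units_b = {best_matching_unit(som, x) for x in group_b}
print("group A units:", sorted(units_a))
print("group B units:", sorted(units_b))
```

Each individual is assigned to the map unit it matches best, so the three-variable data are collapsed onto a two-dimensional grid. Because initial weights and presentation order are random, a different `seed` would typically place the groups at different map positions, which is one reason to compare several runs on the same data.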
Specimens and measurements
We have described our taxonomic choices and methods in detail elsewhere (Jannot and Kerans, 2003); therefore, we provide only a brief synopsis here. Adult caddisflies preserved in ethanol were obtained from museums and collections in North America. When possible, we measured a minimum of 20 individuals per sex; however, sample sizes varied among species and sexes, depending upon availability (see Table 1, Table 2). All measurements and sex determinations were made with a zoom stereoscopic
Results
Even though the map represents real, consistently obtainable relationships between vectors in the data set, the procedure for obtaining the map is stochastic. Thus, the position of the clusters on the map can change even as the composition of clusters remains consistent. Therefore, it is useful to investigate the structure of several maps of the same data to discover what features of the Kohonen map are consistent. We examined five independent runs of the Kohonen network and each gave
Discussion
We have shown that Kohonen networks can provide interesting and useful data concerning inter-specific morphological pattern recognition. Kohonen networks were able to accurately predict the sex and taxonomic grouping of individual caddisflies and proved useful in highlighting the importance of single morphological variables in predicting sex or taxa. Overlaying the Kohonen network maps onto our current phylogenetic hypothesis of caddisfly evolution provided striking patterns of
Acknowledgements
We thank the museum curators who provided us with specimens (a list can be found in Jannot and Kerans, 2003). Earlier versions of this manuscript benefited from the comments of A. Boyko and two anonymous reviewers. J.E.J., O.A. and K.C. were supported by the NSF Cross-disciplinary Research at Undergraduate Institutions (CRUI) grant # DBI-0442412 to D. Whitman, D. Borst, S. A. Juliano, and O. Akman.
References (36)
Martins. Adaptation and the comparative method. Trends Ecol. Evol. (2000)
Bell. A comparative method. Am. Nat. (1989)
Betten. The caddisflies of New York. N.Y. State Mus. Bull. (1934)
Bishop. Neural Networks for Pattern Recognition. (1995)
Cheverud et al. The quantitative assessment of phylogenetic constraints in comparative analyses: sexual dimorphism in body weight among primates. Evolution (1985)
Deutsch. Swimming modifications of adult female Hydropsychidae compared with other Trichoptera. Freshw. Invertebr. Biol. (1985)
Dopazo and Carazo. Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol. (1997)
Felsenstein. Phylogenies and the comparative method. Am. Nat. (1985)
Freckleton et al. Phylogenetic analysis and comparative data: a test and a review of evidence. Am. Nat. (2002)
Garland et al. Procedures for the analysis of comparative data using phylogenetically independent contrasts. Syst. Biol. (1992)