1 Introduction

Seagrass is an important component of coastal ecosystems for ecological, economic and social well-being reasons [5, 19]. Ecologically, seagrass provides multiple benefits such as pollution filtering, sediment trapping and organic fertilization. Economically, seagrass ecosystems are considered 23 and 33 times more valuable than terrestrial and oceanic ecosystems, respectively [19]. However, reliable information about the distribution of seagrass is not tracked in most parts of the world due to the high cost of comprehensive mapping [19]. In this paper, we propose a deep learning model for the detection of seagrass in a given area based on multispectral images taken from operational satellite remote sensing platforms. The developed method can quantify the leaf area index (LAI) for each valid pixel within a scene. LAI is defined as leaf area per unit ground area [2], and it is considered one of the most important biophysical components of seagrass [19].

The goal of this project is to develop a model that is able to quantify LAI from high resolution satellite imagery with limited field observations that may be ubiquitously applied to other localities. To achieve this goal, the following two questions need to be answered: (1) Do the satellite multispectral images contain enough information for a machine learning model to learn and quantify LAI in the same region? (2) Can a machine learning model trained at one location be generalized to other locations for LAI mapping?

To answer the first question, we utilize high-resolution multispectral images taken by the Worldview-2 (WV-2) satellite with a resolution of 1.24 m in the 8 visible and near infrared (VNIR) bands to train a deep capsule network (DCN) for LAI quantification. Historically, an experienced operator labeled each image into four classes (sea, sand, seagrass and land) and applied a physics model [6] in the seagrass regions to map LAI. The physics model has a known error rate of 10% [6], so not all the labeled regions are suitable for training machine learning models. Therefore, an experienced operator selected the most confident regions in the images to train the DCN. To answer the second question, we train a DCN at one location using the multispectral images and develop transfer learning strategies to adapt the trained model to new locations for LAI quantification. The key contributions of this paper are: (1) an innovative deep capsule network that performs classification and regression simultaneously to achieve end-to-end seagrass quantification, and (2) the first attempt to apply transfer learning to a DCN for satellite-derived seagrass quantification at different locations.

The deep capsule network is jointly optimized for classification and regression so that it can perform end-to-end seagrass quantification. In addition, transfer learning strategies are developed to adapt a deep capsule network trained at one location for seagrass quantification at different locations. To the best of our knowledge, we are among the first to apply capsule networks to seagrass mapping in the remote sensing community. The remainder of the paper reviews the literature in Sect. 2, describes the proposed methods in Sect. 3, presents (in Sect. 4) and discusses (in Sect. 5) the experimental results, and finally concludes the paper in Sect. 6.

2 Related Work

2.1 Deep Learning

Deep learning has been successful in numerous fields such as image classification [10], speech recognition [7], medical imaging [15], cybersecurity [3, 11] and remote sensing [12, 14]. Among different deep learning models, deep Convolutional Neural Networks (CNNs) are currently popular for various applications. A CNN consists of a set of convolutional layers followed by several fully connected layers. The convolutional layers learn effective representations of the raw data and the fully connected layers perform classification or regression based on the learned representations. Many CNN-based image classification systems perform end-to-end learning in which feature extraction (representation learning) is jointly optimized with classification, and it is believed that this automatic feature learning process plays a critical role in achieving the superb performance of CNNs.

2.2 Capsule Networks

Capsule networks are a promising deep learning method recently introduced by Sabour et al. [17]. A capsule is a group of neurons that represents the instantiation parameters of an entity in the input image, while the length of the capsule represents the posterior probability that the entity exists in the input image. The capsule network obtained a 99.75% accuracy on the MNIST dataset, which, at the time of writing, represents the state of the art. Additionally, the network has shown promising results on the classification of overlapping images. An improved version of the capsule network, recently released by the same research group, achieved the state-of-the-art result on the smallNORB benchmark dataset [8].
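To make the "length as probability" idea concrete, the following is a minimal NumPy sketch of the squashing nonlinearity described by Sabour et al. [17], which rescales a capsule's raw output so that its length lies between 0 and 1 while its direction is preserved. The function name `squash` and the small constant `eps` (added only for numerical stability) are illustrative choices, not taken from this paper.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Rescale capsule vectors along the last axis so their length lies in [0, 1).

    Short vectors shrink toward zero length and long vectors approach unit
    length, so the length can be read as an existence probability.
    `eps` is a hypothetical stabilizer that avoids division by zero.
    """
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / np.sqrt(sq_norm + eps)

# A raw capsule output of length 5 is squashed to a length just below 1.
v = squash(np.array([3.0, 4.0]))
print(np.linalg.norm(v))
```

The direction of the vector still carries the instantiation parameters; only its magnitude is compressed.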

An interesting characteristic of the capsule network is that it can reconstruct an input image by using the outputs of the capsule vectors. The last layer of the capsule network consists of a set of capsule vectors, each of them corresponding to one class in the dataset. During training, the capsule vector corresponding to the training label of the image is used to reconstruct the input image as a regularization for the optimization. The error between the reconstructed image and the input image is then used to optimize both the reconstruction weights and the weights in the capsule network through back propagation. It has been shown that the reconstruction part is a significant contributor to the overall excellent performance on the MNIST dataset.

There are a few recent works on the capsule network in the literature. Xi et al. [20] investigated the application of capsule networks on the classification of the CIFAR-10 dataset. The best accuracy they obtained was 71.55%, which is far from the state-of-the-art (96.53%). In addition, it was demonstrated that the reconstruction network did not perform well when applied to a high-dimensional image. In our companion paper [9], a capsule network was designed as a generative model to adapt a trained capsule model to new locations for seagrass identification. To the best of our knowledge, we are the first to design a capsule network for seagrass quantification.

2.3 Seagrass LAI Mapping

The majority of seagrass mapping studies focus on assessing the accuracy of manual mapping methods [16, 18]. Yang et al. [21] manually computed the distribution of seagrass from satellite images using a remote sensing method. They obtained an accuracy slightly better than 80%. However, their approach only determined whether seagrass was present in a region instead of quantifying the LAI. An automatic algorithm for seagrass LAI mapping was implemented by Wicaksono et al. [19]. In this case, they provided regression results and obtained a best standard error of estimate of 0.72.

2.4 Transfer Learning

Transfer learning is a technique that uses a model that has gained knowledge from one domain to solve problems in different but similar domains where training data is limited [13]. One of the first successful attempts to use transfer learning in deep learning was reported in DeCAF [4], in which Donahue et al. investigated the problem of generalizing a CNN trained on ImageNet [10] to other problem domains. Their transferred model outperformed the state of the art by extracting features directly from the trained CNN and training a simple classifier on those features. A different approach was carried out by Yosinski et al. [22]. In that study, they tested whether the features learned by an 8-layer CNN on one dataset could be applied to another dataset. To achieve this goal, they froze the first few layers in the model and retrained the remaining layers on the new dataset. Their experiments demonstrated that fine-tuning the whole network obtained the best accuracy. Banerjee et al. [1] showed how transfer learning with a deep belief network (DBN) can be utilized to improve the diagnosis of post-traumatic stress disorder (PTSD).

3 Methodology

3.1 Datasets

We collected three different multispectral images captured by the Worldview-2 (WV-2) satellite in Florida coastal areas. For each image, pixels were classified into four classes (sea, land, seagrass and sand) and LAI mappings for seagrass pixels were computed by the physics model [6].

3.2 Data Labeling

In the original physics model, not all pixels were validated with field observations for LAI quantification. Comparison of coincident field observations and satellite pixels demonstrated there was a 10% error rate in the LAI mapping [6]. Therefore, the LAI mapping was not treated as ground truth for model training. However, there were certain regions in the satellite images where the mappings were more reliable.

Fig. 1. Satellite images taken from Saint Joseph Bay on 11/10/2010 (a), Keeton Beach on 05/20/2010 (b) and Saint George Sound on 04/27/2012 (c). The blue, cyan, green and red boxes correspond to the selected regions for training belonging to sea, sand, seagrass and land, respectively. (Color figure online)

Fig. 2. LAI mappings of satellite images by the physics model [6] taken from Saint Joseph Bay on 11/10/2010 (a), Keeton Beach on 05/20/2010 (b) and Saint George Sound on 04/27/2012 (c). (Color figure online)

In this study, an experienced operator (a co-author of the physics model in [6]) selected several regions in each of the three images with the highest confidence in the labeling and the LAI mapping. These regions are marked as blue, cyan, green and red boxes, corresponding to sea, sand, seagrass and land, respectively (Fig. 1). The LAI mapping is represented on a continuous rainbow color scale where blue is the minimum LAI (0) and red is the maximum (Fig. 2). To ensure the reliability of quantification results, we trained our models only on those selected regions. We noticed that the datasets extracted from the selected regions were highly unbalanced, especially those obtained from Keeton Beach and St. George Sound. We balanced the datasets for training by upsampling or downsampling the classes in the data.

3.3 Joint Optimization of Classification and Regression in Capsule Networks

Figure 3 shows the structure of the capsule network that is designed for seagrass LAI mapping. The model needs to handle multispectral image patches with a size of \(5\,\times \,5\,\times \,8\). The first convolutional layer in the capsule network has 32 kernels with a size of \(2\,\times \,2\,\times \,8\) and a stride of 1. The PrimaryCaps layer has 8 blocks of \(3\,\times \,3\,\times \,8\) capsules produced by 64 kernels of size \(2\,\times \,2\,\times \,32\) in the second convolutional layer.
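The layer sizes above can be checked with a few lines of arithmetic. The sketch below assumes convolutions without padding and a stride of 1 in the second layer as well (neither is stated explicitly in the text); under those assumptions the 5 × 5 patch shrinks to the 3 × 3 grid reported for PrimaryCaps. The helper name `conv_out` is hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Spatial output size of a convolution without padding (an assumption)."""
    return (size - kernel) // stride + 1

patch = 5                                  # 5 x 5 x 8 input patch
after_conv1 = conv_out(patch, 2)           # 2 x 2 kernels, stride 1
after_conv2 = conv_out(after_conv1, 2)     # 2 x 2 kernels, stride assumed to be 1

capsule_dim = 8                            # each primary capsule has 8 components
blocks = 64 // capsule_dim                 # 64 kernels grouped into 8-D capsules
n_primary_capsules = after_conv2 * after_conv2 * blocks

print(after_conv1, after_conv2, blocks, n_primary_capsules)
```

The 64 feature maps of the second convolution are read here as 8 blocks of 8-dimensional capsules on the 3 × 3 grid, which matches the "8 blocks of 3 × 3 × 8" description.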

Fig. 3. Structure of the proposed capsule network for classification and regression of LAI.

The reconstruction block in the original capsule network [17] is redesigned as a linear regression model for LAI quantification based on the seagrass vector in the FeatureCaps layer. The LAI index of a patch is defined as the LAI index of the center pixel in the input patch. The structure in Fig. 3 enables us to jointly optimize classification and LAI regression. The FeatureCaps layer performs classification of the four classes (sea, land, seagrass and sand) with a separate margin loss as in [17].

When a seagrass patch is inputted during training, we mask out all but the seagrass vector in the FeatureCaps to regress its LAI. The mapping error is then back propagated to optimize all weights in the network, thus jointly optimizing classification and regression. For all other types of patches during training, the regression part is ignored. The number of routings in the DCN is set to 3.
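The joint objective described above can be sketched for a single patch as follows. This is an illustration under stated assumptions rather than the authors' implementation: the margin-loss constants (0.9, 0.1, 0.5) are the defaults from Sabour et al. [17], while the class ordering, the squared-error form of the regression term and the weighting factor `alpha` are hypothetical, since the text does not specify them.

```python
import numpy as np

CLASSES = ("sea", "land", "seagrass", "sand")   # ordering is an assumption
SEAGRASS = CLASSES.index("seagrass")

def margin_loss(lengths, label, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss of [17] for one patch; `lengths` holds the 4 capsule lengths."""
    target = np.zeros_like(lengths)
    target[label] = 1.0
    present = target * np.maximum(0.0, m_pos - lengths) ** 2
    absent = lam * (1.0 - target) * np.maximum(0.0, lengths - m_neg) ** 2
    return float(np.sum(present + absent))

def joint_loss(caps, label, lai, w, b, alpha=1.0):
    """Classification loss plus a masked regression loss for one patch.

    caps : (4, 16) FeatureCaps output, one 16-D vector per class
    w, b : weights (16,) and bias of the linear LAI regressor
    alpha: hypothetical weight balancing the two terms
    """
    loss = margin_loss(np.linalg.norm(caps, axis=1), label)
    if label == SEAGRASS:
        # Only the seagrass vector feeds the regressor; other capsules are masked out.
        pred = caps[SEAGRASS] @ w + b
        loss += alpha * (pred - lai) ** 2
    return float(loss)
```

For non-seagrass patches the function returns the margin loss alone, which mirrors the statement that the regression part is ignored for those patches.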

3.4 Transfer Learning with Capsule Networks

The ultimate goal of this project is to develop a model that can quantify LAI from high-resolution satellite imagery with limited field observations and that may be ubiquitously applied to other locations. However, the distribution of seagrass LAI at different locations shows a wide range of variation, and it is difficult to collect enough ground truth data at each location to train a location-specific machine learning model. We propose a transfer learning approach that uses the features from the FeatureCaps layer to generalize the trained capsule network to different locations with minimal information from the new locations. First, we train a capsule network with all the patches from the selected regions at St. Joseph Bay. Then, we select a few labeled patches from one of the other two images (Keeton Beach or St. George Sound), and pass those samples through the trained capsule network to retrieve all 64 features from FeatureCaps as new representations for the patches. These labeled representations are then used to classify all other patches from that location based on the 1-nearest neighbor (1-NN) rule. Separately, we extract the seagrass vector in the FeatureCaps (16 features) corresponding to the selected labeled seagrass patches and train a linear regression model to predict LAI. Finally, we predict LAI using the linear model for all patches that are classified as seagrass by the 1-NN rule and set LAI to ‘0’ for non-seagrass patches.
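The transfer steps above can be sketched with NumPy, assuming the 64 FeatureCaps features of each patch have already been extracted by the trained network. The class index for seagrass, the position of the 16-D seagrass vector inside the flattened 64-D feature vector, and the function names are all hypothetical; the least-squares fit stands in for whatever linear regression routine was actually used.

```python
import numpy as np

SEAGRASS = 2                    # assumed class index for seagrass
SEAGRASS_COLS = slice(32, 48)   # assumed position of the seagrass vector in the 64 features

def fit_lai_regressor(feats, labels, lai):
    """Least-squares linear model on the seagrass vector of labeled seagrass patches."""
    mask = labels == SEAGRASS
    X = np.hstack([feats[mask, SEAGRASS_COLS], np.ones((mask.sum(), 1))])
    coef, *_ = np.linalg.lstsq(X, lai[mask], rcond=None)
    return coef                 # 16 weights plus one bias term

def predict_lai(feats, ref_feats, ref_labels, coef):
    """Classify patches by 1-NN on all 64 features, then regress LAI for seagrass."""
    d2 = ((feats[:, None, :] - ref_feats[None, :, :]) ** 2).sum(axis=2)
    pred_labels = ref_labels[d2.argmin(axis=1)]
    X = np.hstack([feats[:, SEAGRASS_COLS], np.ones((len(feats), 1))])
    lai = np.where(pred_labels == SEAGRASS, X @ coef, 0.0)   # '0' for non-seagrass
    return pred_labels, lai
```

Here `ref_feats` and `ref_labels` are the few labeled patches from the new location. The pairwise distance matrix is computed in one shot, so for a full image the query patches would need to be processed in batches.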

3.5 Models for Comparison

For comparison purposes, we design a CNN with a complexity similar to that of the capsule network in representation learning. The CNN has 2 convolutional layers with 32 \(2\,\times \,2\) and 16 \(4\,\times \,4\) convolutional kernels, respectively. The fully connected layer has 16 hidden units to match the dimension of the seagrass vector in FeatureCaps. The last layer of the CNN performs linear regression to quantify LAI. We also implement a simple linear regression model applied directly to the image patch as the baseline method for comparison.

4 Experiments and Results

4.1 Model Structure Determination

We use the selected labeled regions at St. Joseph Bay to determine the parameters of the models. After cross-validation (CV) over different model choices, the patch size is set to \(5\,\times \,5\,\times \,8\), and the capsule network has two convolutional layers with 32 and 64 kernels, respectively. We make sure that the number of parameters for representation learning is roughly the same in the CNN and the capsule network: each has around 9k such parameters, plus 17 parameters (including the bias term) for the linear LAI mapping in the last layer.

It is worth noting that the capsule network has about 38k additional parameters in the capsule layers for routing, bringing its total to about 46k, which is approximately 5 times as many as in the CNN. We therefore train the CNN for one fifth as many epochs as the capsule network: 10 epochs for the CNN and 50 epochs for the capsule network.
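The parameter counts quoted here can be roughly reproduced from the layer sizes given in Sects. 3.3 and 3.5. The sketch below assumes each convolution kernel has a bias, and that routing uses one 8 × 16 transformation matrix per (primary capsule, class capsule) pair with no weight sharing and no bias; these details are not stated in the text. Under those assumptions the counts land close to, but not exactly on, the rounded figures above.

```python
def conv_params(kh, kw, c_in, c_out):
    """Weights plus one bias per output channel (bias is an assumption)."""
    return kh * kw * c_in * c_out + c_out

# Capsule network: two convolutional layers for representation learning.
caps_conv = conv_params(2, 2, 8, 32) + conv_params(2, 2, 32, 64)

# Routing weights: 3 x 3 x 8 primary capsules (8-D) to 4 class capsules (16-D),
# assuming one 8 x 16 matrix per capsule pair.
n_primary = 3 * 3 * 8
routing = n_primary * 4 * 8 * 16
caps_total = caps_conv + routing

# CNN: 32 kernels of 2 x 2 followed by 16 kernels of 4 x 4.
cnn_conv = conv_params(2, 2, 8, 32) + conv_params(4, 4, 32, 16)

# Linear LAI head on a 16-D vector: 16 weights plus the bias term.
linear_head = 16 + 1

print(caps_conv, routing, caps_total, cnn_conv, linear_head)
```

Both networks come out at roughly 9k convolutional parameters, and the capsule total is close to 46k, in line with the approximate 5:1 ratio used to set the number of training epochs.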

4.2 Cross-Validation in Selected Regions

For each image, we perform a 3-fold cross-validation (CV) in the selected regions where the classification and LAI mapping are more reliable. We compute the root mean squared errors (RMSEs) for comparison among different models. Table 1 shows that the capsule network produces the best results at the majority of the locations.
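As an illustration of the evaluation protocol, the sketch below runs a k-fold cross-validation and reports the RMSE per fold for the simplest of the three models, a linear regression on flattened patch values. The random fold assignment, the seed and the function names are assumptions; the text does not describe how the folds were drawn.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def kfold_rmse(X, y, k=3, seed=0):
    """RMSE of a least-squares linear model on each of k random folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Append a column of ones so the model includes a bias term.
        A = np.hstack([X[train], np.ones((len(train), 1))])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        B = np.hstack([X[test], np.ones((len(test), 1))])
        scores.append(rmse(y[test], B @ coef))
    return scores
```

For the linear baseline, each row of `X` would hold the 200 values of a flattened 5 × 5 × 8 patch and `y` the LAI of its center pixel; the same fold split could be reused to score the CNN and the capsule network.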

Table 1. 3-fold CV results (RMSEs) on selected regions.

4.3 End-to-End LAI Mapping

We use all patches from the selected regions to train the capsule network for LAI mapping. During training, the capsule model first classifies a patch (with a size of \(5\,\times \,5\,\times \,8\)) as one of the four classes. It then maps seagrass patches to an LAI value and sets the LAI of non-seagrass patches to ‘0’. Therefore, the model performs end-to-end mapping by jointly optimizing classification and regression. The trained models are then applied to the whole images to produce LAI mappings.

To illustrate the effect of the end-to-end learning, we train the linear model and the CNN with seagrass patches and non-seagrass patches (with LAI = 0) in the selected regions and show the full LAI mappings (Figs. 4, 5 and 6). Note that these figures are shown for visualization only because the physics model mapping should not be considered as ground truth. Only the accuracies computed in the selected regions (Table 1) should be used as performance metrics for model comparison.

Fig. 4. Mapping of LAI at St. Joseph Bay using a model trained on the patches from the selected regions.

Fig. 5. Mapping of LAI at Keeton Beach using a model trained on the patches from the selected regions.

Fig. 6. Mapping of LAI at St. George Sound using a model trained on the patches from the selected regions.

4.4 Transfer Learning with Capsule Network

We train models using all patches in the selected regions at St. Joseph Bay and apply transfer learning at Keeton Beach and St. George Sound for LAI quantification. In transfer learning, the trained models are used as a feature extractor to convert all patches at the two locations as new representations (outputs of the layer just before the regression layer in capsule network or CNN). Then, we randomly sample 50, 100, 500 and 1,000 patches (roughly balanced among the four classes) from the selected regions at the two new locations, and use the sampled seagrass patches to train a linear model for LAI mapping using their new representations. We also fine-tune the whole model using the sampled patches to optimize the representation at new locations.

To apply the transferred models to the whole images at the two new locations, we first use the randomly sampled patches to classify a given patch to one of the four classes by the 1-nearest neighbor (1-NN) rule. If it is identified as a seagrass patch, we use the trained regression model to predict its LAI. Otherwise, a ‘0’ LAI is assigned. We performed this experiment 5 times. Mean and standard deviation of accuracies of classification and LAI mapping in the selected regions are reported in Tables 2 and 3 respectively.
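The sampling and repetition protocol can be sketched as follows. The helper draws a roughly class-balanced subsample of a given size and a small wrapper repeats an experiment and reports mean and standard deviation. The function names, the way the remainder is spread over the classes, and the use of the population standard deviation (NumPy's default) are assumptions, since the text does not specify them.

```python
import numpy as np

def balanced_sample(labels, n_total, rng, n_classes=4):
    """Indices of about n_total patches, split as evenly as possible across classes."""
    base, remainder = divmod(n_total, n_classes)
    picks = []
    for c in range(n_classes):
        idx = np.flatnonzero(labels == c)
        size = min(base + (1 if c < remainder else 0), len(idx))
        picks.append(rng.choice(idx, size=size, replace=False))
    return np.concatenate(picks)

def repeat_experiment(run_once, n_repeats=5, seed=0):
    """Run `run_once(rng)` several times and return (mean, std) of its scores."""
    rng = np.random.default_rng(seed)
    scores = np.array([run_once(rng) for _ in range(n_repeats)], dtype=float)
    return float(scores.mean()), float(scores.std())
```

In this sketch, `run_once` would draw a subsample with `balanced_sample`, fit the 1-NN classifier and linear regressor on it, and return the accuracy or RMSE measured on the selected regions.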

The classification results of the capsule network are superior to those of the CNN in all scenarios (Table 2). For LAI mapping, the capsule network is always superior to the CNN at Keeton Beach, and at St. George Sound it is either much better than or comparable to the CNN after fine-tuning (Table 3).

Table 2. Classification accuracies by transfer learning with different numbers of samples from new locations. For each subsampling size, 5 experiments are conducted. Results are shown as mean ± std.

Table 3. Regression errors (RMSE) by transfer learning and fine-tuning with different numbers of subsamples from new locations. For each subsampling size, 5 experiments are conducted. Results are shown as mean ± std.

4.5 Computational Complexity

All the experiments are conducted on a desktop computer with an Intel Xeon E5-2687W v3 @ 3.10 GHz (10 cores) and 64 GB RAM. On average, training the capsule network on the selected regions of one satellite image takes 85.39 s/epoch, whereas training the CNN model takes 13.17 s/epoch. Testing the capsule network on one entire image takes about 1.5 h, while testing the CNN takes 0.42 h. Training and testing the capsule network therefore take approximately 6.5 and 3.5 times as long, respectively, as the CNN.

5 Discussions

The capsule network performed best among the models we compared for seagrass distribution quantification in coastal water. The proposed model achieves LAI quantification with an RMSE of 0.46, 0.07 and 0.11 at St. Joseph Bay, Keeton Beach and St. George Sound, respectively. On average, its RMSE is 0.02 lower than that of the convolutional neural network and 0.07 lower than that of the linear model.

Note that all three models use linear regression for LAI quantification. However, in both the CNN and the capsule network, seagrass LAI is quantified based on the new representations learned by the models. In contrast, the linear model directly uses pixel values in the multispectral image patch of size \(5\,\times \,5\,\times \,8\) to quantify LAI. Both the CNN and the capsule network outperform the direct linear model, suggesting that the representations they learn may contain more LAI-related information and less noise. In addition, the capsule network performs end-to-end learning in which classification and regression are jointly optimized, which helps refine the representation learning process and achieves even better results than the CNN.

The mappings of LAI index on whole images usually differ significantly from the mapping provided by the physics model. However, the physics model is not the ground truth and it is shown in the paper for visualization only. To compare the effectiveness of different models, we should focus on the RMSE obtained in the selected regions only as shown in Tables 1, 2 and 3. As a part of the project plan, more on-site validation will be conducted and the data obtained will be used as ground truth for model evaluation.

The seagrass vector in the new representations learned by the capsule network is supposed to represent properties of seagrass patches, so that these cleaned representations are more stable and can better quantify LAI. The capsule network achieves much better classification accuracies (Table 2) with 50 and 100 samples from the new locations. For LAI mapping, the capsule network performs much better than the CNN at Keeton Beach and slightly worse at St. George Sound, but both models achieve very good LAI mapping. The effect of fine-tuning is inconsistent. It always leaves the CNN model worse than the capsule network. For the capsule network, it makes LAI prediction slightly worse at Keeton Beach but improves the prediction at St. George Sound. Overall, the capsule network performs much better than the CNN in transfer learning (Tables 2 and 3).

6 Conclusions

We presented a capsule network model for quantification of seagrass distribution in coastal water. The proposed capsule network jointly optimizes regression and classification to learn a new representation for seagrass LAI mapping. We compared the representation learned by the capsule network with that learned by a traditional CNN model, and the capsule network showed much better performance for LAI quantification in both regular learning and transfer learning. To the best of our knowledge, this is the first attempt to apply the capsule network to seagrass quantification in the aquatic remote sensing community.