A heterogeneous double ensemble algorithm for soybean planting area extraction in Google Earth Engine

https://doi.org/10.1016/j.compag.2022.106955

Abstract

Soybeans are one of the main crops grown in the United States. Knowing the distribution of soybean cultivation areas is crucial for ensuring food security, eradicating hunger, and adjusting crop structures. However, traditional methods of extracting soybean planting areas consume considerable manpower and material resources and take a long time. The emergence of high-resolution images, such as Sentinel-2A (S2A), enables the identification of soybean at the field scale, and these images can be applied on a large scale with the support of cloud computing technology. This work proposes a heterogeneous double ensemble algorithm to extract soybean planting areas. The crop type dataset from the U.S. Department of Agriculture and the S2A dataset are used in this study. The Normalized Difference Vegetation Index (NDVI) and Normalized Difference Water Index (NDWI) calculated from S2A data are used to improve the classification accuracy. The proposed method consists of the following steps. Firstly, the S2A data are processed according to phenological information and spectral characteristics. Secondly, the texture features obtained from the gray-level co-occurrence matrix are integrated with the spectral features. Thirdly, to remove useless features and improve classification efficiency, only the important bands identified by feature importance analysis are retained for the next step. Fourthly, Random Forest (RF), Classification And Regression Tree (CART), and Support Vector Machine (SVM) classifiers serve as base classifiers and are trained on the retained features. Finally, result maps are obtained by “voting” on the three classification results. Three study areas, Guthrie in Iowa, Clinton in Indiana, and Cuming in Nebraska, are used to validate the effectiveness of the proposed method. Numerical experiments show that these propositions improve classification performance. Compared with the reference methods, the proposed algorithm increases the overall accuracy by an average of 1.4%, 3.2%, and 1.7% on the Guthrie, Clinton, and Cuming data, respectively.

Introduction

Food security is a major challenge to the sustainable development of human beings (Xin et al., 2018, Zhong et al., 2014), and the eradication of hunger is the second of the United Nations (UN) Sustainable Development Goals (SDGs) (Griggs et al., 2013). Soybean planting area accounts for a major share of the world's total arable land, ranking third among the main crop categories. The United States is the country with the largest soybean planting area and the largest export volume in the world (da Silva Junior and Leonel-Junior, 2020). American soybeans are mainly distributed in the Central Plains, and the level of mechanization and specialization of soybean planting is high. Table 1 shows the ranking of U.S. soybean acreage in 2019.

To meet future demand for food and achieve the SDGs, it is necessary to increase agricultural productivity and reduce its environmental impact, especially for crops such as soybeans and corn. Precision agriculture covers a variety of technical aspects, including satellite data, drone applications, machine learning, navigation, and communications, with a focus on supporting farmers and a healthy environment to achieve sustainability, climate-related goals, and profitability. As a result, plants have access to the right amount of water, fertilizers, and pesticides to achieve optimal yields while reducing resource use and environmental impact. In addition, precision agriculture can meet temporal requirements, helping farmers optimize the irrigation or fertilization of crops. In short, the detection of agricultural areas can give suppliers, policymakers, and governments an up-to-date overview of crop planting areas and their potential benefits (Michael et al., 2020).

However, traditional methods of obtaining crop information, such as ground surveys and sampling, are time-consuming, laborious, and costly, and they cannot provide continuous spatial distribution data of crops (Xin et al., 2018, Tian et al., 2019). Therefore, technologies such as big data and remote sensing should be applied to smart agriculture (Fountas et al., 2020) and precision agriculture (Kagan et al., 2022).

Remote sensing provides multi-spectral data, and the amount of data has been increasing, especially since the launch of the Sentinel missions. Earth observation satellites that monitor and regularly revisit farmland are a cheap and excellent data source that provides spatially complete and detailed information for crop surveying and mapping (Song et al., 2016, Gao et al., 2017). In recent decades, many studies have focused on the use of hyperspectral and multispectral data sets for crop monitoring (Xin et al., 2018, Tian et al., 2019, Peña-Barragán et al., 2011). Despite the large scale of agricultural areas and the complexity of the measurements, several studies have been able to precisely estimate soybean crop areas based on remote sensing techniques (Silva et al., 2017, Song et al., 2017). However, the monitoring of large-scale crops requires processing a significant number of orbital images, which needs complex computing infrastructure to store, manage, and process the data (big data issues) (da Silva Junior and Leonel-Junior, 2020, Shimada et al., 2014).

Machine learning classification is an effective way to classify and process remote sensing images. It includes unsupervised, supervised, and reinforcement-learning models to analyze, classify, and predict trends. Unsupervised learning provides a set of algorithms that uncover insights without additional labels. In contrast, supervised learning uses labels such that a target output is associated with various predictors. Reinforcement learning models involve agents, an environment, and their interactions to maximize rewards. Supervised learning, with its SVM, CART, and RF methods (Feng et al., 2021, Feng et al., 2019a), is used for crop type classification in this work. All classifiers operate at a scale of 10 m with numPixels set to 10000, and 2000 points are used as training samples. SVM can perform linear and non-linear classification tasks and regression analysis. Non-linear classification in particular uses a kernel that maps the input space to a high-dimensional feature space, where the input set is separated into classes or categories with a clear, maximized gap between them (Cortes et al., 1995, CristianintNello, 2000). The CART method is used in remote sensing image classification because it is fast and efficient (Rutkowski et al., 2014). The basic principle of CART is that training samples are divided into test variables (feature vectors) and target variables (actual feature types), and a binary decision tree is formed by cyclic analysis of these two variables. The RF classifier uses an ensemble of trees (Breiman, 2001). It has been widely used in recent classification studies (Markus et al., 2016, Feng et al., 2019b, Feng et al., 2019c) and achieves a higher level of accuracy than the maximum likelihood and decision tree methods (Belgiu, 2016). To address overfitting and the “curse of dimensionality”, the RF classifier randomly selects features and samples.
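To make the CART principle concrete, the following minimal Python sketch searches a single feature for the threshold that gives the lowest weighted Gini impurity, which is the split criterion commonly used by CART for classification. It is an illustration only: the function names, the feature values, and the labels are hypothetical and are not taken from this study, whose classifiers run inside GEE.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature.

    Candidate thresholds are midpoints between consecutive distinct values.
    If no split improves on the parent node, the threshold is None.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical feature values for six pixels and their crop labels.
feature = [0.21, 0.25, 0.30, 0.62, 0.70, 0.75]
crop = ["other", "other", "other", "soybean", "soybean", "soybean"]
threshold, impurity = best_split(feature, crop)
print(threshold, impurity)
```

A full CART tree repeats this search over every feature and then recurses into the left and right subsets, which is the "cyclic analysis" that produces the binary tree.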

Google Earth Engine (GEE) is a possible solution to overcome these issues. With the convergence of newly available moderate-resolution satellite imagery, new algorithm developments, and cloud computing infrastructure, computing capability has greatly improved in recent years. Google, Amazon, Microsoft, and Alibaba have introduced cloud computing products one after another. Google and Amazon have archived large catalogs of satellite imagery and provide geo-cloud computing for Earth science applications at the global scale in Google Earth Engine (GEE) and Amazon Web Services (AWS), respectively (Dong et al., 2016, Gorelick et al., 2017). GEE is a cloud computing platform that Google has provided for free since 2010. The platform provides petabytes of geospatial data spanning more than 40 years, as well as APIs in JavaScript and Python, which can be used to analyze and process the datasets. Since GEE is a cloud platform, the entire data catalog is stored in Google's data centers, which also host high-capacity, high-efficiency CPUs that can handle complex calculations for analysis. Thus, users do not need a powerful computer to perform these activities (Gorelick et al., 2017). In addition, GEE integrates traditional machine learning methods (Li et al., 2020) and provides a visual user interface. The cloud-based platform has also been widely used for mapping forests (Hansen et al., 2013), settlements and populations, African croplands (Yadav et al., 2017), crop yields (Lobell et al., 2015), Indian urban boundaries (Ran et al., 2016), land cover (Huang et al., 2017, Azzari and Lobell, 2017), and global urban areas (Liu et al., 2018), as well as for developing an algorithm for automated mapping of agricultural land (Alemayehu et al., 2017). Dong et al. used about 3290 Landsat 7 and 8 scenes in GEE, with millions of servers around the world, to identify rice-growing areas in Northeast Asia in one day, which demonstrated the power of this parallel processing and supercomputing platform for crop type surveying and mapping (Dong et al., 2016). The proposed algorithm runs on the GEE platform, relying on its computing power, and is implemented in JavaScript.

Since many remote sensing datasets are available on the GEE platform, the selection of datasets is also important. Remote sensors have their own characteristics that distinguish them from each other. Therefore, using different sensors may lead to different results, even when the research goals are the same (Novelli et al., 2016). The Moderate Resolution Imaging Spectroradiometer (MODIS) sensor, aboard the TERRA and AQUA satellites, provides high temporal resolution data in 36 spectral bands at three spatial resolutions: 250, 500, and 1000 m (Phan et al., 2019). However, using this data for crop identification may lead to inaccurate predictions because of the low spatial resolution. The Operational Land Imager (OLI) and Multispectral Instrument (MSI) sensors, carried by the Landsat-8 (L8) and S2 satellites, have finer spatial resolutions of 30 m and 10 m, respectively. Therefore, Sentinel-2 data is selected as the research data in this study (da Silva Junior and Leonel-Junior, 2020).

The S2 data covers 13 spectral bands with a swath width of 290 km. The satellites acquire images with a spatial resolution of 10 m (blue, green, red, and NIR bands) and 20 m (Red Edge 1, Red Edge 2, Red Edge 3, Red Edge 4, SWIR1, and SWIR2 bands). The revisit period of one satellite is 10 days, and that of the two satellites combined is 5 days, which opens up a new way for crop-specific monitoring at the plot level. A spatial resolution of 10 to 20 m can resolve a single field in many regions (Graesser and Ramankutty, 2017). A relatively short revisit period can provide more detailed phenological information related to individual crop types. In addition, the spectral range includes several red-edge bands, which may help distinguish fairly subtle differences between crop types with similar morphology (Griffiths et al., 2019). It has been shown that the red-edge bands of S2 can effectively distinguish between corn and soybeans (You and Dong, 2020). With its satellites and sensors, S2 data has driven innovation in fields such as agriculture and geology and has a wide range of applications.
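The 10 m green, red, and NIR bands listed above are the usual inputs for the NDVI and NDWI indexes that this work calculates from S2A data. The sketch below shows the normalized-difference arithmetic in plain Python. NDVI is written in its standard form, (NIR - Red) / (NIR + Red), using the Sentinel-2 band names B8 and B4. For NDWI, the green/NIR form (B3 and B8) is assumed here, since the text does not state which NDWI variant is used. The reflectance values are hypothetical.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def ndwi(green, nir):
    """NDWI in the green/NIR form (an assumption; the variant is not stated)."""
    return normalized_difference(green, nir)


# Hypothetical surface reflectance of one vegetated pixel, keyed by S2 band name.
pixel = {"B3": 0.08, "B4": 0.05, "B8": 0.45}
ndvi_value = ndvi(pixel["B8"], pixel["B4"])
ndwi_value = ndwi(pixel["B3"], pixel["B8"])
print(round(ndvi_value, 3), round(ndwi_value, 3))
```

Dense green vegetation reflects strongly in the NIR and weakly in the red, so its NDVI is high while its green/NIR NDWI is negative, which is why the two indexes can add information beyond the raw bands.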

The contributions of this work include: (1) Optimal feature combination. Based on all bands of the original S2A image data, the NDVI and NDWI remote sensing indexes are calculated in GEE. Spectral features are selected according to the phenological information. The texture feature calculation function of the GEE function library is used to construct the texture features. (2) Feature importance analysis. The “explain()” function of the GEE library is used to analyze the importance of band features so that only high-quality features are retained. (3) Heterogeneous double ensemble algorithm. Three classifiers, RF, CART, and SVM, are used to obtain the soybean planting area distribution in the study areas. The final result is obtained by “voting” on these three results. This paper aims to quickly complete remote sensing data collection and preprocessing, classifier testing, and crop spatial distribution information extraction on the GEE cloud platform. It addresses the problems of traditional remote sensing classification with a single classifier, especially the time-consuming preprocessing required for large-scale regional remote sensing data collection. It also improves the efficiency of information processing and addresses the problem that, when using satellite data with multi-spectral, temporal, and spatial resolution (such as S2), the effective classification features are not well documented, which leads to an insufficient understanding of feature performance, to important features being missed, or to irrelevant features being included.
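The “voting” step in contribution (3) can be sketched as a per-pixel majority vote over the label maps produced by the three base classifiers. The Python below is a minimal illustration, not the authors' GEE implementation: the function name, the flattened label lists, and the tie-breaking rule (prefer the earliest classifier in the argument list) are assumptions made for this example.

```python
from collections import Counter


def majority_vote(*label_maps):
    """Combine per-pixel labels from several classifiers by majority vote.

    Each argument is a list of labels, one per pixel, all of equal length.
    Ties are resolved in favor of the earliest classifier in the argument list.
    """
    if not label_maps:
        raise ValueError("at least one classification result is required")
    length = len(label_maps[0])
    if any(len(m) != length for m in label_maps):
        raise ValueError("all classification results must cover the same pixels")
    fused = []
    for votes in zip(*label_maps):
        counts = Counter(votes)
        top = max(counts.values())
        # Take the first label, in classifier order, that reaches the top count.
        fused.append(next(v for v in votes if counts[v] == top))
    return fused


# Hypothetical labels for five pixels (1 = soybean, 0 = other).
rf_map = [1, 1, 0, 0, 1]
cart_map = [1, 0, 0, 1, 1]
svm_map = [0, 1, 0, 1, 0]
result_map = majority_vote(rf_map, cart_map, svm_map)
print(result_map)
```

With three classifiers and two classes a strict majority always exists, so the tie rule only matters if the sketch is reused with more classes or an even number of classifiers.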

The structure of this paper is as follows. Section 2 introduces the study areas and remote sensing data sets used in this paper. Section 3 describes the proposed algorithm in detail. Section 4 presents the classification maps and accuracy results, including result comparison and feature importance analysis. Section 5 shows the advantages and disadvantages of the algorithm. The concluding remarks are given in Section 6.

Study area

The high-value areas of soybean planting in the United States are mainly distributed in the Mississippi River Basin and the Missouri River Basin in the Central Plains. Based on the data in Table 1, we choose areas with high soybean yield and dense planting as the study areas. To demonstrate the generality of the algorithm, three groups of study areas, comprising six areas, are selected. They are located in Iowa, Indiana, and Nebraska. In each dataset, one part is used for training and

Methodology

The methodology workflow used in this research is shown in Fig. 3. It is composed of four main steps: extraction of phenology and spectral features, analysis of texture features, supervised classification, and accuracy assessment.
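The last step of the workflow, accuracy assessment, is typically based on a confusion matrix, and the overall accuracy reported in this paper is the share of correctly classified samples. The following Python sketch shows that calculation under stated assumptions: the function names and the six validation labels are hypothetical, and rows of the matrix are taken to be reference classes while columns are predicted classes.

```python
def confusion_matrix(reference, predicted, classes):
    """Build a confusion matrix with reference classes as rows, predictions as columns."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for ref, pred in zip(reference, predicted):
        matrix[index[ref]][index[pred]] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical validation samples.
classes = ["soybean", "other"]
reference = ["soybean", "soybean", "other", "other", "soybean", "other"]
predicted = ["soybean", "other", "other", "other", "soybean", "soybean"]
matrix = confusion_matrix(reference, predicted, classes)
accuracy = overall_accuracy(matrix)
print(matrix, round(accuracy, 3))
```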

Experimental results

The purpose of the algorithm is to use the S2 image collection combined with phenological information, spectral features, and texture features to generate a pixel-based soybean result map in GEE. Hereinafter, the algorithm in this paper is called PSTF. In this section, we mainly compare the results obtained before and after applying the PSTF algorithm. The input bands of the base classifiers before applying the PSTF algorithm are the B1, B2, B3, B4, B5, B6, B7, B8, B8A, B9, B11, and B12 bands of S2.

Discussion

In this study, we proposed a heterogeneous double ensemble method based on phenology, spectral information, and texture features on the GEE platform for soybean classification. As a supplementary benefit, the training of the random forest provides measurements of feature importance, which are used for feature selection, and ensemble learning (Feng et al., 2020) also plays an important role. The results for the three study areas demonstrate the effectiveness of the proposed method.
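Feature selection from importance measurements can be sketched as ranking the features by score and keeping the top few. In the Python below, the importance scores are hypothetical placeholders rather than values from this study, where they come from the trained classifier in GEE, and the function name and the cutoff of three features are assumptions made for illustration.

```python
def select_features(importance, keep):
    """Return the names of the `keep` features with the highest importance scores."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:keep]]


# Hypothetical importance scores keyed by band or index name.
importance = {
    "B1": 1.2,
    "B2": 3.1,
    "B8": 9.4,
    "B11": 7.8,
    "NDVI": 12.7,
    "NDWI": 8.5,
}
selected = select_features(importance, keep=3)
print(selected)
```

Only the selected names would then be passed to the base classifiers, which reduces the number of input bands without retraining the ranking itself.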

  • (1)

    Not only is a large amount of

Conclusion

In the context of using machine learning classifiers, specifically the three classifiers RF, CART, and SVM, to extract soybean planting areas, we proposed a heterogeneous double ensemble algorithm and studied two propositions. The algorithm is evaluated on three multispectral data sets with publicly available ground truths, namely Guthrie, Clinton, and Cuming. The first proposition is to initially process the bands based on soybean phenological information and spectral features. Two indexes are constructed,

Data availability statement

The data used in this study are Sentinel-2A and USDA NASS Cropland Data Layers, which are derived from public domain resources. The data that support the findings of this study are available in the “Earth Engine Data Catalog” of Google Earth Engine at https://developers.google.com/earthengine/datasets/catalog/COPERNICUS_S2_SR and https://developers.google.com/earthengine/datasets/catalog/USDA_NASS_CDL?hl=en.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61772397, 12005159), the Natural Science Basic Research Program of Shaanxi (2021JQ-074), the research program on major theoretical and practical issues in Shaanxi social sciences in 2020 (20ST-81), and the Science Technology and Development Project of Yulin Science and Technology Bureau (CXY-2020-094).

References (51)

  • Q. Li et al. Trend and forecasting of the COVID-19 outbreak in China. J. Infect. (2020)
  • X. Liu et al. High-resolution multi-temporal mapping of global urban land using Landsat images based on the Google Earth Engine platform. Remote Sens. Environ. (2018)
  • D.B. Lobell et al. A scalable satellite-based crop yield mapper. Remote Sens. Environ. (2015)
  • R.S. Lunetta et al. Land-cover change detection using multi-temporal MODIS NDVI data. Remote Sens. Environ. (2006)
  • B. Melville et al. Object-based random forest classification of Landsat ETM+ and WorldView-2 satellite imagery for mapping lowland native grassland communities in Tasmania, Australia. Int. J. Appl. Earth Obs. Geoinf. (2018)
  • A. Novelli et al. Performance evaluation of object based greenhouse detection from Sentinel-2 MSI and Landsat 8 OLI data: A case study from Almería (Spain). Int. J. Appl. Earth Obs. Geoinf. (2016)
  • J.M. Peña-Barragán et al. Object-based crop identification using multiple vegetation indices, textural features and crop phenology. Remote Sens. Environ. (2011)
  • L. Rutkowski et al. The CART decision tree for mining data streams. Inf. Sci. (2014)
  • M. Shimada et al. New global forest/non-forest maps from ALOS PALSAR data (2007-2010). Remote Sens. Environ. (2014)
  • X.-P. Song et al. National-scale soybean mapping and area estimation in the United States using medium resolution satellite imagery and field survey. Remote Sens. Environ. (2017)
  • N. You et al. Examining earliest identifiable timing of crops using all available Sentinel 1/2 imagery and Google Earth Engine. ISPRS J. Photogramm. Remote Sens. (2020)
  • L. Zhong et al. Efficient corn and soybean mapping with temporal extendability: A multi-year experiment using Landsat imagery. Remote Sens. Environ. (2014)
  • Alemayehu, M., Felix, H., Savory, D.J., Ricardo, A.P., Gething, P.W., Adam, B., Sturrock, H., S.G.J.-P., 2017. Mapping...
  • G. Azzari et al. Landsat-based classification in the cloud: An opportunity for a paradigm shift in land cover monitoring. Remote Sens. Environ. (2017)
  • L. Baetens et al. Validation of Copernicus Sentinel-2 cloud masks obtained from MAJA, Sen2Cor, and FMask processors using reference cloud masks generated with a supervised active learning procedure. Remote Sens. (2019)

    Shuo Wang and Wei Feng contributed equally to this work.
