Wild blueberry yield prediction using a combination of computer simulation and machine learning algorithms

https://doi.org/10.1016/j.compag.2020.105778

Highlights:

  • The study provides inspiration to use simulation data for yield predictive modeling.

  • The proposed machine learning approach achieved a coefficient of determination (R²) of 0.938.

  • Interactions between bee species composition and weather are significant in yield variability.

  • Prediction showed that wet rainy springs will greatly reduce blueberry yield in the future.

Abstract

The most challenging task in the agricultural sector is to accurately predict crop yield. A typical machine learning algorithm uses real data to predict crop yield. In this study, we instead used data generated by the Wild Blueberry Pollination Model, a spatially explicit simulation model validated by field observation and experimental data collected in Maine, USA, during the last 30 years. The main aim of this study is to evaluate the relative importance of bee species composition and weather factors in regulating wild blueberry agroecosystems. Specifically, we sought to reveal how bee species composition and weather affect yield, and to predict the bee species composition and weather conditions that achieve the best yield, using computer simulation and machine learning algorithms. Multiple linear regression (MLR), boosted decision trees (BDT), random forest (RF), and extreme gradient boosting (XGBoost) were evaluated as predictive tools. We also performed predictor selection before submitting our data to the learning algorithms, which reduced the dimension of the input without a significant drop in prediction accuracy. As a result, clone size, honeybee, bumblebee, Andrena bee species, Osmia bee species, maximum of upper-temperature ranges, and the number of days with precipitation were chosen as the best predictor variable subset. XGBoost outperformed the other algorithms in all measures of model performance for predicting wild blueberry yield, achieving a coefficient of determination (R²) of 0.938, root mean square error (RMSE) of 343.026, mean absolute error (MAE) of 206, and relative root mean square error (RRMSE) of 5.444%. The results are consistent with previous work by Zaman et al. (2008) on predicting wild blueberry fruit yield using digital color photography. This study showed that crop yield predictions can be based on computer simulation modeling datasets. Therefore, if a reasonable prediction can be reached, this study should have a significant impact, especially when data collection in the field is challenging.
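The four performance measures named in the abstract can be illustrated with a minimal sketch. The yields below are made up for illustration and are not data from the study, and the RRMSE definition used here (RMSE divided by the mean observed value, as a percentage) is an assumption, since this excerpt does not give the formula.

```python
import math


def regression_metrics(observed, predicted):
    """Return R^2, RMSE, MAE and RRMSE (%) for paired observations and predictions."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)               # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": rmse,
        "mae": sum(abs(r) for r in residuals) / n,
        # Assumption: RRMSE is RMSE relative to the mean observed value.
        "rrmse_percent": 100.0 * rmse / mean_obs,
    }


# Made-up yields, for illustration only (not data from the study).
observed = [5000.0, 6000.0, 7000.0, 8000.0]
predicted = [5100.0, 5900.0, 7200.0, 7800.0]
metrics = regression_metrics(observed, predicted)
```

RMSE and MAE are in the units of the yield itself, whereas R² and RRMSE are unitless, which is why the latter two are easier to compare across crops or datasets.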

Introduction

The wild blueberry, also known as lowbush blueberry (Vaccinium angustifolium Aiton), is the predominant species in a complex of blueberry species cultivated in Maine, which is the largest producer of wild blueberry in the United States and accounts for about 97% of total U.S. wild blueberry production (Jones et al., 2014; Strik and Yarborough, 2005). The wild blueberry is one of only a few commercially grown crops native to North America. This crop is not planted. It is a complex of five ericaceous plant species that are natural forest understory plants. Forests are harvested and the blueberry plants are then managed to produce fruit (Drummond, 2019a). The productivity of this crop is largely influenced by cross-pollination, which requires bees (Asare et al., 2017; Drummond, 2016). Wild blueberry yield does not always have a continuous non-linear relationship with bee density (Asare et al., 2017); it may also be subject to variation in weather, climate warming, soil fertility, pests and disease, and other temporal and spatial abiotic and biotic factors (Tasnim et al., 2020; Aras et al., 1996).

Accurate prediction of crop yield is among the greatest and most challenging tasks in precision agriculture. Extensive research is underway in agriculture to better predict crop yield using machine learning algorithms (Chlingaryan et al., 2018; Crane-Droesch, 2018; Jeong et al., 2016). Many machine learning algorithms require large amounts of data to provide reliable results (Johnson et al., 2016). One of the major challenges in training and experimenting with machine learning algorithms is the availability of training data in sufficient quality and quantity. For most studies, this remains a limiting factor.

One way to overcome the problem of collecting large training datasets for machine learning algorithms is to generate data using computer simulation modeling techniques. The Naval Postgraduate School's (NPS) Simulation, Experiments and Efficient Design (SEED) Center for Data Farming defines data farming as the process of using simulations and computer modeling to compile data sets. As such, data farming seeks to provide decision-makers with insights into complex issues by using simulations to produce data (Tolk, 2015). Hence, for this study, we used a previously validated simulation model of wild blueberry pollination (Qu and Drummond, 2018) and designed computational experiments to ‘grow’ training data, which can then be used to train and validate machine learning algorithms.
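As a rough illustration of data farming, the sketch below runs a full factorial design through a toy simulator and collects the labeled outputs as training rows. Everything in it is hypothetical: the factor names and levels, the response function in `toy_simulator`, and the noise level are invented stand-ins and do not reproduce the Wild Blueberry Pollination Model or the study's experimental design.

```python
import itertools
import random

# Hypothetical factor levels; the names and values are illustrative only.
FACTOR_LEVELS = {
    "honeybee_density": [0.0, 0.25, 0.5],
    "bumblebee_density": [0.0, 0.25, 0.5],
    "rain_days": [1, 16, 34],
}


def toy_simulator(honeybee_density, bumblebee_density, rain_days, rng):
    """Stand-in for a real simulation model: an invented response plus noise."""
    pollination = 1.0 - (1.0 - honeybee_density) * (1.0 - bumblebee_density) ** 2
    weather_penalty = 1.0 - rain_days / 60.0
    return 8000.0 * pollination * weather_penalty + rng.gauss(0.0, 50.0)


def farm_data(factor_levels, replicates, seed=0):
    """Run every factor combination `replicates` times and collect labeled rows."""
    rng = random.Random(seed)  # fixed seed so the farmed dataset is reproducible
    names = list(factor_levels)
    rows = []
    for combo in itertools.product(*(factor_levels[name] for name in names)):
        inputs = dict(zip(names, combo))
        for _ in range(replicates):
            rows.append({**inputs, "yield": toy_simulator(**inputs, rng=rng)})
    return rows


# 3 x 3 x 3 factor combinations, 4 stochastic replicates each.
dataset = farm_data(FACTOR_LEVELS, replicates=4)
```

Because every row is produced by the simulator, each one carries an exact label, and the dataset can be enlarged simply by adding levels or replicates.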

Machine learning algorithms trained on a simulation model are called meta-models (Simpson et al., 2001). They have two clear advantages over either machine learning models trained directly on empirical datasets or computationally intensive simulation models run for direct predictions. On one hand, meta-models are surrogates for simulation models, which represent detailed causalities of interacting ecological processes that help to predict system behaviors precisely (Drummond et al., 2003; Puntel et al., 2019); meta-models can learn patterns of connections among the inputs and outputs of the original simulation model and can extrapolate across varying temporal and spatial scales (Fienen et al., 2015). On the other hand, meta-modeling techniques have been widely accepted by the scientific community for constructing predictive models, because in most cases it is more practical to use a real-time meta-model to make predictions than to run the often much slower (higher cost) simulation model, provided the difference in prediction accuracy between the two is within an acceptable range (e.g., less than 20% error) (Fienen et al., 2015).
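The meta-model idea can be sketched in a few lines: fit a cheap surrogate to a handful of simulation runs, then check on held-out inputs whether it stays within the acceptable error range. The `slow_simulation` function and its coefficients are invented for illustration, and the surrogate here is a one-variable least-squares line rather than any of the algorithms evaluated in the study; the 20% threshold is the example figure quoted above.

```python
def slow_simulation(bee_density):
    """Hypothetical stand-in for an expensive simulation run (invented response)."""
    return 2000.0 + 9000.0 * bee_density - 1000.0 * bee_density ** 2


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Train the meta-model on a handful of simulation runs.
train_x = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
train_y = [slow_simulation(x) for x in train_x]
intercept, slope = fit_line(train_x, train_y)


def meta_model(bee_density):
    """Fast surrogate that approximates slow_simulation."""
    return intercept + slope * bee_density


# Compare surrogate and simulation on inputs that were not used for fitting.
test_x = [0.05, 0.25, 0.45]
relative_errors = [
    abs(meta_model(x) - slow_simulation(x)) / abs(slow_simulation(x)) for x in test_x
]
acceptable = max(relative_errors) < 0.20  # example threshold from the text
```

Once the surrogate passes this check, predictions can be served from `meta_model` alone, and the simulation only needs to be rerun when the input range or the model itself changes.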

Simulation data is increasingly being used by researchers and companies for machine learning applications, especially in computer vision, where a model is trained on a simulation-generated dataset (Bohn et al., 2013). Efforts have been made to construct general-purpose simulated data generators to enable data science experiments (Patki et al., 2016). In general, simulated data has several advantages: (1) it is fast and usually inexpensive to produce as much data as needed once the simulation environment is ready; (2) simulation data can have perfectly accurate labels, including labeling that may be very expensive or impossible to obtain by hand; (3) the simulation environment can be modified to improve the model and training; (4) simulation data can be used as a substitute for certain real data segments that contain sensitive information; and (5) it is useful in cases where the generation of training data involves expensive sample acquisition or training data cannot be obtained in sufficient quantity for ethical reasons (Dahmen et al., 2019).

Many researchers have used linear regression models to predict crop yield (Ji et al., 2007; Matsumura et al., 2015; Zaman et al., 2008; Zhang et al., 2019). As far as we know, the only published study that used simulation data to train machine learning algorithms to predict crop productivity is that of Shahhosseini et al. (2019). These authors used the APSIM (Agricultural Production Systems SiMulator; Holzworth et al., 2014) cropping systems model to generate maize yield and nitrogen loss data for seven locations in the US Midwest. Simulations were conducted for 5–7 years and several management treatments, resulting in more than 3 million data points representing maize yields and nitrogen losses.

Our research objective was to develop a predictive model with the assistance of computer simulation and machine learning algorithms. Once we determined the best model to predict yield, our study aimed to address three scientific objectives. First, we conducted simulations and prescreened the simulated data used for developing predictive models of wild blueberry yield. Second, we wanted to determine the important factors that predict yield most effectively. Third, we elucidated, through sensitivity analysis, the optimal bee species composition and weather conditions that predict the best yield estimates compared to the actual simulation derived yields. Furthermore, our goal was to determine the most robust model for yield prediction by comparing traditional and modern machine learning algorithms while at the same time using a minimal number of features.

Data generation

The study of predictive models for blueberry yield prediction requires data that sufficiently characterize the influence of plant spatial traits, bee species composition, and weather conditions on production. In a multi-step process, we designed simulation experiments and conducted the runs on the calibrated version of the blueberry simulation model. The simulated dataset was then examined, and critical features were selected to build four machine-learning-based predictive models.

Results

We carried out several experiments to: (1) evaluate the strength of predictive models comparing traditional and modern machine learning techniques; (2) identify important factors that affect yield most; (3) assess the effects of bee composition and weather conditions on yield; and (4) seek optimal bee composition and weather conditions that achieve the highest predicted yield.
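Experiments (3) and (4) can be pictured with a small sketch: vary one factor at a time around a baseline to gauge its effect, and search a grid of candidate settings for the highest predicted yield. The `predicted_yield` function, the factor names, and the grid values are all hypothetical; in practice the trained predictive model would take the place of this invented surrogate, and the study's actual procedure is not described in this excerpt.

```python
import itertools


def predicted_yield(honeybee, bumblebee, rain_days):
    """Hypothetical surrogate with an invented response surface."""
    pollination = 1.0 - (1.0 - honeybee) * (1.0 - bumblebee) ** 2
    return 8000.0 * pollination * (1.0 - rain_days / 60.0)


# Hypothetical candidate levels for each factor.
GRID = {
    "honeybee": [0.0, 0.25, 0.5, 0.75],
    "bumblebee": [0.0, 0.25, 0.5],
    "rain_days": [1, 16, 34],
}


def best_setting(model, grid):
    """Exhaustively search the grid for the setting with the highest prediction."""
    names = list(grid)
    candidates = (
        dict(zip(names, combo))
        for combo in itertools.product(*(grid[name] for name in names))
    )
    return max(candidates, key=lambda setting: model(**setting))


def one_at_a_time(model, baseline, grid):
    """Range of predicted yield when each factor alone is varied from the baseline."""
    effects = {}
    for name, levels in grid.items():
        outputs = [model(**{**baseline, name: level}) for level in levels]
        effects[name] = max(outputs) - min(outputs)
    return effects


optimum = best_setting(predicted_yield, GRID)
baseline = {"honeybee": 0.25, "bumblebee": 0.25, "rain_days": 16}
effects = one_at_a_time(predicted_yield, baseline, GRID)
```

One-at-a-time ranges ignore interactions between factors, so they are only a first look; the exhaustive grid search does account for interactions but grows quickly with the number of factors and levels.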

Model development and interpretation

This research has investigated an integrative method for crop yield prediction, which overcomes the limitations of approaches using either empirical modeling or computer simulation. Empirical modeling enables quick prediction. However, when the predicted values of important predictors go beyond the range that was used for model calibration, the validity of future extrapolation is questionable (Soltani et al., 2016). In contrast with empirical models, computer simulations which are based on

Conclusions

We investigated four machine learning algorithms: multiple linear regression, boosted decision tree, random forest, and extreme gradient boosting algorithms to develop predictive models for wild blueberry yield. The input dataset was developed from a simulation model of wild blueberry pollination. This model takes advantage of field data accumulated over 30 years of wild blueberry pollination research in Maine and the Canadian Maritimes (Qu and Drummond, 2018). The performances of the

CRediT authorship contribution statement

Efrem Yohannes Obsie: Methodology, Software, Formal analysis, Writing - original draft. Hongchun Qu: Conceptualization, Methodology, Formal analysis, Writing - original draft, Writing - review & editing, Supervision, Project administration, Funding acquisition. Francis Drummond: Conceptualization, Investigation, Supervision, Resources, Formal analysis, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We acknowledge funding by the National Natural Science Foundation of China (61871061) and Program of Basic Research and Frontier Technology of Chongqing Municipal Science Commission (cstc2017jcyjAX0453) to HQ.

References (56)

  • H. Qu et al. A spatially explicit agent-based simulation platform for investigating effects of shared pollination service on ecological communities. Simul. Model. Pract. Theory (2013)

  • L. Zhang et al. Using boosted tree regression and artificial neural networks to forecast upland rice yield under climate change in Sahel. Comput. Electron. Agric. (2019)

  • P. Aras et al. Effect of a honey bee (Hymenoptera: Apidae) gradient on the pollination and yield of lowbush blueberry. J. Econ. Entomol. (1996)

  • E. Asare et al. Economic risk of bee pollination in Maine wild blueberry, Vaccinium angustifolium. J. Econ. Entomol. (2017)

  • D.J. Bell et al. Yield variation among clones of lowbush blueberry as a function of genetic similarity and self-compatibility. J. Am. Soc. Hortic. Sci. (2010)

  • L. Breiman. Random forests. Mach. Learn. (2001)

  • S. Chakraborty et al. Weather-based prediction of anthracnose severity using artificial neural network models. Plant Pathol. (2004)

  • T. Chen et al. Xgboost: A scalable tree boosting system, in

  • A. Crane-Droesch. Machine learning methods for crop yield prediction and climate change impact assessment in agriculture. Environ. Res. Lett. (2018)

  • T. Dahmen et al. Digital reality: a model-based approach to supervised learning from synthetic data. AI Perspect. (2019)

  • E.D. De Wolf et al. Neural network classification of tan spot and Stagonospora blotch infection periods in a wheat field environment. Phytopathology (2000)

  • H. Drucker. Improving regressors using boosting techniques. ICML. (1997)

  • F. Drummond. Reproductive biology of wild blueberry (Vaccinium angustifolium Aiton). Agriculture (2019)

  • F. Drummond. Factors That Affect Yield in Wild Blueberry, (Vaccinium Angustifolium Aiton). Agric. Res. Technol. Open Access J. (2019)

  • F.A. Drummond. Behavior of bees associated with the wild blueberry agro-ecosystem in the USA. Int. J. Entomol. Nematol. (2016)

  • F.A. Drummond et al. A natural history of change in native bees associated with lowbush blueberry in Maine. Northeast. Nat. (2017)

  • F.A. Drummond et al. The ecology of autogamy in wild blueberry (Vaccinium angustifolium Aiton): Does the early clone get the bee? Agron (2020)

  • S.T. Drummond et al. Statistical and neural methods for site–specific yield prediction. Trans. ASAE (2003)