Wild blueberry yield prediction using a combination of computer simulation and machine learning algorithms
Introduction
The wild blueberry, also known as lowbush blueberry (Vaccinium angustifolium Aiton), is the predominant species in a complex of blueberry species cultivated in Maine, which is the largest producer of wild blueberry in the United States and accounts for about 97% of total U.S. wild blueberry production (Jones et al., 2014; Strik and Yarborough, 2005). The wild blueberry is one of only a few commercially grown crops native to North America. This crop is not planted. It is a complex of five ericaceous plant species that are natural forest understory plants. Forests are harvested and the blueberry plants are then managed to produce fruit (Drummond, 2019a). The productivity of this crop is largely influenced by cross-pollination, which requires bees (Asare et al., 2017; Drummond, 2016). Wild blueberry yield does not always follow a continuous non-linear relationship with bee density (Asare et al., 2017); it may also be subject to variation in weather, climate warming, soil fertility, pests and disease, and other temporal and spatial abiotic and biotic factors (Tasnim et al., 2020; Aras et al., 1996).
Accurate prediction of crop yield is among the most important and most challenging tasks in precision agriculture. Extensive research is underway in agriculture to better predict crop yield using machine learning algorithms (Chlingaryan et al., 2018; Crane-Droesch, 2018; Jeong et al., 2016). Many machine learning algorithms require large amounts of data to provide reliable results (Johnson et al., 2016). One of the major challenges in training and experimenting with machine learning algorithms is the availability of training data of sufficient quality and quantity. For most studies, this remains a limiting factor.
One way to overcome the problem of collecting large training datasets for machine learning algorithms is to generate data using computer simulation modeling techniques. The Naval Postgraduate School's (NPS) Simulation, Experiments and Efficient Design (SEED) Center for Data Farming defines data farming as the process of using simulations and computer modeling to compile data sets. As such, data farming seeks to provide decision-makers with insights into complex issues by using simulations to produce data (Tolk, 2015). Hence, for this study, we used a previously validated simulation model of wild blueberry pollination (Qu and Drummond, 2018) and designed computational experiments to ‘grow’ training data, which can then be used to train and validate machine learning algorithms.
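The idea of 'growing' training data can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' model: `toy_pollination_sim` is a hypothetical stand-in for the validated simulation, and the factor names, levels, and coefficients are invented for the example. The sketch runs a full-factorial design with replicates and collects one labeled row per simulation run.

```python
import itertools
import random

def toy_pollination_sim(bee_density, max_temp_c, rain_days, rng):
    """Hypothetical stand-in for a pollination simulation (NOT the authors' model).

    Returns a made-up yield figure: saturating in bee density, reduced by
    heat above 25 C and by rainy days, plus stochastic run-to-run noise.
    """
    pollination = bee_density / (bee_density + 0.5)   # saturating response
    heat_penalty = max(0.0, max_temp_c - 25.0) * 0.02
    rain_penalty = rain_days * 0.01
    noise = rng.gauss(0.0, 0.02)
    return max(0.0, 8000.0 * (pollination - heat_penalty - rain_penalty + noise))

def farm_data(levels, replicates, seed=0):
    """Run every combination of factor levels `replicates` times."""
    rng = random.Random(seed)          # fixed seed keeps the design reproducible
    names = list(levels)
    rows = []
    for combo in itertools.product(*(levels[n] for n in names)):
        for _ in range(replicates):
            row = dict(zip(names, combo))
            row["yield"] = toy_pollination_sim(*combo, rng)
            rows.append(row)
    return rows

# Assumed factor levels, for illustration only.
LEVELS = {
    "bee_density": [0.0, 0.25, 0.5, 1.0, 2.0],
    "max_temp_c": [20.0, 25.0, 30.0],
    "rain_days": [1, 8, 16],
}

dataset = farm_data(LEVELS, replicates=4)
print(len(dataset), "simulated rows; first:", dataset[0])
```

Because every row is produced by the simulator, each label is known exactly, and the size of the dataset is limited only by the number of runs one can afford.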
Machine learning algorithms trained on the output of a simulation model are called meta-models (Simpson et al., 2001). They have two clear advantages over either machine learning models trained directly on empirical datasets or computationally intensive simulation models run for direct predictions. On the one hand, meta-models are surrogates for simulation models, which represent the detailed causalities of interacting ecological processes that help to predict system behavior precisely (Drummond et al., 2003; Puntel et al., 2019); meta-models can learn the patterns connecting the inputs and outputs of the original simulation model and can extrapolate across varying temporal and spatial scales (Fienen et al., 2015). On the other hand, meta-modeling techniques have been widely accepted by the scientific community for constructing predictive models because, in most cases, it is more practical to make predictions with a real-time meta-model than to run the often much slower (higher-cost) simulation model, provided the difference in prediction accuracy between the two is within an acceptable range (e.g., less than 20% error) (Fienen et al., 2015).
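A meta-model in this sense can be sketched as follows. The example is a hedged illustration only: `toy_simulator` is a hypothetical stand-in for a slow simulation, its coefficients are invented, and ordinary least squares is used as the surrogate simply because it is easy to write with the standard library. The surrogate is fitted to simulator output and then checked against fresh simulator runs using the 20% error figure that the text gives as an example of an acceptable range.

```python
import random

def toy_simulator(bee_density, max_temp_c, rng):
    """Hypothetical 'slow' simulation (NOT the authors' model)."""
    return (2000.0 + 3000.0 * bee_density ** 0.5
            - 40.0 * (max_temp_c - 20.0) + rng.gauss(0.0, 50.0))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                      # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_linear(xs, ys):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    design = [[1.0] + list(x) for x in xs]            # prepend intercept column
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

def predict(beta, x):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def sample(n, rng):
    """Draw n random input settings and run the simulator on each."""
    xs = [(rng.uniform(0.2, 1.0), rng.uniform(20.0, 30.0)) for _ in range(n)]
    ys = [toy_simulator(b, t, rng) for b, t in xs]
    return xs, ys

rng = random.Random(42)
train_x, train_y = sample(200, rng)
test_x, test_y = sample(50, rng)

beta = fit_linear(train_x, train_y)
# Mean absolute percentage error of the surrogate against held-out simulator runs.
mape = sum(abs(predict(beta, x) - y) / abs(y)
           for x, y in zip(test_x, test_y)) / len(test_y)

ACCEPTABLE_ERROR = 0.20  # the 20% figure the text gives as an example
print(f"meta-model MAPE vs. simulator: {mape:.1%}")
print("acceptable surrogate" if mape < ACCEPTABLE_ERROR else "keep the simulator")
```

Once fitted, `predict` costs a handful of multiplications per query, which is the practical appeal of a surrogate when each simulator run is expensive.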
Simulation data is increasingly being used by researchers and companies for machine learning applications, especially in computer vision, where a model is trained on a simulation-generated dataset (Bohn et al., 2013). Efforts have been made to construct general-purpose simulated data generators to enable data science experiments (Patki et al., 2016). In general, simulated data has several advantages: (1) it is fast and usually inexpensive to produce as much data as needed once the simulation environment is ready; (2) simulation data can have perfectly accurate labels, including labeling that may be very expensive or impossible to obtain by hand; (3) the simulation environment can be modified to improve the model and training; (4) simulation data can be used as a substitute for certain real data segments that contain sensitive information; and (5) it is useful in cases where the generation of training data involves expensive sample acquisition or training data cannot be obtained in sufficient quantity for ethical reasons (Dahmen et al., 2019).
Many researchers have used linear regression models to predict crop yield (Ji et al., 2007; Matsumura et al., 2015; Zaman et al., 2008; Zhang et al., 2019). As far as we know, the only published study that used simulation data to train machine learning algorithms to predict crop productivity is that of Shahhosseini et al. (2019). These authors used the APSIM (Agricultural Production Systems SiMulator; Holzworth et al., 2014) cropping systems model to generate maize yield and nitrogen loss data for seven locations in the US Midwest. Simulations were conducted for 5–7 years and several management treatments, resulting in more than 3 million data points representing maize yields and nitrogen losses.
Our research objective was to develop a predictive model with the assistance of computer simulation and machine learning algorithms. Once we had determined the best model to predict yield, our study aimed to address three scientific objectives. First, we conducted simulations and prescreened the simulated data used for developing predictive models of wild blueberry yield. Second, we wanted to determine the important factors that predict yield most effectively. Third, we elucidated, through sensitivity analysis, the optimal bee species composition and weather conditions that predict the best yield estimates compared to the actual simulation-derived yields. Furthermore, our goal was to determine the most robust model for yield prediction by comparing traditional and modern machine learning algorithms while using a minimal number of features.
Section snippets
Data generation
The study of predictive models for blueberry yield prediction requires data that sufficiently characterize the influence of plant spatial traits, bee species composition, and weather conditions on production. In a multi-step process, we designed simulation experiments and conducted the runs on the calibrated version of the blueberry simulation model. The simulated dataset was then examined, and critical features were selected to build four machine-learning-based predictive models.
Results
We carried out several experiments to: (1) evaluate the strength of predictive models comparing traditional and modern machine learning techniques; (2) identify important factors that affect yield most; (3) assess the effects of bee composition and weather conditions on yield; and (4) seek optimal bee composition and weather conditions that achieve the highest predicted yield.
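Experiments (3) and (4) above can be sketched with a one-at-a-time sensitivity scan and an exhaustive grid search. This is a minimal illustration under stated assumptions: `predict_yield` is a hypothetical stand-in for a trained yield model, and the factor names, grid values, and coefficients are invented for the example rather than taken from the study.

```python
import itertools

def predict_yield(honeybee, bumblebee, max_temp_c):
    """Hypothetical stand-in for a trained yield model (NOT the authors' model)."""
    # Made-up response: either bee group can pollinate; yield peaks at 24 C.
    pollination = 1.0 - (1.0 - 0.6 * honeybee) * (1.0 - 0.8 * bumblebee)
    weather = 1.0 - ((max_temp_c - 24.0) / 15.0) ** 2
    return 9000.0 * pollination * max(0.0, weather)

# Assumed candidate values for each factor, for illustration only.
GRID = {
    "honeybee": [0.0, 0.25, 0.5, 0.75, 1.0],
    "bumblebee": [0.0, 0.25, 0.5, 0.75, 1.0],
    "max_temp_c": [18.0, 21.0, 24.0, 27.0, 30.0],
}
BASELINE = {"honeybee": 0.5, "bumblebee": 0.5, "max_temp_c": 24.0}

def one_at_a_time(model, grid, baseline):
    """Range of predictions when one factor varies and the rest stay at baseline."""
    spans = {}
    for name, values in grid.items():
        preds = [model(**{**baseline, name: v}) for v in values]
        spans[name] = max(preds) - min(preds)
    return spans

def best_setting(model, grid):
    """Exhaustively evaluate every grid combination and return the best one."""
    names = list(grid)
    combos = (dict(zip(names, c)) for c in itertools.product(*grid.values()))
    return max(combos, key=lambda s: model(**s))

sensitivity = one_at_a_time(predict_yield, GRID, BASELINE)
optimum = best_setting(predict_yield, GRID)
for name, span in sorted(sensitivity.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} span = {span:8.1f}")
print("best grid point:", optimum, "->", round(predict_yield(**optimum), 1))
```

A one-at-a-time scan ignores interactions between factors, so the grid search is the more reliable of the two when the factors are few enough to enumerate.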
Model development and interpretation
This research has investigated an integrative method for crop yield prediction, which overcomes the limitations of approaches using either empirical modeling or computer simulation. Empirical modeling enables quick prediction. However, when the predicted values of important predictors go beyond the range that was used for model calibration, the validity of future extrapolation is questionable (Soltani et al., 2016). In contrast with empirical models, computer simulations which are based on
Conclusions
We investigated four machine learning algorithms: multiple linear regression, boosted decision tree, random forest, and extreme gradient boosting algorithms to develop predictive models for wild blueberry yield. The input dataset was developed from a simulation model of wild blueberry pollination. This model takes advantage of field data accumulated over 30 years of wild blueberry pollination research in Maine and the Canadian Maritimes (Qu and Drummond, 2018). The performances of the
CRediT authorship contribution statement
Efrem Yohannes Obsie: Methodology, Software, Formal analysis, Writing - original draft. Hongchun Qu: Conceptualization, Methodology, Formal analysis, Writing - original draft, Writing - review & editing, Supervision, Project administration, Funding acquisition. Francis Drummond: Conceptualization, Investigation, Supervision, Resources, Formal analysis, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We acknowledge funding by the National Natural Science Foundation of China (61871061) and Program of Basic Research and Frontier Technology of Chongqing Municipal Science Commission (cstc2017jcyjAX0453) to HQ.
References (56)
- et al.
Grid-Set-Match, an agent-based simulation model, predicts fruit set for the lowbush blueberry (Vaccinium angustifolium) agroecosystem
Ecol. Modell.
(2017)
- et al.
Analysis of car crash simulation data with nonlinear machine learning methods
Procedia Comput. Sci.
(2013)
- et al.
Machine learning approaches for crop yield prediction and nitrogen status estimation in precision agriculture: A review
Comput. Electron. Agric.
(2018)
- et al.
Evaluating the utility of the medium-spatial resolution Landsat 8 multispectral sensor in quantifying aboveground biomass in uMgeni catchment, South Africa
ISPRS J. Photogramm. Remote Sens.
(2015)
Optimal foraging in bumblebees: hunting by expectation
Anim. Behav.
(1981)
- et al.
APSIM–evolution towards a new generation of agricultural systems simulation
Environ. Model. Softw.
(2014)
- et al.
Crop yield forecasting on the Canadian Prairies by remotely sensed vegetation indices and machine learning methods
Agric. For. Meteorol.
(2016)
- et al.
General models for estimating daily global solar radiation for different solar radiation zones in mainland China
Energy Convers. Manag.
(2013)
- et al.
Development of a nitrogen recommendation tool for corn considering static and dynamic variables
Eur. J. Agron.
(2019)
- et al.
Simulation-based modeling of wild blueberry pollination
Comput. Electron. Agric.
(2018)
A spatially explicit agent-based simulation platform for investigating effects of shared pollination service on ecological communities
Simul. Model. Pract. Theory
Using boosted tree regression and artificial neural networks to forecast upland rice yield under climate change in Sahel
Comput. Electron. Agric.
Effect of a honey bee (Hymenoptera: Apidae) gradient on the pollination and yield of lowbush blueberry
J. Econ. Entomol.
Economic risk of bee pollination in Maine wild blueberry, Vaccinium angustifolium
J. Econ. Entomol.
Yield variation among clones of lowbush blueberry as a function of genetic similarity and self-compatibility
J. Am. Soc. Hortic. Sci.
Random forests
Mach. Learn.
Weather-based prediction of anthracnose severity using artificial neural network models
Plant Pathol.
Xgboost: A scalable tree boosting system, in
Machine learning methods for crop yield prediction and climate change impact assessment in agriculture
Environ. Res. Lett.
Digital reality: a model-based approach to supervised learning from synthetic data
AI Perspect.
Neural network classification of tan spot and Stagonospora blotch infection periods in a wheat field environment
Phytopathology
Improving regressors using boosting techniques
ICML.
Reproductive biology of wild blueberry (Vaccinium angustifolium Aiton)
Agriculture
Factors That Affect Yield in Wild Blueberry, (Vaccinium Angustifolium Aiton)
Agric. Res. Technol. Open Access J.
Behavior of bees associated with the wild blueberry agro-ecosystem in the USA
Int. J. Entomol. Nematol.
A natural history of change in native bees associated with lowbush blueberry in Maine
Northeast. Nat.
The ecology of autogamy in wild blueberry (Vaccinium angustifolium Aiton): Does the early clone get the bee?
Agron
Statistical and neural methods for site–specific yield prediction
Trans. ASAE