A hybrid ensemble for classification in multiclass datasets: An application to oilseed disease dataset

doi:10.1016/j.compag.2016.03.026

Computers and Electronics in Agriculture

Volume 124, June 2016, Pages 65-72

https://doi.org/10.1016/j.compag.2016.03.026

Highlights

• A new hybrid ensemble algorithm for multiclass classification problems is proposed.
• It uses machine learning algorithms, a feature ranking method and an instance filter.
• Its aim is to improve on the performance of ensemble-Vote.
• It is tested on four standard datasets and applied to an oilseed disease dataset.
• It outperforms ensemble-Vote, with a classification accuracy of 94.73%.

Abstract

The paper presents a new hybrid ensemble approach that combines machine learning algorithms, a feature ranking method and a supervised instance filter. Its aim is to improve the performance of machine learning algorithms on multiclass classification problems. The effectiveness of the new hybrid ensemble approach is tested on four standard agriculture multiclass datasets, and it performs better on all of them. It is then applied to a multiclass oilseed disease dataset. It is observed that ensemble-Vote performs better than the Logistic Regression and Naïve Bayes algorithms. The results of the hybrid ensemble are compared with those of ensemble-Vote. They show that the new hybrid ensemble approach outperforms ensemble-Vote, with oilseed disease classification accuracy improved up to 94.73%.

Introduction

Machine learning algorithms support effective decision making in agriculture. They are well suited to extracting the complicated relationships that exist in agricultural data (Rocha et al., 2010). High-dimensional agricultural data calls for machine learning feature selection algorithms when the most explanatory features or attributes are to be selected from large datasets (El-Bendary et al., 2015, Hill et al., 2014, Kundu et al., 2011, Timmermans and Hulzebosch, 1996). Machine learning classification algorithms, viz. Logistic Regression and Naïve Bayes, have been used successfully for accurate identification of crop diseases (Phadikar et al., 2013, Sankaran et al., 2010, Gutiérrez et al., 2008, Baker and Kirk, 2007).

Soybean, groundnut and rapeseed-mustard are the three most important oilseed crops of the world, and they play an important role in the oilseed economy. One of the major concerns in increasing and stabilizing the yield of oilseeds is the incidence of pests and diseases, which are to a great extent responsible for the low and unstable production of these crops. Oilseeds are susceptible to various diseases caused by bacteria, fungi, viruses, nematodes and physiological disorders. Some diseases are widespread and cause great economic losses. Others are limited in distribution and are not of much economic importance at present, but may become major diseases over time under favorable climatic conditions. The oilseed diseases considered in the present work are Alternaria leaf spot, Anthracnose, Cercospora leaf spot, Charcoal rot, Collar rot, Myrothecium leaf spot, Powdery mildew, Sclerotinia stem rot, Phyllosticta leaf spot and Rust. Crop disease diagnosis is therefore a multiclass classification problem.

In several classification problems, ensembles have proved more effective than a single classification algorithm (Bolón-Canedo et al., 2012, Sun et al., 2007, Stamatatos and Widmer, 2005). Ensembles have great potential in the domain of multiclass classification. Ensemble machine learning methods have been recommended in the literature for different types of classification problems (Hsu, 2012, Kotsiantis, 2007, Dietterich, 2000, Bay, 1999, Opitz, 1999, Ting and Witten, 1999, Zheng and Webb, 1999, Ho, 1998, Breiman, 1996, Wolpert, 1992, Hansen and Salamon, 1990, Schapire, 1990).

In the present work, Vote is an ensemble of the Logistic Regression and Naïve Bayes algorithms. This work proposes a new hybrid ensemble approach that aims to improve the performance of machine learning algorithms on multiclass classification problems. A further aim is to compare the proposed hybrid ensemble approach with ensemble-Vote. The proposed approach is applied to the multiclass problem of oilseed disease diagnosis for accurate identification of disease(s).

The paper is organized as follows: Section 2 describes the materials and methods used in the present work. Section 3 presents the new hybrid ensemble approach for multiclass classification problems. Section 4 presents the results and discussion. Section 5 presents the conclusions drawn.

Section snippets

Materials and methods

The tool WEKA (Hall et al., 2009, Witten and Frank, 2005) is used to generate the predictive models. It is an open-source tool developed at the University of Waikato, New Zealand (http://www.cs.waikato.ac.nz/ml/Weka/).

The proposed hybrid ensemble approach

The hybrid ensemble design is based on the principle that combining the results of multiple machine learning algorithms is superior to relying on the result of a single algorithm.
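
The combining principle can be illustrated with a minimal Python sketch that averages the class-probability outputs of two base classifiers and picks the class with the highest mean. The averaging rule, the function name `average_vote` and all numbers are assumptions made for illustration only; the paper builds its models in WEKA, and the combination rule it uses is not stated in this excerpt.

```python
from typing import Dict, List


def average_vote(distributions: List[Dict[str, float]]) -> str:
    """Average class probabilities from several classifiers and return the top class."""
    if not distributions:
        raise ValueError("need at least one probability distribution")
    totals: Dict[str, float] = {}
    for dist in distributions:
        for label, prob in dist.items():
            totals[label] = totals.get(label, 0.0) + prob
    # Dividing by the number of classifiers does not change the winner,
    # but it keeps the combined scores interpretable as probabilities.
    averaged = {label: total / len(distributions) for label, total in totals.items()}
    return max(averaged, key=averaged.get)


# Hypothetical outputs of two base classifiers for one sample (made-up numbers).
logistic_output = {"Rust": 0.50, "Anthracnose": 0.40, "Charcoal rot": 0.10}
naive_bayes_output = {"Rust": 0.20, "Anthracnose": 0.70, "Charcoal rot": 0.10}

predicted = average_vote([logistic_output, naive_bayes_output])
print(predicted)
```

In this made-up case the first classifier alone would choose Rust, but the averaged scores favor Anthracnose, which shows how a second opinion can overturn a weakly held single-model decision.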

Results and discussion

Ten-fold cross validation has been used successfully to evaluate the performance of machine learning algorithms, as it offers reliable estimates of classification accuracy on each classification task (Arora and Jain, 2014, Azar et al., 2014, Baldi et al., 2000). The experiments evaluating the performance of the hybrid ensemble are performed using a 10-fold cross validation strategy.
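
The evaluation protocol can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the WEKA procedure used in the paper: the folds are shuffled but not stratified by class, and the names `k_fold_indices`, `cross_validated_accuracy` and `majority_baseline`, along with the toy data, are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k (train, test) index pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validated_accuracy(features: Sequence, labels: Sequence,
                             fit_predict: Callable, k: int = 10, seed: int = 0) -> float:
    """Pool correct predictions over all k test folds and return overall accuracy."""
    correct = 0
    for train, test in k_fold_indices(len(labels), k, seed):
        predictions = fit_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in test],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, test))
    return correct / len(labels)


def majority_baseline(train_x, train_y, test_x):
    """Hypothetical stand-in classifier: always predict the most frequent training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return [most_common] * len(test_x)


# Toy data (made up): 30 samples of one class and 10 of another.
labels = ["Rust"] * 30 + ["Anthracnose"] * 10
features = [[0.0]] * len(labels)
splits = k_fold_indices(len(labels))
accuracy = cross_validated_accuracy(features, labels, majority_baseline)
print(accuracy)
```

Every sample is used for testing exactly once and for training in the other nine folds, so the reported accuracy never comes from a sample the model was fitted on.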

Conclusions

The paper proposes a new hybrid ensemble approach for improving classification accuracy on multiclass classification problems. It is successfully applied to the accurate diagnosis of oilseed diseases. The performance of the proposed hybrid ensemble is tested for classification accuracy with 10-fold cross validation on four standard agriculture datasets. The accuracy results obtained for these standard datasets prove that the hybrid ensemble approach shows better classification accuracies as

References (47)

A.T. Azar et al., A random forest classifier for lymph diseases, Comput. Meth. Programs Biomed. (2014)
K.M. Baker et al., Comparative analysis of models integrating synoptic forecast data into potato late blight risk estimate systems, Comput. Electron. Agric. (2007)
R. Battiti et al., Democracy in neural nets: voting schemes for classification, Neural Netw. (1994)
S.D. Bay, Nearest neighbor classification from multiple feature subsets, Intell. Data Anal. (1999)
V. Bolón-Canedo et al., An ensemble of filters and classifiers for microarray data classification, Pattern Recogn. (2012)
R. Danger et al., A comparison of machine learning techniques for detection of drug target articles, J. Biomed. Inform. (2010)
P.A. Gutiérrez et al., Logistic regression product-unit neural networks for mapping Ridolfia segetum infestations in sunflower crop using multitemporal remote sensed data, Comput. Electron. Agric. (2008)
M.G. Hill et al., The use of data mining to assist crop protection decisions on kiwifruit in New Zealand, Comput. Electron. Agric. (2014)
A. Özçift, Random forests ensemble classifier trained with data resampling strategy to improve cardiac arrhythmia diagnosis, Comput. Biol. Med. (2011)
S. Phadikar et al., Rice diseases classification using feature selection and rule generation techniques, Comput. Electron. Agric. (2013)
A. Rocha et al., Automatic fruit and vegetable classification from images, Comput. Electron. Agric. (2010)
S. Sankaran et al., A review of advanced techniques for detecting plant diseases, Comput. Electron. Agric. (2010)
L.O.L.A. Silva et al., Comparative assessment of feature selection and classification techniques for visual inspection of pot plant seedlings, Comput. Electron. Agric. (2013)
E. Stamatatos et al., Automatic identification of music performers with learning ensembles, Artif. Intell. (2005)
S. Sun et al., An experimental evaluation of ensemble methods for EEG signal classification, Pattern Recogn. Lett. (2007)
A.J.M. Timmermans et al., Computer vision system for on-line sorting of pot plants using an artificial neural network classifier, Comput. Electron. Agric. (1996)
D.H. Wolpert, Stacked generalization, Neural Netw. (1992)
E. Yen et al., Relaxing instance boundaries for the search of splitting points of numerical attributes in classification trees, Inf. Sci. (2007)
A. Arora et al., Machine learning for diagnosis of soybean diseases, Soybean Res. (2014)
P. Baldi et al., Assessing the accuracy of prediction algorithms for classification and overview, Bioinformatics (2000)
Bartaria, A.M., Shukla, A.K., Kaushik, C.D., Kumar, P.R., Singh, N.B., 2001. Major diseases of Rapeseed-Mustard and...
E. Bauer et al., An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Mach. Learn. (1999)
L. Breiman, Bagging predictors, Mach. Learn. (1996)
