Abstract
Conventional image processing techniques have long been applied in agricultural machine vision to identify crops for quality control, weed detection, automated spraying, and harvesting. With recent advancements in computational hardware, Region-based Convolutional Networks have met with varying levels of success in object detection and classification. In this study we found that a Region-based Convolutional Neural Network was able to achieve a 92% accuracy rating, while a Region-based Fully Convolutional Network achieved an 87% accuracy rating, for object detection on a newly created agricultural mushroom dataset.
Keywords
- Agriculture
- Convolutional Networks
- Object detection
- Machine vision
- Region-Based Convolutional Neural Network
- Region-Based Fully Convolutional Networks
1 Introduction
As it stands today, 10% of the world's population does not have access to enough food to maintain a healthy lifestyle [1], and by the year 2050 there will be an estimated 9 to 10 billion people to feed [2]. The field of agritech is built around applying technology to solve agricultural problems, one of which is how to feed the planet's expanding population. Mushrooms face a number of machine vision issues encountered by other agricultural crops, including growth overlap, clustering, and differences in size due to position in the growth cycle. Due to their rapid growth cycle, the potential for error reduction through automation, and their popularity as an important food source, agricultural mushrooms represent a prime crop for analyzing the potential of Region-based Convolutional models. Because agricultural mushrooms bear similarities to other crops in both their feature distributions and the machine vision issues they present, techniques learned through the examination of Region-based Convolutional Networks on mushrooms could transfer to other agricultural crops.
In this study we propose training a series of Region-based Convolutional models to outperform a standard image processing technique at end-to-end object detection. Due to the lack of agricultural datasets of sufficient size for training convolutional machine learning models, these models were trained using a newly constructed mushroom dataset. The end result, post training, is a machine learning model able to detect agricultural mushrooms to a high degree of accuracy. To the knowledge of these researchers, this is the first method of using machine intelligence for the purpose of identifying agricultural mushrooms.
2 Related Work
Neural networks hold a clear advantage over standard image processing algorithms in that they learn from examples rather than explicit programming [3]. Classifying apples into quality grades was undertaken by researchers [4] who found that, although a neural network produced highly accurate results on the top quality grades, it mistakenly classified bruises as vines on apples, leading to misclassification of lower-grade apples. Neural networks also generalize better than standard image processing algorithms, as shown by researchers who employed a neural network to classify barley, wheat, oats, and rye with a high degree of success [5]. Areas that have traditionally relied on statistical variables, such as yield prediction, have also benefited from neural networks: an artificial neural network was previously trained by researchers to predict corn and soybean crop yields [6].
Researchers studying the ability of Convolutional Neural Networks to generalize developed a fruit detection network that, when trained on one type of fruit, was able to recognize other classes of fruit with moderate accuracy [7]. Research into Convolutional Neural Networks in agriculture has shown that they can generalize well [7] and can produce high segmentation accuracy [8]. First proposed in 2015 [9], Fully Convolutional Networks replaced the fully connected layers that terminate the traditional Convolutional Neural Network with a series of upsampling layers. Researchers applied Fully Convolutional Networks to the identification of weeds in cereal grain fields [10] with a moderate level of success. Fully Convolutional Networks were later used to learn pixel-wise semantic segmentation, identifying whether a particular object is present in a particular pixel [11]. Although the researchers' accuracy was moderate, the model still performed significantly better than the baseline classifier and had an extremely quick inference speed [11].
Standard Convolutional and Fully Convolutional Networks have been shown to provide high accuracy on image segmentation and classification tasks; however, object detection remains a complex process due to the need to identify and classify multiple objects in a single image. First proposed in 2014 [12], Region-based Convolutional Networks are among the most recent advances in neural networks, and their adoption has been limited by their complexity and computational requirements. The main issue with the original Region-based Convolutional Network was that each proposed object was passed through the convolutional portion of the network independently, consuming large amounts of time and computational resources. The Fast Region-based Convolutional Network [13] rectified some of these issues by introducing a Region of Interest pooling layer followed by a series of fully connected layers to produce predictions.
Due to their resource requirements and complexity, Region-based Convolutional Networks have seen limited application, with most work focused on training with the PASCAL VOC and COCO datasets. Applications of these networks have included face detection [14], cell phone detection in cars [15], and real-time object detection [16]. The Faster Region-based Convolutional Network [16] further increased the speed and efficiency of the Fast Region-based Convolutional Neural Network [13] by sharing the convolutional layers of the region proposal network with the object detection network. This architecture forms the basis for all Faster Region-based Convolutional Neural Networks, although the structure of the network used determines the speed and accuracy of the results [17].
3 Method
The Region-based Convolutional Network uses a series of modules to classify and detect objects in an image. The first module is a convolutional feature extractor and can be structured in a number of different ways to trade off speed and accuracy [17]. This module produces a series of convolutional feature maps generated from an input image. It behaves in the same manner as a standard Convolutional Network, except that the network is not terminated in a fully connected or upsampling layer; instead, the output from the final pooling layer is passed to the next module. The next module, known as the proposal generator or Region Proposal Network, is responsible for compiling a list of proposed objects that it believes have been detected. To perform this task, the detection module identifies a series of anchor points and performs two predictions for each: a class prediction and the offset by which the anchor must be shifted to match the ground truth box [17]. To determine these anchor points, the Region Proposal Network uses a Region of Interest pooling layer that takes the maps from the feature extraction module and distills them into vectors of pooled features per detected region.
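The anchor-and-offset scheme described above can be sketched as follows. This is an illustrative helper, not code from the paper: it applies the standard box parameterization used by Faster R-CNN [16], in which predicted offsets scale and shift an anchor box into a final box prediction.

```python
import numpy as np

def decode_anchor(anchor, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor box.

    Uses the standard parameterization from Faster R-CNN [16]:
    x = tx * wa + xa, y = ty * ha + ya, w = wa * exp(tw), h = ha * exp(th),
    where boxes are given as (x_center, y_center, width, height).
    """
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = offsets
    x = tx * wa + xa            # shift centre horizontally, scaled by anchor width
    y = ty * ha + ya            # shift centre vertically, scaled by anchor height
    w = wa * np.exp(tw)         # exponential keeps width strictly positive
    h = ha * np.exp(th)
    return x, y, w, h
```

With zero offsets the anchor is returned unchanged; a log-space width offset of \(\ln 2\) doubles the predicted width.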
In the case of a Region-based Convolutional Neural Network [16], the output from the Region Proposal Network is received by a module known as the box classifier. This output is fed through a series of fully connected layers that diverge at the end into two output streams, one providing object classification via the Softmax function and the other bounding box coordinates. Figure 1 is a visual representation of a Region-based Convolutional Neural Network. The most recent derivative of the Region-based Convolutional Network is the Region-based Fully Convolutional Network, first proposed in 2016 [18]. This network uses the same two-module structure for proposals and feature map generation but replaces the final fully connected layers with convolutional layers, allowing information to be shared end-to-end in the network. In its second stage, the Fully Convolutional Network uses position-sensitive layers to predict the relative position of an object. The pooling layer then aggregates these maps into a map with a set of positional coordinates, which is also fed through a Softmax function to determine a classification score. Figure 2 is a visualization of a Region-based Fully Convolutional Network in which each positional layer in the 2\(\,\times \,\)2 pooling layer corresponds to an individual relative position, i.e. top left, top right, bottom left, and bottom right, extracted from the feature maps of the feature extraction network.
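The position-sensitive pooling of the 2\(\,\times \,\)2 case above can be illustrated with a toy NumPy sketch. This is a deliberate simplification of R-FCN [18] for a single class and a single region: each quadrant of the box is average-pooled from its *own* score map, and the four bin responses are then averaged ("voted") into one class score.

```python
import numpy as np

def position_sensitive_score(score_maps, box):
    """Toy 2x2 position-sensitive RoI pooling for one class (R-FCN style).

    score_maps: array of shape (4, H, W), one map per relative position
    (top-left, top-right, bottom-left, bottom-right).
    box: (x0, y0, x1, y1) region of interest in pixel coordinates.
    """
    x0, y0, x1, y1 = box
    xm, ym = (x0 + x1) // 2, (y0 + y1) // 2   # midpoints split the box into 4 bins
    bins = [
        (0, (slice(y0, ym), slice(x0, xm))),  # top-left bin reads map 0
        (1, (slice(y0, ym), slice(xm, x1))),  # top-right bin reads map 1
        (2, (slice(ym, y1), slice(x0, xm))),  # bottom-left bin reads map 2
        (3, (slice(ym, y1), slice(xm, x1))),  # bottom-right bin reads map 3
    ]
    # Average-pool each bin from its own map, then vote by averaging the bins.
    return float(np.mean([score_maps[k][s].mean() for k, s in bins]))
```

Because each bin only consults the map trained for that relative position, spatial layout survives into the classification score, which a fully connected head would discard.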
To address the issue of end-to-end object detection on agricultural mushrooms, we propose the creation of two Region-based Convolutional models and the application of an image processing algorithm to determine the best approach. The convolutional models comprise a Region-based Convolutional Network (RCNN) and a Region-based Fully Convolutional Network (RFCN). For the feature extraction portion of both networks, the Inception V2 network was selected due to its superior accuracy results [17]. The difference between the RCNN and RFCN lies in stage 2 of the network: the RCNN terminates in a fully connected network, while the RFCN terminates in a series of convolutional layers. Both networks end with the Softmax function, which is used for multi-class classification scenarios. The image processing algorithm was a Laplacian of Gaussian (LoG) blob detector suited to identifying bright blobs on a dark background. The models were trained using a newly constructed agricultural mushroom dataset on the high performance computing platform SHARCNET. Accuracy results for the models were extracted from the training phases, while a series of inference experiments was conducted to gather model timings.
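The LoG idea can be sketched in a few lines. This is not the paper's implementation; it is a minimal single-scale illustration using SciPy's `gaussian_laplace` filter, which shows why bright blobs on a dark background are detectable: the LoG response is strongly negative at the centre of a bright blob, so response minima mark blob-like locations.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def log_blob_center(image, sigma):
    """Return the (row, col) of the strongest single-scale LoG response.

    For a bright blob on a dark background the Laplacian-of-Gaussian
    response is most negative at the blob centre, so the minimum of the
    filtered image marks the most blob-like location.
    """
    response = gaussian_laplace(image.astype(float), sigma=sigma)
    return np.unravel_index(np.argmin(response), response.shape)

# Synthetic test image: a bright Gaussian "mushroom cap" centred at row 40, col 24.
yy, xx = np.mgrid[0:64, 0:64]
img = np.exp(-((yy - 40) ** 2 + (xx - 24) ** 2) / (2 * 5.0 ** 2))
```

A full detector additionally searches over a range of `sigma` values (blob sizes) and applies a threshold, which is what makes per-image parameter tuning necessary.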
4 Experiments
The experiments were conducted using a newly constructed agricultural dataset of white agricultural mushrooms containing 310 images of 18 different mushroom beds. The dataset consisted of ground truth images and collections of coordinates representing the centroid and radius of each identified mushroom. The annotations were found to be not entirely correct, as the process used to generate them involved applying a watershed image processing procedure and then having a human correct as many annotations as possible by hand. It was decided that the time it would take to re-annotate the images with bounding boxes was not worthwhile, since the correctly annotated images outnumbered the incorrect ones and approximated the boundaries of the mushrooms to a high degree of accuracy. To determine bounding box coordinates, each record was read into a Python preprocessing program, which produced an output file for each image containing a list of records. Within the file, each record contained four values, \(xmin\), \(ymin\), \(xmax\), and \(ymax\), representing the top left and bottom right corners of a single bounding box encompassing a single mushroom.
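The centroid-and-radius to bounding-box conversion described above amounts to taking the axis-aligned square that circumscribes each annotated circle. A minimal sketch (the function name is ours, not from the paper's preprocessing program):

```python
def circle_to_bbox(cx, cy, r):
    """Convert a centroid-and-radius annotation (cx, cy, r) into the
    (xmin, ymin, xmax, ymax) record format described above: the top-left
    and bottom-right corners of the square circumscribing the circle."""
    return cx - r, cy - r, cx + r, cy + r
```

In practice the result would also be clipped to the image bounds for mushrooms near an edge.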
Once the collection of images and ground truth data was generated, it was fed into a Python image preprocessor that segmented each input image into 256\(\,\times \,\)256 sub-images for consumption by the convolutional network experiments. A sliding window was passed over each image with a stride of 80 pixels on both the X and Y axes to ensure that an entire mushroom appeared in at least one sub-image; the 80-pixel stride was derived from one of the largest mushrooms contained in the dataset. This process produced, on average, 250 sub-images for each primary image. To increase the dataset to a size that would produce meaningful results when consumed by a convolutional network, the standard preprocessing technique of image rotation was applied. After this process was completed, a dataset of 78,120 images was produced, containing in total approximately 5.1 billion pixels of image data.
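The sliding-window segmentation step can be sketched as a small generator; this is an illustrative reconstruction of the procedure described above, not the paper's preprocessor.

```python
import numpy as np

def sliding_windows(image, size=256, stride=80):
    """Yield (y, x, crop) windows of `size` x `size` pixels, sliding with
    the given stride over both axes, as in the preprocessing above.
    Windows that would run past the image edge are skipped."""
    h, w = image.shape[:2]
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            yield y, x, image[y:y + size, x:x + size]
```

For a 416\(\,\times \,\)416 image this yields a 3\(\,\times \,\)3 grid of nine overlapping 256\(\,\times \,\)256 crops; real bed images are far larger, hence the ~250 sub-images per image reported above.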
The code for all networks was developed using Python 2.7.10, with the sklearn 0.19.1 [19] package used for the LoG image processing algorithm and TensorFlow 1.5 [20] used for the convolutional models. The Region-based networks were implemented using Google's object detection API [17], built on the TensorFlow platform. The experiments were carried out on the Copper SHARCNET host, which specializes in GPU model execution. The Laplacian of Gaussian experiment was given access to 4 CPU nodes and 32 MB of memory, while the convolutional models were given access to 4 CPU nodes, 8 GPU nodes, and 24 GB of memory. This difference in resources did not affect the results, as the image processing algorithm could not be GPU optimized and did not require the same amount of RAM, there being no complex model to store in memory.
To evaluate the performance of the models and discuss the advantages and disadvantages of each, two main metrics were considered: timing and accuracy. Timing was represented in two forms: the time it took to train a model (training time) and the time it took a model to infer a single image (inference time). Accuracy was measured with a single metric, average precision, calculated using an Intersection over Union (IoU) threshold of 0.5. The results of the experiments were validated using \(k\)-fold cross validation with a \(k\) value of 10.
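The IoU criterion underlying the average precision metric is worth making concrete; a detection counts as correct when its overlap ratio with a ground truth box meets the 0.5 threshold. A standard implementation:

```python
def iou(a, b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes:
    the area both boxes share, divided by the area either box covers."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])     # intersection top-left
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])     # intersection bottom-right
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)   # zero if boxes are disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)
```

For example, two 10\(\,\times \,\)10 boxes offset by half their width share 50 of 150 total pixels, an IoU of 1/3, and would not count as a match at the 0.5 threshold.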
4.1 Results and Discussion
The primary purpose of the experiments was to determine which model operated as the best end-to-end agricultural mushroom detector. The results in Table 1 show that the Region-based Convolutional Network (RCNN) took approximately 10.2 min longer to train on the same number of steps, amounting to an almost 7 s per step longer training time than the Region-based Fully Convolutional Network (RFCN). The Laplacian of Gaussian (LoG), being an algorithm and not a trainable model, had no training time to endure, nor did it possess a trained model size. Once training was complete, inference experiments were conducted, with results displayed in Table 2. The RFCN inferred images the quickest, while the LoG algorithm was approximately 0.3 s slower. The RCNN was the slowest inference model, with almost a 3 s difference between itself and the RFCN. Considering both the training time and the inference results, the RFCN is clearly the faster and more efficient model, especially when factoring in its small model size, which would be critical in an applied platform scenario.
The training accuracy displayed in Table 3 for the LoG algorithm indicates that it was only able to predict the location of approximately 30% of all mushrooms within the provided images. The issue was that the parameters tuned for this dataset favored larger single mushrooms, forcing lower accuracy on images that contained clusters of mushrooms. These clusters were often predicted as single large mushrooms rather than the groupings they actually were. This highlights the key problem with image processing algorithms: parameters must be tuned per image to maximize accuracy. Across the dataset, the highest per-image accuracy for the LoG algorithm was approximately 77% while the lowest was approximately 2%.
The RCNN achieved the highest accuracy rating during the training cycle, besting the RFCN by approximately 5%. From the images in Fig. 3 it is clear that the RCNN was able to accurately predict the location of all the bounding boxes within the ground truth image as well as approximate the size of the boxes to a high degree. The RFCN was also able to predict the location and size of all the ground truth bounding boxes to a high degree; however, the network encountered two issues. First, the RFCN predicted two mushrooms in the lower right portion of the image that are not identified in the ground truth image. They are absent from the ground truth because each mushroom is not entirely contained within the frame of the image and is therefore not included in the annotation for that image. The RCNN successfully excluded mushrooms that were not fully contained within the frame, whereas mushrooms that were mostly contained within the frame caused the RFCN some difficulty. Mushrooms that were mostly out of frame were successfully ignored by both networks, as evident from the mushrooms in the top left of the image. This difference in inference was likely caused by the structure of the final layers of each network. The output from the final pooling layer of the RCNN is passed into a series of fully connected layers that adjust their weights to minimize the end loss function. In doing so, the spatial information gained from the convolutional layers is essentially destroyed, and the network makes weight adjustments primarily based on loss minimization over learned features. With the RFCN and its end-to-end convolutional structure, the spatial information learned during training is passed through to the decision portion of the network, which is likely the cause of the RFCN's ability to identify unannotated and partially out-of-frame mushrooms.
The second issue encountered by the RFCN was duplicate box predictions. Duplicate box predictions were also the prime cause of reduced accuracy with the LoG image processing algorithm. In the case of the RFCN, the model predicted multiple boxes of different shapes that it believed were mushrooms, as can be seen in the top left corner of the RFCN prediction image in Fig. 3. This differs slightly from the problem faced by the LoG algorithm, which tended to predict clusters of mushrooms as single bounding boxes, or larger bounding boxes overlapping a series of smaller ones, whereas the RFCN tended to predict multiple boxes of varying overlapping sizes for the same mushroom. Interestingly, the RFCN variant is able to predict mushrooms that are not present in the ground truth image, whereas the convolutional variant tends to adhere strictly to the ground truth. Figure 4 shows that the RCNN predictions match the ground truth annotations identically, while the RFCN successfully predicted the location of a mushroom that did not exist in the ground truth annotation. The result was that the RFCN's overall accuracy was penalized because the mushroom did not exist in the ground truth annotation. This may limit the applied capabilities of the RCNN, as well as reduce its effectiveness on datasets with incomplete annotations. The lower accuracy experienced by the RFCN may not be a result of poor network performance, but rather a sign of poor dataset annotations.
5 Conclusion
The experiments conducted in this paper focused on using Region-based Convolutional Networks for end-to-end agricultural mushroom detection. Two types of Region-based Convolutional Networks were compared against the Laplacian of Gaussian (LoG) image processing algorithm, which can operate in both a segmentation and a detection capacity. Our experiments found that the Region-based Convolutional Neural Network (RCNN) outperformed the LoG algorithm by a significant margin while also outperforming the Region-based Fully Convolutional Network (RFCN) by a 5% margin in accuracy. The increase in accuracy came at the expense of a marginal increase in training time, 10.2 min overall, equating to an approximate increase of 7 s per step in the training cycle. Although the RCNN achieved a 92.142% detection accuracy, outperforming the RFCN and the LoG algorithm, its tendency to adhere strictly to the ground truth dataset may limit its applied capabilities, especially when using a dataset with incomplete annotations. Our results suggest that the RCNN is currently the more appropriate network for end-to-end detection when used with a well-annotated dataset. If the primary issue of duplicate box predictions encountered by RFCNs can be rectified, through a greater number of training examples or a form of post processing, then RFCNs, with their faster training and inference times and their retention of spatial features, may represent the network of choice.
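One widely used form of post processing for duplicate box predictions, though not evaluated in this paper, is greedy non-maximum suppression (NMS): keep the highest-scoring box, discard any remaining box that overlaps it beyond a threshold, and repeat. A minimal sketch:

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS over (xmin, ymin, xmax, ymax) boxes with matching scores.
    Returns the indices of the boxes kept, highest score first."""
    def iou(a, b):
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / float(union)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)              # best remaining box wins
        keep.append(i)
        # Drop every remaining box that overlaps the winner too much.
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_threshold]
    return keep
```

Applied to the RFCN's overlapping predictions for a single mushroom, this would collapse each overlapping group to its single most confident box.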
References
Food and Agriculture Organization of the United Nations. http://www.fao.org
World Population. http://www.worldometers.info/
Kondo, N., Ahmad, U., Monta, M., Murase, H.: Machine vision based quality evaluation of Iyokan orange fruit using neural networks. Comput. Electron. Agric. 29(1–2), 135–147 (2000)
Nakano, K.: Application of neural networks to the color grading of apples. Comput. Electron. Agric. 18(2–3), 105–116 (1997)
Paliwal, J., Visen, N.S., Jayas, D.S.: Evaluation of neural network architectures for cereal grain classification using morphological features. J. Agric. Eng. Res. 79(4), 361–370 (2001)
Kaul, M., Hill, R.L., Walthall, C.: Artificial neural networks for corn and soybean yield prediction. Agric. Syst. 85(1), 1–18 (2005)
Sa, I., Ge, Z., Dayoub, F., Upcroft, B., Perez, T., McCool, C.: Deepfruits: a fruit detection system using deep neural networks. Sensors 16(8), 1222 (2016)
Ren, M., Zemel, R.S.: End-to-End Instance Segmentation and Counting with Recurrent Attention. arXiv preprint arXiv:1605.09410 (2016)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
Dyrmann, M., Jørgensen, R.N., Midtiby, H.S.: RoboWeedSupport-detection of weed locations in leaf occluded cereal crops using a fully convolutional neural network. Adv. Anim. Biosci. 8(2), 842–847 (2017)
Pathak, D., Shelhamer, E., Long, J., Darrell, T.: Fully convolutional multi-class multiple instance learning. arXiv preprint arXiv:1412.7144 (2014)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
Girshick, R.: Fast R-CNN. arXiv preprint arXiv:1504.08083 (2015)
Sun, X., Wu, P., Hoi, S.C.: Face detection using deep learning: an improved faster RCNN approach. arXiv preprint arXiv:1701.08289 (2017)
Le, T.H.N., Zheng, Y., Zhu, C., Luu, K., Savvides, M.: Multiple scale faster-RCNN approach to driver's cell-phone usage and hands on steering wheel detection. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 46–53. IEEE, June 2016
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., Murphy, K.: Speed/accuracy trade-offs for modern convolutional object detectors. In: IEEE CVPR, July 2017
Dai, J., Li, Y., He, K., Sun, J.: R-FCN: Object detection via region-based fully convolutional networks. In: Advances in Neural Information Processing Systems, pp. 379–387 (2016)
Scikit-Learn. http://scikit-learn.org/stable/
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
© 2018 Springer International Publishing AG, part of Springer Nature
Olpin, A.J., Dara, R., Stacey, D., Kashkoush, M. (2018). Region-Based Convolutional Networks for End-to-End Detection of Agricultural Mushrooms. In: Mansouri, A., El Moataz, A., Nouboud, F., Mammass, D. (eds) Image and Signal Processing. ICISP 2018. Lecture Notes in Computer Science(), vol 10884. Springer, Cham. https://doi.org/10.1007/978-3-319-94211-7_35