A novel directional object detection method for piled objects using a hybrid region-based convolutional neural network

https://doi.org/10.1016/j.aei.2021.101448

Abstract

Digital transformation is an information technology (IT) process that integrates digital information with operating processes. Its introduction to the workplace can promote the development of progressively efficient manufacturing processes, accelerating competition in terms of speed and production capacity. Equipment combined with computer vision has begun to replace manpower in certain industries, including manufacturing. However, current object detection methods are unable to identify the actual rotation angle of a specific grasping target when objects are piled. Hence, this study proposes a framework based on deep learning that integrates two object detection models. Faster R-CNN (region-based convolutional neural network) is utilized to search for the direction reference point of the target, and Mask R-CNN is adopted to obtain the segmentation, which not only forms the basis of an area filter but also generates a rotated bounding box via the minAreaRect function. After integrating the output of the two models, the location and actual rotation angle of the target can be obtained. The purpose of this research is to provide a robot arm with the position and angle information of the topmost object for grasping. An empirical dataset of piled footwear insoles was employed to test the proposed method during the assembly process. Results show that the detection accuracy reached 96.26%. Implementing the proposed method in the manufacturing process not only saves the manpower otherwise required to sort products but also reduces process time, thereby enlarging production capacity. The proposed method can serve as part of a smart manufacturing system to enhance an enterprise’s competitiveness in the future.

Introduction

Paralleling the advance of technology, many factories have gradually introduced automation equipment to minimize the burdens of production, reduce human error, increase production capacity, and lower costs. Among these automations, the use of robotic arms to grasp specific objects presents a classic problem associated with the object detection task in computer vision. As it happens, piled and irregularly arranged objects are very common in real industrial scenarios. For example, certain processes utilize injection molding machines to manufacture products with identical specifications. These products are often stacked at slightly different angles when placed onto the collection platform. Consequently, an additional process to rearrange these objects is needed, which increases both the manpower required and the cycle time of the manufacturing process. Manufacturers have used two different methods to grasp such targets at the correct angle from a group of piled objects. One is to dispatch personnel to pick up the targets manually. As working time increases, workers' efficiency decreases quickly because of physical fatigue; the accuracy of identifying and rectifying problems drops, and the related costs (such as time cost or rework cost) increase. The second method, incorporating automation, is to engage computer vision technology to help a robotic arm grasp the target objects. Compared to the first method, the second enables more stable overall performance at a relatively low and consistent cost. A shoe manufacturer faces exactly this problem: it mass-produces insoles with injection molding machines on its production line and must allocate extra workers to sort the output. Thus, we applied a deep learning method with computer vision to achieve automation, reducing process time, enlarging production capacity, and further enhancing the enterprise’s competitiveness.

In order to achieve certain goals involving artificial intelligence, researchers have developed a systematic approach that can automatically extract information from raw data, learn the features of that information, and enable the system to make judgements about it. This is known as machine learning [12]. Deep learning is a popular branch of machine learning [51]. It involves algorithms that use artificial neural networks as a framework to analyze data [41]. Because deep learning can automatically extract and learn features from data through neurons in a network’s hidden layers, it can be very effective at analyzing data [16]. Deep learning methods exhibit good performance and can be applied in many fields, such as computer vision [33] and natural language processing [2], [7]. Deep learning is now considered one of the main enablers of digital transformation in many industries. It has been widely employed in companies to improve productivity and competitiveness while helping accelerate digital transformation. Using the data collected from smart devices, digital transformation can easily be adapted for use in the Industry 4.0 era [40].

Computer vision has a wide range of applications in real life, such as image recognition [6], action recognition [33], object localization, image restoration, tracking [20], and motion analysis [9]. To train a robot arm to accurately grasp an object, we need to explore the problem of object detection—that is, combining object recognition with precise localization [45]. Deep learning models can be used to address this challenge. Krizhevsky et al. [22] proposed AlexNet, a convolutional neural network (CNN) that won the 2012 ImageNet competition and marked a major milestone in deep learning. Since then, deep learning models for images have developed rapidly, and more and more researchers have conducted investigations related to object detection. Current object detection models can be divided into two types, namely “two-stage detectors” and “one-stage detectors”. Girshick et al. [11] first proposed a region-based convolutional neural network (R-CNN) machine-learning model to help solve the object detection problem. Since then, many related algorithms have been developed based on this model, including Fast R-CNN [10], Faster R-CNN [37], and Feature Pyramid Networks [26]. These models extract the candidate regions of the target through a neural network or algorithm at the beginning of the detection process and then use another neural network for classification, which places them in the two-stage detector category. The other model types complete both recognition and localization in one neural network and are thus recognized as one-stage detectors, such as YOLO (you only look once) [35], SSD (single shot MultiBox detector) [28], and RetinaNet [27]. Both types of object detection algorithms (two-stage and one-stage) have advantages and disadvantages owing to their architectures.
Sultana et al. [44] reviewed recent object detection models based on convolutional neural networks and obtained the results shown in Table 1. The two-stage detector has higher object recognition and positioning accuracy; however, its inference speed is slower than that of the one-stage detector because it must first propose candidate regions through an algorithm [18]. One main goal of this research, in addition to obtaining the bounding boxes of objects, was to calculate the rotation angle of each detected object, so detection accuracy matters more to us than inference speed. Although Faster R-CNN is not the fastest solution, its architecture is similar to that of Mask R-CNN; we therefore believe the two can be further integrated into one model in the future to shorten detection time and save computer memory. Thus, we select Faster R-CNN, a two-stage method, as the detection model for the direction reference point in our investigation.

Several models for object detection have been mentioned. Most of them output horizontally aligned bounding boxes, including the R-CNN variants and the YOLO series. While many object detection models for rotated bounding boxes have been proposed, determining how to grasp a target at the correct angle when objects are piled has remained a challenge. Most of this research has been applied to aerial images, in which objects are not stacked. For example, Bhat [5] developed a YOLO-based model that incorporates angle into the loss calculation. Zhong and Ao [57] adopted a new rotation-decoupled anchor matching strategy on an FPN-based architecture to detect arbitrarily oriented targets. When two overlapping objects are detected, however, such models cannot recognize which one is on top and can be picked up. We utilized Mask R-CNN to solve this problem. The inference results of Mask R-CNN provide a pixel-level mask drawn along the contours of each object. We used this property to determine the unobstructed objects in the image and to obtain rotated bounding boxes. Additionally, detected masks with smaller areas are discarded because the corresponding objects may be covered by other objects.

We propose a framework based on deep learning that integrates two object detection models. Faster R-CNN is utilized to search for the direction reference point of the object, and Mask R-CNN is adopted to obtain the segmentation, which not only forms the basis of an area filter but also generates a rotated bounding box. The proposed framework can identify relatively complete objects and assess their rotation angles for grasping among piled objects. The proposed artificial intelligence (AI) approach, based on deep learning, brings several benefits by analyzing digitalized image data and using the results to replace manpower with stable robotic devices. It thereby also encourages companies to carry out digital transformation to improve production efficiency and corporate competitiveness. The remainder of this study is organized as follows. In Section 2, we present a literature review to identify the research gap. In Section 3, we describe the methodology and process in greater detail. In Section 4, a case study is presented to validate the proposed method. Finally, we conclude the work and provide directions for future research in Section 5.

Section snippets

Literature review

This section presents a literature review of related work. In Section 2.1, we address the application of deep learning in the industrial area, including defect detection and the prediction of remaining useful life (RUL). In Section 2.2, we introduce the development history of object detection and briefly review common object detection models. Finally, we summarize the shortcomings of these models and compare them with the proposed model.

Methodology

The framework for this research is divided into three stages, as shown in Fig. 3. The first stage involved preparation of the dataset for training the deep learning models, including collecting data and labelling images. In the second stage, image data and annotated files were used to train the Faster R-CNN and Mask R-CNN models. The third stage involved integrating and analyzing the results generated by Faster R-CNN and Mask R-CNN, which enabled identifying the position and rotation angle of
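The third-stage integration can be illustrated with a small sketch. The exact matching rule used by the authors is not given in this snippet, so the following makes two labeled assumptions: each Faster R-CNN reference point is paired with the rotated box whose center is nearest to it (a stand-in for a containment test), and the object's full 0–360° orientation is taken as the direction from the box center toward its reference point, since `minAreaRect` alone only reports an angle within a 90° range.

```python
import math

def object_rotation(box_center, ref_point):
    """Resolve a full 0-360 degree orientation for one object.
    The direction reference point found by Faster R-CNN fixes the
    heading that minAreaRect's ambiguous angle cannot provide.
    Coordinates are image coordinates, where y grows downward;
    the returned angle is measured counter-clockwise from +x."""
    dx = ref_point[0] - box_center[0]
    dy = ref_point[1] - box_center[1]
    # Negate dy because the image y-axis points downward.
    return math.degrees(math.atan2(-dy, dx)) % 360.0

def assign_reference_points(box_centers, ref_points):
    """Pair each rotated-box center with its nearest reference point
    (an illustrative stand-in for checking which box contains it),
    returning (center, angle) tuples for the grasping robot."""
    pairs = []
    for center in box_centers:
        best = min(ref_points,
                   key=lambda p: (p[0] - center[0]) ** 2
                               + (p[1] - center[1]) ** 2)
        pairs.append((center, object_rotation(center, best)))
    return pairs
```

With the center and angle of the topmost, unoccluded object, the robot arm can be commanded to rotate its gripper accordingly before grasping.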

Case study

The focal company for this case study was a shoe original equipment manufacturer (OEM), i.e., a company that produces parts and equipment that may be marketed by another manufacturer. Its shoemaking and development technology leads worldwide competitors and is widely trusted by the world's major leading brands. The company has an annual production capacity of about 17 million pairs of shoes, an annual value of production of about 2.2 billion dollars, and

Conclusion

This research integrates two deep learning models to address the problem of piled objects with irregular arrangement. In a final test involving 30 fine-tuned images, the results show that the grasping accuracy, referring to the outcome generated by integrating Faster R-CNN and Mask R-CNN, achieved a success rate of 96.26% with a reasonable computation time. The main contributions of this research can be divided into academic and practical aspects. Academically, the proposed method integrates

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The authors would like to thank the Ministry of Science and Technology of Taiwan for financially supporting this research under Contract no. MOST 109-2628-E-007-002-MY3.

References (61)

  • X. Yin et al.

    Ensemble deep learning based semi-supervised soft sensor modeling method and its application on quality prediction for coal preparation process

    Adv. Eng. Inform.

    (2020)
  • J.P. Yun et al.

    Automated defect inspection system for metal surfaces based on deep learning and data augmentation

    J. Manuf. Syst.

    (2020)
  • J. Zhang et al.

    Long short-term memory for machine remaining life prediction

    J. Manuf. Syst.

    (2018)
  • N.H. Aung, Y.K. Thu, S.S. Maung, Feature Based Myanmar Fingerspelling Image Classification Using SIFT, SURF and BRIEF,...
  • S. Bacchi et al.

    Deep learning natural language processing successfully predicts the cerebrovascular cause of transient ischemic attack-like presentations

    Stroke

    (2019)
  • B. Benjdira, T. Khursheed, A. Koubaa, A. Ammar, K. Ouni, Car detection using unmanned aerial vehicles: Comparison...
  • A. Bhat, Aerial Object Detection using Learnable Bounding Boxes,...
  • M.C. Chiu et al.

    Applying transfer learning to achieve precision marketing in an omni-channel system–a case study of a sharing kitchen platform

    Int. J. Prod. Res.

    (2021)
  • M.C. Chiu et al.

    An integrative machine learning method to improve fault detection and productivity performance in a cyber-physical system

    J. Comput. Inform. Sci. Eng.

    (2020)
  • S.L. Colyer et al.

    A review of the evolution of vision-based motion analysis and the integration of advanced computer vision methods towards developing a markerless system

    Sports Med.-Open

    (2018)
  • R. Girshick, Fast R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp....
  • R. Girshick et al.

    Rich feature hierarchies for accurate object detection and semantic segmentation

  • I. Goodfellow et al.

    Deep learning

    (2016)
  • D. Guo, F. Sun, H. Liu, T. Kong, B. Fang, N. Xi, A hybrid deep architecture for robotic grasp detection, in: 2017 IEEE...
  • K. He et al.

    Mask R-CNN

  • K. He et al.

    Spatial pyramid pooling in deep convolutional networks for visual recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2015)
  • A.K. Jain et al.

    Artificial neural networks: A tutorial

    Computer

    (1996)
  • Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, Z. Luo, R2CNN: Rotational region CNN for orientation robust...
  • L. Jiao et al.

    A survey of deep learning-based object detection

    IEEE Access

    (2019)
  • H.S. Kang et al.

    Smart manufacturing: Past research, present findings, and future directions

    Int. J. Precision Eng. Manuf.-green Technol.

    (2016)