Full length article
Vision-based method for semantic information extraction in construction by integrating deep learning object detection and image captioning

https://doi.org/10.1016/j.aei.2022.101699

Highlights

  • Proposes a novel method for semantic information extraction from images in construction engineering.

  • Integrates deep learning object detection and image captioning.

  • Achieves a Consensus-based Image Description Evaluation (CIDEr) score of 1.84 in experiments.

  • Develops an algorithm for visualizing construction scene graphs.

Abstract

Recently, vision-based monitoring has been widely adopted in construction management to improve crew productivity, reduce safety risks, and facilitate site planning. However, automated retrieval of semantic information (e.g., objects, activities, and interactions between objects) from construction images remains challenging due to the complex nature of construction sites. This paper proposes a novel semantic information extraction method that integrates deep learning object detection and image captioning to extract salient information from construction images or videos. In the proposed method, an object detection model is employed as the encoder to extract feature maps of construction object regions and of the holistic image, and an image captioning model serves as the decoder that converts these features into semantic information. A post-processing method is proposed to parse the semantic information into a graph format for better accessibility and visualization. In experiments, the proposed method achieved a Consensus-based Image Description Evaluation (CIDEr) score of 1.84. By adopting the proposed method, the semantic information behind construction images can be presented to construction managers to support their decision-making.

Introduction

The construction industry is one of the largest industry sectors in North America, contributing 4.3% of the Gross Domestic Product (GDP) of the United States in 2021 [1]. Cameras have recently become standard equipment in construction engineering, allowing construction professionals to monitor their sites remotely [2], [3], [4]. Analyzing construction images/videos with vision-based methods benefits construction management in terms of improving crew productivity [5], machine health [6], and project environmental performance [7], monitoring progress [8], [9], reducing safety risks [10], and enhancing construction logistics [11]. Extracting semantic information (e.g., objects, activities, and interactions between objects) from construction images is the fundamental step for many vision-based applications in construction management [2], [12], [13], [14], [15].

Object detection is a vision-based technology that extracts pre-defined classes of objects and their location information from construction images; it has been applied to defect detection [16], [17], equipment classification [18], and safety monitoring [19] in construction engineering. However, object detection only provides object category and localization information, which may not be sufficient for advanced construction applications (e.g., activity recognition and interaction analysis) [15]. Therefore, other vision-based technologies have been employed to extract semantic information from images or videos. For example, Kim and Chi [5] added a model based on the Recurrent Neural Network (RNN) on top of an object detector to recognize the activities of excavators. Liu et al. [20] proposed a method to extract semantic information from site images as natural language descriptions. Tang et al. [9] combined object detection with a human-object-interaction recognition module to ground the interaction information of workers onto the image. Moreover, Natural Language Processing (NLP) technologies can play a similar role: when vision techniques cannot directly extract the target information from images, NLP techniques can extract it from human observation reports. For example, researchers have utilized Named Entity Recognition techniques to extract information about safety accidents [21], [22], equipment and labor [23], [24], and relations between entities [25], [26].

Currently, executing separate dedicated models on a site image can extract object, activity, and interaction information, achieving the goal of semantic information extraction [27]. However, executing separate models is time-consuming, and the models may produce inconsistent entity labels because they are trained on different datasets. Moreover, the semantic information extracted by separate models lacks visual connections between the recognized labels and image regions. This visual connection is vital because semantic labels alone cannot provide enough information for analysis or decision-making [10], [28], [29], [30]. For example, to track the activity of equipment or labor, the object's location must be known so that its trajectory can be generated and analyzed [31], [32], [33]. Likewise, an activity sometimes cannot be identified as unsafe unless it happens in a restricted area [27], which means both the location of the object and its activity are required [13], [14]. Therefore, combining object detection and semantic information extraction into an integrated model is a promising way to provide richer information for downstream vision-based analysis and decision-making.

In this research, the authors propose a novel vision-based method that integrates object detection, image captioning, and data post-processing to extract semantic information from construction machine images with visual connections. The method contains a novel process that integrates object detection and image captioning to extract object categories and locations, object activities, and interactions between objects. A novel attention mechanism is added to the integrated model to obtain the visual connection between object detection results and the extracted semantic information. The extracted information can enhance the visualization ability of current methods by providing object category, location, and activity information, and it has the potential to facilitate object tracking and safety management. For example, given the location and activity information, the activity trajectory of a piece of equipment can be generated, and unsafe behavior can be analyzed and identified.
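To make the visual-connection idea concrete, below is a minimal PyTorch sketch of scaled dot-product attention over per-object region features, where the attention weights indicate which detected regions each predicted word attends to. This is an illustrative sketch only; the function and tensor names are assumptions, not the attention mechanism proposed in this paper.

```python
# Illustrative sketch: attending over detected-region features while decoding.
# Not the paper's exact mechanism; names and shapes are assumptions.
import torch
import torch.nn.functional as F

def region_attention(query, region_feats):
    """query: (B, D) decoder state; region_feats: (B, N, D) per-object features.

    Returns the attended context (B, D) and the weights (B, N) that link the
    current predicted word back to the N detected object regions.
    """
    d = region_feats.size(-1)
    scores = torch.bmm(region_feats, query.unsqueeze(2)).squeeze(2)      # (B, N)
    weights = F.softmax(scores / d ** 0.5, dim=1)                        # (B, N)
    context = torch.bmm(weights.unsqueeze(1), region_feats).squeeze(1)   # (B, D)
    return context, weights
```

The returned weights are what provide the visual connection: the region with the largest weight can be highlighted as the image evidence for the word being generated.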

Section snippets

Information extraction from construction images

Existing research on vision-based information retrieval of construction images can be categorized into three types: (1) construction object detection, (2) operation activity recognition, and (3) interaction and scene analysis, which will be reviewed comprehensively in the following subsections.

Methodology

To extract the related semantic information (objects, activities, and interactions), the authors combined an image object detector and a language decoder into an integrated method. Fig. 1 presents the overall architecture of this method. The proposed method is an extension of a typical encoder-decoder image captioning method: in the typical method, the encoder is a CNN that extracts useful semantic features from the whole image, and the decoder then predicts the description words of the image.
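As a concrete illustration of this typical encoder-decoder baseline, the sketch below pairs a CNN encoder over the whole image with an RNN decoder that predicts the description words. It is a minimal sketch under assumed module names and dimensions, not the integrated architecture proposed in this paper (which replaces the plain CNN with an object detection encoder).

```python
# Minimal sketch of a typical encoder-decoder captioning baseline (PyTorch).
# Module names, dimensions, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Extracts a holistic feature vector from the whole image."""
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)  # load pretrained weights in practice
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier
        self.fc = nn.Linear(backbone.fc.in_features, feat_dim)

    def forward(self, images):                   # images: (B, 3, H, W)
        feats = self.cnn(images).flatten(1)      # (B, 2048)
        return self.fc(feats)                    # (B, feat_dim)

class RNNDecoder(nn.Module):
    """Predicts the description words conditioned on the image feature."""
    def __init__(self, vocab_size, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):     # captions: (B, T) token ids
        tokens = self.embed(captions)            # (B, T, feat_dim)
        # Feed the image feature as the first step of the input sequence.
        inputs = torch.cat([image_feat.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                  # (B, T+1, vocab_size) word logits
```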

Experimental setup

The authors trained the detection encoder and the captioning decoder using two separate datasets. The metadata and examples of the datasets are provided in Table 1. To ensure the robustness of the trained models, the authors included data instances from different construction scenarios (environment, viewing angle, weather, etc.). For training the encoder, this study utilized the dataset of Moving Objects in Construction Sites (MOCS) [68]. The MOCS dataset contains 41,668 images collected from 174 construction sites.
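For reproducibility, the snippet below sketches how a COCO-style detection dataset can be loaded with torchvision for encoder training; it assumes the MOCS annotations are available in COCO JSON format, and the file paths shown are hypothetical.

```python
# Minimal sketch of loading a COCO-style detection dataset (requires pycocotools).
# Paths are hypothetical placeholders, not the actual MOCS release layout.
from torchvision.datasets import CocoDetection
import torchvision.transforms as T

transform = T.Compose([T.Resize((800, 800)), T.ToTensor()])
train_set = CocoDetection(
    root="MOCS/train/images",              # hypothetical image directory
    annFile="MOCS/train/instances.json",   # hypothetical COCO-format annotations
    transform=transform,
)
image, targets = train_set[0]  # targets: list of dicts with 'bbox', 'category_id', ...
```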

Feasibility of encoder

We evaluated the object detection and instance segmentation performance of the encoder on the MOCS validation set to assess its feasibility. We use the Mean Average Precision (mAP) to evaluate the performance of object detection and instance segmentation [53].

The evaluation metric is based on average precision (AP). Given a confidence threshold $\alpha$, $\mathrm{AP}_\alpha$ integrates the precision scores at eleven equally spaced recall levels $r$:

$$\mathrm{AP}_\alpha = \frac{1}{11} \sum_{r \in \{0.0,\, 0.1,\, \ldots,\, 1.0\}} \mathrm{Precision}(r)$$

and the $\mathrm{mAP}_\alpha$ is obtained by averaging $\mathrm{AP}_\alpha$ over all object classes.
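The sketch below computes the eleven-point AP and its mean over classes. Note that, as is common practice, it uses the interpolated precision (the maximum precision among operating points with recall >= r) for Precision(r); the function names and the precomputed precision/recall inputs are assumptions for illustration.

```python
# Minimal sketch of eleven-point interpolated AP and class-averaged mAP.
import numpy as np

def ap_11_point(precision, recall):
    """Eleven-point AP over recall levels {0.0, 0.1, ..., 1.0}."""
    precision, recall = np.asarray(precision), np.asarray(recall)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        # Interpolated precision: best precision achievable at recall >= r.
        ap += (precision[mask].max() if mask.any() else 0.0) / 11.0
    return ap

def mean_ap(per_class_pr):
    """mAP: mean of per-class APs; per_class_pr is a list of (precision, recall)."""
    return float(np.mean([ap_11_point(p, r) for p, r in per_class_pr]))
```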

Conclusions and future works

This study presented an integrated information extraction method for on-site images, extracting semantic information such as objects, activities, and interactions. It contains three modules: (1) the object detector-based encoder, which detects the objects in the image and extracts the feature maps of the image and objects; (2) the image captioning decoder, which extracts the semantic information as a natural language sentence according to the feature maps; and (3) the post-processing module, which parses the semantic information into a graph format for better accessibility and visualization.
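To illustrate the post-processing idea, the sketch below parses a generated caption into (subject, relation, object) triples that could populate a scene graph. It is a hypothetical rule-based parser for captions of the template "<subject> is <activity> <preposition> <object>", not the post-processing algorithm developed in this study.

```python
# Hypothetical rule-based caption-to-triples parser; illustrative only.
RELATION_WORDS = {"beside", "near", "behind", "on", "under", "with"}

def caption_to_triples(caption):
    """Return (subject, relation, object) triples from a simple templated caption."""
    tokens = caption.lower().rstrip(".").split()
    triples = []
    if "is" in tokens:
        i = tokens.index("is")
        subject = " ".join(tokens[:i]).replace("an ", "").replace("a ", "")
        triples.append((subject, "activity", tokens[i + 1]))
        for j in range(i + 2, len(tokens)):
            if tokens[j] in RELATION_WORDS:
                obj = " ".join(tokens[j + 1:]).replace("an ", "").replace("a ", "")
                triples.append((subject, tokens[j], obj))
                break
    return triples

print(caption_to_triples("An excavator is digging beside a dump truck."))
# [('excavator', 'activity', 'digging'), ('excavator', 'beside', 'dump truck')]
```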

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by The Hong Kong Polytechnic University under Grant P0040522 and the China Scholarship Council under Grant CSC202007970002. The authors would like to express great appreciation to the volunteers who participated in annotating the construction equipment image caption dataset.

References (81)

  • H. Liu et al.

    Manifesting construction activity scenes via image captioning

    Autom. Constr.

    (2020)
  • A.-J.-P. Tixier et al.

    Automated content analysis for construction safety: a natural language processing system to extract precursors and outcomes from unstructured injury reports

    Autom. Constr.

    (2016)
  • Y. Mo et al.

    Automated staff assignment for building maintenance using natural language processing

    Autom. Constr.

    (2020)
  • W. Fang et al.

    Knowledge graph for identifying hazards on construction sites: Integrating computer vision with ontology

    Autom. Constr.

    (2020)
  • M.D. Martínez-Aires et al.

    Building information modeling and safety management: a systematic review

    Saf. Sci.

    (2018)
  • B.H.W. Guo et al.

    Computer vision technologies for safety science and management in construction: a critical review and future research directions

    Saf. Sci.

    (2021)
  • Z. Zhu et al.

    Integrated detection and tracking of workforce and equipment from construction jobsite videos

    Autom. Constr.

    (2017)
  • M.-W. Park et al.

    Continuous localization of construction workers via integration of detection and tracking

    Autom. Constr.

    (2016)
  • H. Tajeen et al.

    Image dataset development for measuring construction equipment recognition performance

    Autom. Constr.

    (2014)
  • J.C.P. Cheng et al.

    Automated detection of sewer pipe defects in closed-circuit television images using deep learning techniques

    Autom. Constr.

    (2018)
  • D. Kim et al.

    Remote proximity monitoring between mobile construction resources using camera-mounted UAVs

    Autom. Constr.

    (2019)
  • H. Kim et al.

    Analyzing context and productivity of tunnel earthmoving processes using imaging and simulation

    Autom. Constr.

    (2018)
  • J. Kim et al.

    Interaction analysis for vision-based activity identification of earthmoving excavators and dump trucks

    Autom. Constr.

    (2018)
  • M. Golparvar-Fard et al.

    Vision-based action recognition of earthmoving equipment using spatio-temporal features and support vector machine classifiers

    Adv. Eng. Inf.

    (2013)
  • H. Luo et al.

    Convolutional neural networks: Computer vision-based workforce activity assessment in construction

    Autom. Constr.

    (2018)
  • H. Luo et al.

    Full body pose estimation of construction equipment using computer vision and deep learning techniques

    Autom. Constr.

    (2020)
  • J. Cai et al.

    Two-step long short-term memory method for identifying construction activities through positional and attentional cues

    Autom. Constr.

    (2019)
  • J. Cai et al.

    A context-augmented deep learning approach for worker trajectory prediction on unstructured and dynamic construction sites

    Adv. Eng. Inf.

    (2020)
  • H. Kim et al.

    Data-driven scene parsing method for recognizing construction site objects in the whole image

    Autom. Constr.

    (2016)
  • Y. Ham et al.

    Automated content-based filtering for enhanced vision-based documentation in construction toward exploiting big visual data from drones

    Autom. Constr.

    (2019)
  • S. Tang et al.

    Human-object interaction recognition for automatic construction site safety inspection

    Autom. Constr.

    (2020)
  • A. Xuehui et al.

    Dataset and benchmark for detecting moving objects in construction sites

    Autom. Constr.

    (2021)
  • Statista, U.S. construction industry share of GDP 2007-2020, Statista. (n.d.). Available from:...
  • B. Sherafat et al.

    Automated methods for activity recognition of construction workers and equipment: state-of-the-art review

    J. Constr. Eng. Manage.

    (2020)
  • S. Xu et al.

    Computer vision techniques in construction: a critical review

    Arch Comput. Methods Eng.

    (2021)
  • R. Akhavian, A.H. Behzadan, Simulation-based evaluation of fuel consumption in heavy construction projects by...
  • K.M. Rashid et al.

    Automated activity identification for construction equipment using motion data from articulated members

    Front. Built Environ.

    (2020)
  • W. Fang et al.

    Computer vision and deep learning to manage safety in construction: matching images of unsafe behavior and semantic rules

    IEEE Trans. Eng. Manage.

    (2021)
  • Y.-J. Cha et al.

    Deep learning-based crack damage detection using convolutional neural networks

    Comput.-Aided Civ. Infrastruct. Eng.

    (2017)
  • H. Kim et al.

    Detecting construction equipment using a region-based fully convolutional network and transfer learning

    J. Comput. Civil Eng.

    (2018)