Full length article
Vision-based method for semantic information extraction in construction by integrating deep learning object detection and image captioning

https://doi.org/10.1016/j.aei.2022.101699

Highlights

  • Proposes a novel method for semantic information extraction from images in construction engineering.

  • Integrates deep learning object detection and image captioning.

  • Achieves a Consensus-based Image Description Evaluation (CIDEr) score of 1.84 in experiments.

  • Develops an algorithm for visualizing construction scene graphs.

Abstract

Recently, vision-based monitoring has been widely adopted in construction management to improve crew productivity, reduce safety risks, and facilitate site planning. However, automated retrieval of semantic information (e.g., objects, activities, and interactions between objects) from construction images remains challenging due to the complex nature of construction sites. This paper proposes a novel semantic information extraction method that integrates deep learning object detection and image captioning to extract salient information from construction images or videos. In the proposed method, an object detection model is employed as the encoder to extract feature maps of construction object regions and of the holistic image, and an image captioning model serves as the decoder that converts these features into semantic information. A post-processing method is proposed to parse the semantic information into a graph format for better accessibility and visualization. In experiments, the proposed method achieved a Consensus-based Image Description Evaluation (CIDEr) score of 1.84. By adopting the proposed method, the semantic information behind construction images can be presented to construction managers to support their decision-making.

Introduction

The construction industry is one of the largest industry sectors in North America, contributing 4.3% of the Gross Domestic Product (GDP) of the United States in 2021 [1]. Cameras have recently become standard equipment in construction engineering, allowing construction professionals to monitor their sites remotely [2], [3], [4]. Analyzing construction images/videos with vision-based methods benefits construction management in terms of improving crew productivity [5], machine health [6], and project environmental performance [7], monitoring progress [8], [9], reducing safety risks [10], and enhancing construction logistics [11]. Extracting semantic information (e.g., objects, activities, and interactions between objects) from construction images is the fundamental step for many vision-based applications in construction management [2], [12], [13], [14], [15].

Object detection is a vision-based technology that extracts pre-defined classes of objects and their location information from construction images; it has been applied to defect detection [16], [17], equipment classification [18], and safety monitoring [19] in construction engineering. However, object detection only provides object category and localization information, which may not be sufficient for advanced construction applications (e.g., activity recognition and interaction analysis) [15]. Therefore, other vision-based technologies have been employed to extract semantic information from images or videos. For example, Kim and Chi [5] added a model based on the Recurrent Neural Network (RNN) on top of an object detector to recognize the activities of excavators. Liu et al. [20] proposed a method to extract semantic information from site images as natural language descriptions. Tang et al. [9] combined object detection with a human-object-interaction recognition module to ground the interaction information of workers onto the image. Moreover, Natural Language Processing (NLP) technologies can play a similar role: when vision techniques cannot directly extract the target information from images, NLP techniques can extract it from human observation reports. For example, researchers have utilized Named Entity Recognition techniques to extract information about safety accidents [21], [22], equipment and labor [23], [24], and relations between entities [25], [26].

Currently, executing separate dedicated models on a site image can extract object, activity, and interaction information, achieving the goal of semantic information extraction [27]. However, executing separate models is time-consuming, and the models may produce inconsistent entity labels because they are trained on different datasets. Moreover, the semantic information extracted by separate models lacks visual connections between the recognized labels and image regions. This visual connection is vital because semantic labels alone cannot provide enough information for analysis or decision-making [10], [28], [29], [30]. For example, to track the activity of equipment or labor, the object's location must be known so that its trajectory can be generated and analyzed [31], [32], [33]. Likewise, an activity sometimes cannot be identified as unsafe unless it happens in a restricted area [27], which means both the location of the object and its activity are required [13], [14]. Therefore, combining object detection and semantic information extraction into an integrated model is a promising way to provide richer information for downstream vision-based analysis and decision-making.

In this research, the authors propose a novel vision-based method that integrates object detection, image captioning, and data post-processing to extract semantic information from construction machine images with visual connections. The method contains a novel process that integrates object detection and image captioning to extract object categories and locations, object activities, and interactions between objects. A novel attention mechanism is added to the integrated model to obtain the visual connection between object detection results and the extracted semantic information. The extracted information can enhance the visualization ability of current methods by providing object category, location, and activity information, and it has the potential to facilitate object tracking and safety management. For example, given the location and activity information, the activity trajectory of a piece of equipment can be generated, and unsafe behavior can be analyzed and identified.
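To make the visual-connection idea concrete, below is a minimal PyTorch sketch of scaled dot-product attention over per-object region features, where the attention weights indicate which detected regions each predicted word attends to. This is an illustrative sketch only; the function and tensor names are assumptions, not the attention mechanism proposed in this paper.

```python
# Illustrative sketch: attending over detected-region features while decoding.
# Not the paper's exact mechanism; names and shapes are assumptions.
import torch
import torch.nn.functional as F

def region_attention(query, region_feats):
    """query: (B, D) decoder state; region_feats: (B, N, D) per-object features.

    Returns the attended context (B, D) and the weights (B, N) that link the
    current predicted word back to the N detected object regions.
    """
    d = region_feats.size(-1)
    scores = torch.bmm(region_feats, query.unsqueeze(2)).squeeze(2)      # (B, N)
    weights = F.softmax(scores / d ** 0.5, dim=1)                        # (B, N)
    context = torch.bmm(weights.unsqueeze(1), region_feats).squeeze(1)   # (B, D)
    return context, weights
```

The returned weights are what provide the visual connection: the region with the largest weight can be highlighted as the image evidence for the word being generated.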

Section snippets

Information extraction from construction images

Existing research on vision-based information retrieval of construction images can be categorized into three types: (1) construction object detection, (2) operation activity recognition, and (3) interaction and scene analysis, which will be reviewed comprehensively in the following subsections.

Methodology

To extract the related semantic information (objects, activities, and interactions), the authors combined an image object detector and a language decoder into an integrated method. Fig. 1 presents the overall architecture of this method. The proposed method is an extension of a typical encoder-decoder image captioning method: in the typical method, the encoder is a CNN that extracts useful semantic features from the whole image, and the decoder then predicts the description words of the image.
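As a concrete illustration of this typical encoder-decoder baseline, the sketch below pairs a CNN encoder over the whole image with an RNN decoder that predicts the description words. It is a minimal sketch under assumed module names and dimensions, not the integrated architecture proposed in this paper (which replaces the plain CNN with an object detection encoder).

```python
# Minimal sketch of a typical encoder-decoder captioning baseline (PyTorch).
# Module names, dimensions, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Extracts a holistic feature vector from the whole image."""
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)  # load pretrained weights in practice
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier
        self.fc = nn.Linear(backbone.fc.in_features, feat_dim)

    def forward(self, images):                   # images: (B, 3, H, W)
        feats = self.cnn(images).flatten(1)      # (B, 2048)
        return self.fc(feats)                    # (B, feat_dim)

class RNNDecoder(nn.Module):
    """Predicts the description words conditioned on the image feature."""
    def __init__(self, vocab_size, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):     # captions: (B, T) token ids
        tokens = self.embed(captions)            # (B, T, feat_dim)
        # Feed the image feature as the first step of the input sequence.
        inputs = torch.cat([image_feat.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                  # (B, T+1, vocab_size) word logits
```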

Experimental setup

The authors trained the detection encoder and the captioning decoder using two separate datasets. The metadata and examples of the datasets are provided in Table 1. To ensure the robustness of the trained models, the authors included data instances from different construction scenarios (environment, viewing angle, weather, etc.). For training the encoder, this study utilized the dataset of Moving Objects in Construction Sites (MOCS) [68]. The MOCS dataset contains 41,668 images collected from 174 construction sites.
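For reproducibility, the snippet below sketches how a COCO-style detection dataset can be loaded with torchvision for encoder training; it assumes the MOCS annotations are available in COCO JSON format, and the file paths shown are hypothetical.

```python
# Minimal sketch of loading a COCO-style detection dataset (requires pycocotools).
# Paths are hypothetical placeholders, not the actual MOCS release layout.
from torchvision.datasets import CocoDetection
import torchvision.transforms as T

transform = T.Compose([T.Resize((800, 800)), T.ToTensor()])
train_set = CocoDetection(
    root="MOCS/train/images",              # hypothetical image directory
    annFile="MOCS/train/instances.json",   # hypothetical COCO-format annotations
    transform=transform,
)
image, targets = train_set[0]  # targets: list of dicts with 'bbox', 'category_id', ...
```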

Feasibility of encoder

We evaluated the object detection and instance segmentation performance of the encoder on the MOCS validation set to assess its feasibility. We use the Mean Average Precision (mAP) to evaluate the performance of object detection and instance segmentation [53].

The evaluation metric is based on average precision (AP). Given a confidence threshold $\alpha$, $\mathrm{AP}_\alpha$ integrates the precision scores at eleven equally spaced recall levels $r$:

$$\mathrm{AP}_\alpha = \frac{1}{11} \sum_{r \in \{0.0,\, 0.1,\, \ldots,\, 1.0\}} \mathrm{Precision}(r)$$

and the $\mathrm{mAP}_\alpha$ is obtained by averaging $\mathrm{AP}_\alpha$ over all object classes.
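The sketch below computes the eleven-point AP and its mean over classes. Note that, as is common practice, it uses the interpolated precision (the maximum precision among operating points with recall >= r) for Precision(r); the function names and the precomputed precision/recall inputs are assumptions for illustration.

```python
# Minimal sketch of eleven-point interpolated AP and class-averaged mAP.
import numpy as np

def ap_11_point(precision, recall):
    """Eleven-point AP over recall levels {0.0, 0.1, ..., 1.0}."""
    precision, recall = np.asarray(precision), np.asarray(recall)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        # Interpolated precision: best precision achievable at recall >= r.
        ap += (precision[mask].max() if mask.any() else 0.0) / 11.0
    return ap

def mean_ap(per_class_pr):
    """mAP: mean of per-class APs; per_class_pr is a list of (precision, recall)."""
    return float(np.mean([ap_11_point(p, r) for p, r in per_class_pr]))
```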

Conclusions and future works

This study presented an integrated information extraction method for on-site images, extracting semantic information such as objects, activities, and interactions. It contains three modules: (1) the object detector-based encoder, which detects the objects in the image and extracts the feature maps of the image and objects; (2) the image captioning decoder, which extracts the semantic information as a natural language sentence according to the feature maps; and (3) the post-processing module, which parses the semantic information into a graph format for better accessibility and visualization.
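To illustrate the post-processing idea, the sketch below parses a generated caption into (subject, relation, object) triples that could populate a scene graph. It is a hypothetical rule-based parser for captions of the template "<subject> is <activity> <preposition> <object>", not the post-processing algorithm developed in this study.

```python
# Hypothetical rule-based caption-to-triples parser; illustrative only.
RELATION_WORDS = {"beside", "near", "behind", "on", "under", "with"}

def caption_to_triples(caption):
    """Return (subject, relation, object) triples from a simple templated caption."""
    tokens = caption.lower().rstrip(".").split()
    triples = []
    if "is" in tokens:
        i = tokens.index("is")
        subject = " ".join(tokens[:i]).replace("an ", "").replace("a ", "")
        triples.append((subject, "activity", tokens[i + 1]))
        for j in range(i + 2, len(tokens)):
            if tokens[j] in RELATION_WORDS:
                obj = " ".join(tokens[j + 1:]).replace("an ", "").replace("a ", "")
                triples.append((subject, tokens[j], obj))
                break
    return triples

print(caption_to_triples("An excavator is digging beside a dump truck."))
# [('excavator', 'activity', 'digging'), ('excavator', 'beside', 'dump truck')]
```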

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by The Hong Kong Polytechnic University under Grant P0040522 and the China Scholarship Council under Grant CSC202007970002. The authors would like to express great appreciation to the volunteers who participated in annotating the construction equipment image caption dataset.

References (81)

  • H. Liu et al.

    Manifesting construction activity scenes via image captioning

    Autom. Constr.

    (2020)
  • A.-J.-P. Tixier et al.

    Automated content analysis for construction safety: a natural language processing system to extract precursors and outcomes from unstructured injury reports

    Autom. Constr.

    (2016)
  • Y. Mo et al.

    Automated staff assignment for building maintenance using natural language processing

    Autom. Constr.

    (2020)
  • W. Fang et al.

    Knowledge graph for identifying hazards on construction sites: Integrating computer vision with ontology

    Autom. Constr.

    (2020)
  • M.D. Martínez-Aires et al.

    Building information modeling and safety management: a systematic review

    Saf. Sci.

    (2018)
  • B.H.W. Guo et al.

    Computer vision technologies for safety science and management in construction: a critical review and future research directions

    Saf. Sci.

    (2021)
  • Z. Zhu et al.

    Integrated detection and tracking of workforce and equipment from construction jobsite videos

    Autom. Constr.

    (2017)
  • M.-W. Park et al.

    Continuous localization of construction workers via integration of detection and tracking

    Autom. Constr.

    (2016)
  • H. Tajeen et al.

    Image dataset development for measuring construction equipment recognition performance

    Autom. Constr.

    (2014)
  • J.C.P. Cheng et al.

    Automated detection of sewer pipe defects in closed-circuit television images using deep learning techniques

    Autom. Constr.

    (2018)
  • D. Kim et al.

    Remote proximity monitoring between mobile construction resources using camera-mounted UAVs

    Autom. Constr.

    (2019)
  • H. Kim et al.

    Analyzing context and productivity of tunnel earthmoving processes using imaging and simulation

    Autom. Constr.

    (2018)
  • J. Kim et al.

    Interaction analysis for vision-based activity identification of earthmoving excavators and dump trucks

    Autom. Constr.

    (2018)
  • M. Golparvar-Fard et al.

    Vision-based action recognition of earthmoving equipment using spatio-temporal features and support vector machine classifiers

    Adv. Eng. Inf.

    (2013)
  • H. Luo et al.

    Convolutional neural networks: Computer vision-based workforce activity assessment in construction

    Autom. Constr.

    (2018)
  • H. Luo et al.

    Full body pose estimation of construction equipment using computer vision and deep learning techniques

    Autom. Constr.

    (2020)
  • J. Cai et al.

    Two-step long short-term memory method for identifying construction activities through positional and attentional cues

    Autom. Constr.

    (2019)
  • J. Cai et al.

    A context-augmented deep learning approach for worker trajectory prediction on unstructured and dynamic construction sites

    Adv. Eng. Inf.

    (2020)
  • H. Kim et al.

    Data-driven scene parsing method for recognizing construction site objects in the whole image

    Autom. Constr.

    (2016)
  • Y. Ham et al.

    Automated content-based filtering for enhanced vision-based documentation in construction toward exploiting big visual data from drones

    Autom. Constr.

    (2019)
  • S. Tang et al.

    Human-object interaction recognition for automatic construction site safety inspection

    Autom. Constr.

    (2020)
  • A. Xuehui et al.

    Dataset and benchmark for detecting moving objects in construction sites

    Autom. Constr.

    (2021)
  • Statista, U.S. construction industry share of GDP 2007-2020, Statista. (n.d.). Available from:...
  • B. Sherafat et al.

    Automated methods for activity recognition of construction workers and equipment: state-of-the-art review

    J. Constr. Eng. Manage.

    (2020)
  • S. Xu et al.

    Computer vision techniques in construction: a critical review

    Arch Comput. Methods Eng.

    (2021)
  • R. Akhavian, A.H. Behzadan, Simulation-based evaluation of fuel consumption in heavy construction projects by...
  • K.M. Rashid et al.

    Automated activity identification for construction equipment using motion data from articulated members

    Front. Built Environ.

    (2020)
  • W. Fang et al.

    Computer vision and deep learning to manage safety in construction: matching images of unsafe behavior and semantic rules

    IEEE Trans. Eng. Manage.

    (2021)
  • Y.-J. Cha et al.

    Deep learning-based crack damage detection using convolutional neural networks

    Comput.-Aided Civ. Infrastruct. Eng.

    (2017)
  • H. Kim et al.

    Detecting construction equipment using a region-based fully convolutional network and transfer learning

    J. Comput. Civil Eng.

    (2018)