Part-of-speech tagging of building codes empowered by deep learning and transformational rules

https://doi.org/10.1016/j.aei.2020.101235

Abstract

Automated building code compliance checking systems have been under development for many years. However, the excessive amount of human input needed to convert building codes from natural language to computer-understandable formats has severely limited the range of code requirements they can check. To address this, automated code compliance checking systems need to support automated regulatory rule conversion. Accurate Part-of-Speech (POS) tagging of building code texts is crucial to this conversion. Previous experiments showed that state-of-the-art generic POS taggers do not perform well on building codes. In view of that, the authors propose a new POS tagger tailored to building codes. It combines a deep learning neural network model with error-driven transformational rules. The neural network model contains a pre-trained model and one or more trainable neural layers. The pre-trained model was fine-tuned on Part-of-Speech Tagged Building Codes (PTBC), a POS-tagged building codes dataset. Fine-tuning the pre-trained model allows the proposed POS tagger to reach high precision with a small amount of available training data. Error-driven transformational rules were used to boost performance further by fixing errors made by the neural network model in the tagged building codes. Through experimental testing, the authors found a well-performing POS tagger for building codes that had one bi-directional LSTM trainable layer, used the BERT_Cased_Base pre-trained model, and was trained for 50 epochs. This model reached 91.89% precision without error-driven transformational rules and 95.11% precision with them, outperforming the 89.82% precision achieved by state-of-the-art POS taggers.

Introduction

Efforts to automate code compliance checking started more than half a century ago, when Fenves (1966) developed decision tables to automatically check the design of steel structures [1]. The success of compliance checking decision tables inspired more research in this area. Examples include a computer-aided design (CAD) system for 2D and 3D steel structures called STEEL-3D [2], an expert system for reinforced concrete design [3], a rule-based application for structural members [4], and a knowledge-based system for multiple building codes [5]. More advanced code compliance checking software was then developed. The Construction and Real Estate Network (CORENET) by the Singapore Building Construction Authority was capable of checking 3D industry foundation classes (IFC) data models [6]. The Express Data Manager (EDM) Suite by Jotne EPM Technology allowed code checking on Building Information Modeling (BIM) data [7]. The BCAider by the Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Australia enabled automatic compliance checking against the Building Code of Australia (BCA) [8]. The Solibri Model Checker (SMC), a BIM-powered automated code compliance checking system by Solibri, achieved rule-based code compliance checking through user-customized plugins [9]. Patlakas et al. developed a BIM-based system to check the code compliance of timber structure designs automatically [10]. Fang et al. developed a deep learning-based method to automatically check whether a site worker complies with the code of their certification [11]. The combination of BIM and automated code compliance checking systems increases the theoretical benefit of BIM in the construction industry. However, according to a survey by Smits et al. (2017), the actual benefit of implementing BIM in construction projects is still limited [12]. The authors suggest that the narrow range of checkable codes of most recent automated code compliance checking tools may limit the actual benefit of BIM.
Even within this narrow range, the checkable codes are usually oversimplified. Oversimplified codes cannot adequately support increasing project complexity and the creativity of designers and, therefore, could negatively affect the benefit of adopting BIM for users and owners [13].

The narrow range of checkable codes also limits the wide application of these automated code compliance checking systems. Extending the range of checkable building code requirements has therefore emerged as an urgent need in the development of automated code compliance checking systems. Natural Language Processing (NLP) powered by Part-of-Speech (POS) tagging has been proposed to automate the extraction of building code requirements [30], thereby extending the range of checkable building codes of automated code compliance checking systems and reducing the manual effort needed in such extraction [14], [15], [16]. NLP and deep learning have many applications in the Architecture, Engineering, and Construction (AEC) industry [21]. For example, Fang et al. developed a text classification method with deep learning to spot near misses in safety reports [17]. Zhong et al. used a deep learning method to classify building quality problems [18]. Trappey et al. used an attention mechanism to generate summaries of engineering patents [19]. Such NLP-based approaches achieved high performance, but POS tagging errors were identified as one major source of error of the whole system. Accurately POS-tagged building codes are needed to support such NLP-based automated building code compliance checking. Existing generic POS taggers, however, cannot provide such high accuracy when processing building codes [20].

The authors therefore propose a new POS tagger tailored to building codes. The intent of the study is to improve the accuracy of POS tagging on building codes. Accurate POS tagging results are needed to support successful processing of code requirements for accurate automated code compliance checking. The proposed POS tagger combines a neural network model with error-driven transformational rules. Together, these two components make the proposed POS tagger outperform the state of the art: it reached 95.11% accuracy, which is higher than the 89.82% accuracy achieved by the state of the art.
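To put these figures in perspective, the short sketch below converts the reported scores into error rates and computes the share of errors removed. It assumes that the reported percentages can be read as token-level accuracy, so that 100 minus the score is the error rate; the function name and constants are illustrative and do not come from the original study.

```python
def relative_error_reduction(baseline_pct: float, improved_pct: float) -> float:
    """Return the share of the baseline's errors that was removed, in percent.

    Assumption: both inputs are token-level accuracy percentages, so the
    error rate is simply 100 minus the score.
    """
    baseline_error = 100.0 - baseline_pct
    improved_error = 100.0 - improved_pct
    return 100.0 * (baseline_error - improved_error) / baseline_error


# Scores reported in the abstract and introduction.
STATE_OF_THE_ART = 89.82  # best generic POS tagger
NEURAL_ONLY = 91.89       # proposed tagger without transformational rules
WITH_RULES = 95.11        # proposed tagger with transformational rules

vs_baseline = relative_error_reduction(STATE_OF_THE_ART, WITH_RULES)
from_rules = relative_error_reduction(NEURAL_ONLY, WITH_RULES)

print(f"Errors removed relative to the state of the art: {vs_baseline:.1f}%")
print(f"Errors removed by the transformational rules:    {from_rules:.1f}%")
```

Under this reading, the error rate falls from 10.18% to 4.89%, which is roughly half of the errors made by the state of the art, and the transformational rules alone remove roughly 40% of the errors left by the neural network model.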

In practice, this POS tagger plays an important role in NLP-based automated code compliance checking system frameworks similar to [14] (Fig. 1), and in NLP-based automation systems in the AEC domain in general [22]. This research can boost the accuracy of POS tagging and therefore support automated building code compliance checking systems and NLP-based systems in the AEC domain. Accurate POS tagging of building codes is vital to high performance in extracting the engineering knowledge embedded in the building codes. The background automated code compliance checking system framework in Fig. 1 contains an automated regulatory information extraction component (which uses a POS tagger) that converts building code requirements to logic clauses, an automated building design information extraction component that extracts building design information from Building Information Models (BIMs), and an automated reasoning component that outputs the code compliance report. The automated regulatory information extraction component can use the proposed POS tagger, which is illustrated in Fig. 3. This system is fully automated from the end user’s perspective. The automated building code compliance checking system takes a rule-based approach to extract information from building codes automatically. Although the POS tagger uses a neural network model, whose training is probabilistic, the POS tagger that results from the training is deterministic: the weights of the neural network are fixed after training, so applying the POS tagger yields deterministic results. Therefore, with a robust POS tagger and other well-performing components, the NLP-based automated building code compliance checking system has a better chance of detecting all noncompliance cases in a building design without intervention from the user.
Due to the imperfect (i.e., less than 100%) precision and recall of state-of-the-art NLP-based building code compliance checking systems, some manual intervention will still be needed to fix errors in the extraction results of the engineering knowledge embedded in the building codes. Such manual intervention is expected from the developers, not from end users. In addition, the amount of manual effort needed to fix automatic extraction errors is minor compared to that needed in a manual extraction. In this paper, the authors propose to boost the performance of NLP-based automated code compliance checking systems by providing more accurate POS tagging results to such systems.

The remainder of this paper is organized as follows. Section 2 explains the technical details of part-of-speech tagging, error-driven transformational rules, recurrent neural networks, and the computing techniques used in this research to avoid overfitting. Section 3 describes the proposed POS tagger. Section 4 presents the experiment to test the performance of the proposed POS tagger. Section 5 illustrates and discusses the results of the experiment. Finally, Sections 6, 7, and 8 present the contributions to the body of knowledge, the limitations and future work, and the conclusion of this research, respectively.

Section snippets

Part-of-Speech

A word’s POS category provides its syntactic information in a sentence [24]. In English, there are eight main POS categories: (1) noun, (2) verb, (3) adjective, (4) adverb, (5) pronoun, (6) preposition, (7) conjunction, and (8) interjection. POS taggers are systems that automatically assign POS categories to words according to their contextual information in a sentence [25]. POS taggers have a variety of applications in the AEC domain. For example, Le et al. POS tagged construction contracts to

Methodology

To develop a POS tagger tailored to building codes, the authors combined multiple state-of-the-art techniques such as error-driven transformational rules, recurrent neural networks, dropout layers, and pre-trained models. At its core, the proposed POS tagger has two main components: a neural network model and a set of error-driven transformational rules. The neural network model initially predicts the POS tag of a word. The error-driven transformational rules fix errors made by the neural
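The two-stage idea described in this section, where a model predicts tags first and rules then correct them, can be sketched as follows. This is a minimal illustration in the spirit of Brill-style transformation-based tagging, not the authors' implementation: the `Rule` fields, the example rule, and the example tags are all hypothetical, and the neural network's output is represented by a hard-coded list of predicted tags.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: retag `from_tag` as `to_tag` when the context matches."""
    from_tag: str
    to_tag: str
    prev_tag: Optional[str] = None  # required tag of the previous token, if set
    word: Optional[str] = None      # required lowercased word, if set

    def matches(self, words: List[str], tags: List[str], i: int) -> bool:
        if tags[i] != self.from_tag:
            return False
        if self.word is not None and words[i].lower() != self.word:
            return False
        if self.prev_tag is not None and (i == 0 or tags[i - 1] != self.prev_tag):
            return False
        return True


def apply_rules(words: List[str], tags: List[str], rules: List[Rule]) -> List[str]:
    """Apply each rule in order, left to right, to a copy of the predicted tags."""
    out = list(tags)
    for rule in rules:
        for i in range(len(words)):
            if rule.matches(words, out, i):
                out[i] = rule.to_tag
    return out


def net_gain(rule: Rule, words: List[str], predicted: List[str], gold: List[str]) -> int:
    """Errors fixed minus errors introduced by one rule on gold-tagged data.

    An error-driven learner would keep candidate rules with a positive gain.
    """
    fixed = apply_rules(words, predicted, [rule])
    before = sum(p != g for p, g in zip(predicted, gold))
    after = sum(f != g for f, g in zip(fixed, gold))
    return before - after


# Illustrative data only: these tags are not taken from the PTBC dataset.
words = ["The", "means", "of", "egress"]
predicted = ["DT", "VBZ", "IN", "NN"]  # stand-in for the neural model's output
gold = ["DT", "NNS", "IN", "NN"]

# Hypothetical rule: a verb tag directly after a determiner becomes a plural noun.
rule = Rule(from_tag="VBZ", to_tag="NNS", prev_tag="DT")

corrected = apply_rules(words, predicted, [rule])
print(corrected, net_gain(rule, words, predicted, gold))
```

In this sketch a rule is judged by how many tagging errors it removes on annotated data, which is what makes the rule set error-driven; the templates and selection procedure actually used in the study may differ.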

Textual data

The proposed POS tagger was trained on the POS tagged building codes (PTBC) dataset [71], a dataset that consists of 1522 POS-tagged sentences from chapters 5 and 10 of the 2015 International Building Code (IBC). In total, the PTBC dataset has 39,875 tokens. A token is the smallest unit in POS tagging, such as a word or a punctuation mark. For example, the word “means” and the period are two tokens in the sentence “The means of egress shall have a ceiling height of not less than 7 feet 6 inches.” which
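The notion of a token can be illustrated with a minimal sketch that splits the example sentence above into words, numbers, and punctuation marks. The regular-expression tokenizer here is an assumption made for illustration; the tokenization scheme actually used for the PTBC dataset may treat some cases differently.

```python
import re
from typing import List


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into word/number tokens and single punctuation tokens.

    Assumption: a simple regex tokenizer, not the one used to build PTBC.
    """
    return re.findall(r"\w+|[^\w\s]", sentence)


sentence = (
    "The means of egress shall have a ceiling height of "
    "not less than 7 feet 6 inches."
)
tokens = tokenize(sentence)

print(tokens)
print(len(tokens))
```

With this simple tokenizer, the word "means" and the final period each appear as a separate token, matching the description above.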

Results and discussion

To find a well-performing combination of training epochs, pre-trained models, and trainable layers to use in the POS tagger, the authors trained 14 models (Table 2). The best-performing POS tagger combined one bi-directional LSTM trainable layer with the BERT_Cased_Base pre-trained model and was trained for 50 epochs. This model (Model 9 in Table 2) reached the highest accuracy after applying transformational rules. The optimization of the deep learning component of this POS tagger is

Contributions to the body of knowledge

This research makes contributions in both theory and practice. Theoretically, it makes two main contributions to the body of knowledge. First, it provides a hybrid deep-learning and rule-based method to enhance the performance of POS taggers on domain-specific texts. The combination of deep learning neural network models and error-fixing transformational rules makes the proposed POS tagger outperform the state-of-the-art POS taggers with a limited amount of training data. Many current state-of-the-art

Limitations and future work

One main limitation of this work is acknowledged: the POS tagger is still not error-free. In spite of its improvement over the state of the art, this POS tagger may still not be accurate enough to support an error-free extraction of the engineering knowledge embedded in building codes. Errors in POS tagging may have a negative effect on the performance of NLP-based automated building code compliance checking systems that leverage it. The authors suggest that research to further increase the accuracy

Conclusion

The ability to provide accurate POS tagging results of building codes paves the way to automated regulatory information extraction and widens the possible range of applicable code requirements of automated code compliance checking systems. The authors proposed a new POS tagger to support such systems. This is the first POS tagger that is tailored to building codes. The POS tagger gained information on general English by incorporating pre-trained deep learning models and captured AEC domain

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The authors would like to thank the National Science Foundation (NSF). This material is based on work supported by the NSF under Grant No. 1827733. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References (78)

  • J. Wang et al., Deep learning for sensor-based activity recognition: A survey, Pattern Recogn. Lett. (2019)
  • J.L. Elman, Finding structure in time, Cogn. Sci. (1990)
  • S.J. Fenves, Tabular decision logic for structural design, J. Struct. Div. (1966)
  • C.I. Pesquera et al., Advanced graphical CAD system for 3D steel frames, Comput. Aid. Design Civil Eng. ASCE (1984)
  • P. Fazio et al., Knowledge-based system development tools for processing design specifications, Comput.-Aided Civ. Infrastruct. Eng. (1988)
  • L. Khemlani, CORENET e-PlanCheck: Singapore's automated code checking system, AECbytes, October,...
  • Q. Yang, IFC-compliant design information modelling and sharing, J. Inf. Technol. Constr. (ITcon) (2003)
  • L. Ding, R. Drogemuller, M. Rosenman, D. Marchant, J. Gero, Automating code checking for building designs-DesignCheck,...
  • W. Smits et al., Yield-to-BIM: impacts of BIM maturity on project performance, Build. Res. Inf. (2017)
  • J.K. Whyte et al., How Digitizing Building Information Transforms the Built Environment (2017)
  • J. Zhang et al., Automated information transformation for automated regulatory compliance checking in construction, J. Comput. Civil Eng. (2015)
  • S. Li et al., Integrating natural language processing and spatial reasoning for utility compliance checking, J. Constr. Eng. Manage. (2016)
  • X. Xue et al., Evaluation of Eight Part-of-Speech Taggers in Tagging Building Codes: Identifying the Best Performing Tagger and Common Sources of Errors (2020)
  • X. Xu et al., Semantic frame-based information extraction from utility regulatory documents to support compliance checking
  • H. Cunningham, GATE, a general architecture for text engineering, Comput. Humanit. (2002)
  • L. Abzianidze, J. Bos, Towards universal semantic tagging, arXiv preprint arXiv:1709.10381,...
  • H. Schmid, Part-of-speech tagging with neural networks, Proceedings of the 15th conference on Computational...
  • J. Lee et al., Effective risk positioning through automated identification of missing contract conditions from the contractor’s perspective based on FIDIC contract cases, J. Manage. Eng. (2020)
  • F.U. Hassan et al., Automated requirements identification from construction contract documents using natural language processing, J. Legal Affairs Dispute Resolut. Eng. Constr. (2020)
  • P. Zhou et al., Automated matching of design information in BIM to regulatory information in energy codes, Constr. Res. Congr. (2018)
  • J. Zhang et al., Semantic NLP-based information extraction from construction regulatory documents for automated compliance checking, J. Comput. Civil Eng. (2013)
  • T. Young et al., Recent trends in deep learning based natural language processing, IEEE Comput. Intell. Mag. (2018)
  • R. Collobert et al., Natural language processing (almost) from scratch, J. Mach. Learn. Res. (2011)
  • N.C. Marques et al., Tagging with Small Training Corpora (2001)
  • X. Yu, A. Faleńska, N.T. Vu, A general-purpose tagger with convolutional neural networks, arXiv preprint...
  • F. Chollet, Deep Learning with Python,...
  • J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language...
  • J. Lee et al., BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics (2020)
  • B. He, D. Zhou, J. Xiao, Q. Liu, N.J. Yuan, T. Xu, Integrating graph contextualized knowledge into pre-trained language...