Part-of-speech tagging of building codes empowered by deep learning and transformational rules

doi:10.1016/j.aei.2020.101235

Advanced Engineering Informatics

Volume 47, January 2021, 101235

https://doi.org/10.1016/j.aei.2020.101235

Abstract

Automated building code compliance checking systems have been under development for many years. However, the excessive amount of human input needed to convert building codes from natural language to computer-understandable formats has severely limited the range of code requirements they can cover. To address this, automated code compliance checking systems need to support automated regulatory rule conversion. Accurate Part-of-Speech (POS) tagging of building code texts is crucial to this conversion. Previous experiments showed that state-of-the-art generic POS taggers do not perform well on building codes. In view of that, the authors propose a new POS tagger tailored to building codes. It uses a deep learning neural network model and error-driven transformational rules. The neural network model contains a pre-trained model and one or more trainable neural layers. The pre-trained model was fine-tuned on Part-of-Speech Tagged Building Codes (PTBC), a POS-tagged building codes dataset. Fine-tuning the pre-trained model allows the proposed POS tagger to reach high precision with a small amount of available training data. Error-driven transformational rules were used to boost performance further by fixing errors made by the neural network model in the tagged building code. Through experimental testing, the authors found a well-performing POS tagger for building codes that had one bi-directional LSTM trainable layer, used the BERT_Cased_Base pre-trained model, and was trained for 50 epochs. This model reached 91.89% precision without error-driven transformational rules and 95.11% precision with them, outperforming the 89.82% precision achieved by state-of-the-art POS taggers.

Introduction

Efforts to automate code compliance checking started more than half a century ago, when Fenves (1966) developed decision tables to automatically check the design of steel structures [1]. The success of compliance checking decision tables inspired more research in this area. Examples include a computer-aided design (CAD) system for 2D and 3D steel structures called STEEL-3D [2], an expert system for reinforced concrete design [3], a rule-based application for structural members [4], and a knowledge-based system for multiple building codes [5]. More advanced code compliance checking software was then developed. The Construction and Real Estate Network (CORENET) by the Singapore Building and Construction Authority was capable of checking 3D Industry Foundation Classes (IFC) data models [6]. The Express Data Manager (EDM) Suite by Jotne EPM Technology allowed code checking on Building Information Modeling (BIM) data [7]. The BCAider by the Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Australia enabled automatic compliance checking against the Building Code of Australia (BCA) [8]. The Solibri Model Checker (SMC), a BIM-powered automated code compliance checking system by Solibri, achieved rule-based code compliance checking through user-customized plugins [9]. Patlakas et al. developed a BIM-based system to automatically check the code compliance of timber structure designs [10]. Fang et al. developed a deep learning-based method to automatically check whether a site worker complies with the code of their certification [11]. The combination of BIM and automated code compliance checking systems increases the theoretical benefit of BIM in the construction industry. However, according to a survey by Smits et al. (2017), the actual benefit of implementing BIM in construction projects is still limited [12]. The authors suggest that the narrow range of checkable codes in most recent automated code compliance checking tools may limit the actual benefit of BIM. Even within this narrow range, the checkable codes are usually oversimplified. Oversimplified codes are not enough to support the increased project complexity and the creativity of designers and, therefore, could negatively affect the benefit of adopting BIM for users and owners [13].

The narrow range of checkable codes also limits the wide application of these automated code compliance checking systems. Extending the range of checkable building code requirements has emerged as an urgent need in the development of automated code compliance checking systems. Natural Language Processing (NLP) powered by Part-of-Speech (POS) tagging has been proposed to automate the extraction of building code requirements [30]; it has thereby extended the range of building codes that automated code compliance checking systems can check and reduced the manual effort needed in such extraction [14], [15], [16]. NLP and deep learning have many applications in the Architecture, Engineering, and Construction (AEC) industry [21]. For example, Fang et al. developed a text classification method with deep learning to spot near misses in safety reports [17]. Zhong et al. used a deep learning method to classify building quality problems [18]. Trappey et al. used an attention mechanism to generate summaries of engineering patents [19]. High performance was achieved, but POS tagging error was identified as one major source of error in the whole system. Accurately POS-tagged building codes are needed to support such NLP-based automated building code compliance checking. Existing generic POS taggers, however, cannot provide such high accuracy when processing building codes [20].

The authors therefore propose a new POS tagger that is tailored to building codes. The intent of the study is to improve the accuracy of POS tagging on building codes. Accurate POS tagging results are needed to support successful processing of code requirements for accurate automated code compliance checking. The proposed POS tagger combines a neural network model and error-driven transformational rules, which together make it outperform the state of the art. The proposed POS tagger reached 95.11% accuracy, which is higher than the 89.82% accuracy achieved by the state of the art.
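To put these figures in perspective, the short Python sketch below restates them as error rates. It is only arithmetic on the percentages reported in this paper (89.82% for the state of the art, 91.89% for the neural network model alone, 95.11% with transformational rules); the function names are hypothetical, and the relative error reduction it computes is a derived quantity, not a number reported by the authors.

```python
def error_rate(accuracy_pct: float) -> float:
    """Share of tokens tagged incorrectly, in percent."""
    return 100.0 - accuracy_pct


def relative_error_reduction(baseline_pct: float, improved_pct: float) -> float:
    """Percentage of the baseline's errors removed by the improved tagger."""
    baseline_err = error_rate(baseline_pct)
    improved_err = error_rate(improved_pct)
    return 100.0 * (baseline_err - improved_err) / baseline_err


# Figures reported in the paper (percent).
STATE_OF_THE_ART = 89.82
NEURAL_ONLY = 91.89
NEURAL_PLUS_RULES = 95.11

print(round(relative_error_reduction(STATE_OF_THE_ART, NEURAL_PLUS_RULES), 1))  # 52.0
print(round(relative_error_reduction(NEURAL_ONLY, NEURAL_PLUS_RULES), 1))       # 39.7
```

Read this way, the full tagger makes roughly half as many errors as the best generic tagger, and the transformational rules remove about two fifths of the errors left by the neural network model alone.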

In practice, this POS tagger plays an important role in NLP-based automated code compliance checking system frameworks similar to [14] (Fig. 1), and in NLP-based automation systems in the AEC domain in general [22]. This research can boost the accuracy of POS tagging and therefore support automated building code compliance checking systems and NLP-based systems in the AEC domain. Accurate POS tagging of building codes is vital to high performance in extracting the engineering knowledge embedded in them. The background automated code compliance checking system framework in Fig. 1 contains an automated regulatory information extraction component (which uses a POS tagger) that converts building code requirements to logic clauses, an automated building design information extraction component that extracts building design information from Building Information Models (BIMs), and an automated reasoning component that outputs the code compliance report. The automated regulatory information extraction component can use the proposed POS tagger, which is illustrated in Fig. 3. This system is fully automated from the end user's perspective. The automated building code compliance checking system takes a rule-based approach to extract information from building codes automatically. Although the POS tagger uses a neural network model, which is probabilistic in training, the POS tagger that results from the training is deterministic. The weights of the neural network are fixed after training, leading to deterministic results when the POS tagger is applied. Therefore, with a robust POS tagger and other well-performing components, the NLP-based automated building code compliance checking system has a better chance of detecting all noncompliance cases in a building design without intervention from the user. Due to the imperfect (i.e., less than 100%) precision and recall of state-of-the-art NLP-based building code compliance checking systems, some manual intervention will still be needed to fix errors in the extraction of the engineering knowledge embedded in the building codes. Such manual intervention is expected from the developers, not from end users. In addition, the manual effort needed to fix automatic extraction errors is minor compared with that needed for a manual extraction. In this paper, the authors propose to boost the performance of NLP-based automated code compliance checking systems by providing more accurate POS tagging results to such systems.

The remainder of this paper is organized as follows. Section 2 explains the technical details of part-of-speech tagging, error-driven transformational rules, recurrent neural networks, and the computing techniques to avoid overfitting that are used in this research. Section 3 describes the proposed POS tagger. Section 4 presents the experiment to test the performance of the proposed POS tagger. Section 5 illustrates and discusses the results of the experiment. Finally, Sections 6, 7, and 8 present the contributions to the body of knowledge, the limitations and future work, and the conclusion of this research, respectively.

Section snippets

Part-of-Speech

A word’s POS category provides its syntactic information in a sentence [24]. In English, there are eight main POS categories: (1) noun, (2) verb, (3) adjective, (4) adverb, (5) pronoun, (6) preposition, (7) conjunction, and (8) interjection. POS taggers are systems that automatically assign POS categories to words according to their contextual information in a sentence [25]. POS taggers have a variety of applications in the AEC domain. For example, Le et al. POS tagged construction contracts to

Methodology

To develop a POS tagger tailored to building codes, the authors combined multiple state-of-the-art techniques such as error-driven transformational rules, recurrent neural networks, dropout layers, and pretrained models. At the core, the proposed POS tagger has two main components, a neural network model and a set of error-driven transformational rules. The neural network model initially predicts the POS tag of a word. The error-driven transformational rules fix errors made by the neural
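As a rough illustration of how error-driven transformational rules can sit on top of an initial tagger, the following minimal Python sketch assumes a Brill-style rule template ("change tag A to tag B when the previous token is tagged C"). The `Rule` type, the function names, the single rule template, and the toy tag sequences are all hypothetical; the paper's actual rule templates and learning procedure are not given in this excerpt, and the initial predictions here are hard-coded rather than produced by a neural network.

```python
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    from_tag: str  # tag to be corrected
    to_tag: str    # replacement tag
    prev_tag: str  # rule fires only when the previous token has this tag


def apply_rule(tags: List[str], rule: Rule) -> List[str]:
    """Return a copy of tags with the rule applied at every matching position."""
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
            out[i] = rule.to_tag
    return out


def count_correct(tags: List[str], gold: List[str]) -> int:
    return sum(t == g for t, g in zip(tags, gold))


def learn_best_rule(predicted: List[str], gold: List[str]) -> Optional[Rule]:
    """Propose one rule per observed error and keep the one with the largest net gain."""
    candidates = {
        Rule(predicted[i], gold[i], predicted[i - 1])
        for i in range(1, len(gold))
        if predicted[i] != gold[i]
    }
    baseline = count_correct(predicted, gold)
    best, best_gain = None, 0
    for rule in sorted(candidates):
        gain = count_correct(apply_rule(predicted, rule), gold) - baseline
        if gain > best_gain:
            best, best_gain = rule, gain
    return best


# Toy data (hypothetical): the initial tagger mislabels the second token.
predicted = ["DT", "VB", "NNS", "MD", "VB", "RB", "."]
gold = ["DT", "NN", "NNS", "MD", "VB", "RB", "."]

rule = learn_best_rule(predicted, gold)
print(rule)
print(apply_rule(predicted, rule))
```

The learned rule only changes a tag where its context matches, so the second `VB` (preceded by `MD`) is left untouched. A full learner would repeat this step, adding the best rule each round until no candidate yields a net gain.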

Textual data

The proposed POS tagger was trained on the POS tagged building codes (PTBC) dataset [71], a dataset that consists of 1522 POS-tagged sentences from chapters 5 and 10 of the 2015 International Building Code (IBC). In total, the PTBC dataset has 39,875 tokens. A token is the smallest unit in POS tagging, such as a word or a punctuation mark. For example, the word “means” and the period are two tokens in the sentence “The means of egress shall have a ceiling height of not less than 7 feet 6 inches.” which
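To make the notion of a token concrete, the sketch below splits the example sentence into words and punctuation marks with a simple regular expression. The pattern and the `tokenize` function are assumptions made for illustration; the tokenization scheme actually used to build PTBC is not described in this excerpt and may differ (for example, in how it treats hyphens or section numbers).

```python
import re

# A run of word characters is one token; any other non-space character
# (such as a period or comma) is a token of its own.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence: str) -> list:
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


sentence = (
    "The means of egress shall have a ceiling height "
    "of not less than 7 feet 6 inches."
)
tokens = tokenize(sentence)
print(tokens)
print(len(tokens))  # 18 under this simple pattern
```

Under this pattern the word “means” and the final period come out as separate tokens, and each token would receive exactly one POS tag.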

Results and discussion

To find a well-performing combination of training epochs, pre-trained models, and trainable layers to use in the POS tagger, the authors trained 14 models (Table 2). The best-performing POS tagger had one bi-directional LSTM trainable layer, used the BERT_Cased_Base pre-trained model, and was trained for 50 epochs. This model (Model 9 in Table 2) reached the highest accuracy after applying transformational rules. The optimization of the deep learning component of this POS tagger is

Contributions to the body of knowledge

This research makes contributions in both theory and practice. Theoretically, it makes two main contributions to the body of knowledge. First, it provides a hybrid deep-learning and rule-based method to enhance the performance of POS taggers on domain-specific texts. The combination of deep learning neural network models and error-fixing transformational rules makes the proposed POS tagger outperform the state-of-the-art POS taggers with a limited amount of training data. Many current state-of-the-art

Limitations and future work

One main limitation of this work is acknowledged: the POS tagger is still not error-free. In spite of its improvement over the state of the art, this POS tagger may still not be accurate enough to support an error-free extraction of the engineering knowledge embedded in building codes. Errors in POS tagging may have a negative effect on the performance of NLP-based automated building code compliance checking systems that leverage it. The authors suggest that research to further increase the accuracy

Conclusion

The ability to provide accurate POS tagging results of building codes paves the way to automated regulatory information extraction and widens the possible range of applicable code requirements of automated code compliance checking systems. The authors proposed a new POS tagger to support such systems. This is the first POS tagger that is tailored to building codes. The POS tagger gained information on general English by incorporating pre-trained deep learning models and captured AEC domain

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The authors would like to thank the National Science Foundation (NSF). This material is based on work supported by the NSF under Grant No. 1827733. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

References (78)

V.E. Saouma et al., Architecture of an expert-system-based code-compliance checker, Eng. Appl. Artif. Intell. (1989)
P.M. Evans, Rule-based applications for checking standards compliance of structural members, Build. Environ. (1990)
C. Eastman et al., Automatic rule-based checking of building designs, Autom. Constr. (2009)
P. Patlakas et al., Automatic code compliance with multi-dimensional data fitting in a BIM context, Adv. Eng. Inf. (2018)
Q. Fang et al., A deep learning-based method for detecting non-certified work on construction sites, Adv. Eng. Inf. (2018)
X. Xu et al., Semantic approach to compliance checking of underground utilities, Autom. Constr. (2020)
W. Fang et al., Automated text classification of near-misses from safety reports: An improved deep learning approach, Adv. Eng. Inf. (2020)
B. Zhong et al., Convolutional neural network: Deep learning-based classification of building quality problems, Adv. Eng. Inf. (2019)
A.J.C. Trappey et al., Intelligent compilation of patent summaries using machine learning and natural language processing techniques, Adv. Eng. Inf. (2020)
S. Singaravel et al., Deep-learning neural-network architectures and methods: Using component-based models in building-design energy prediction, Adv. Eng. Inf. (2018)
J. Wang et al., Deep learning for sensor-based activity recognition: A survey, Pattern Recogn. Lett. (2019)
J.L. Elman, Finding structure in time, Cogn. Sci. (1990)
S.J. Fenves, Tabular decision logic for structural design, J. Struct. Div. (1966)
C.I. Pesquera et al., Advanced graphical CAD system for 3D steel frames, Comput. Aid. Design Civil Eng. ASCE (1984)
P. Fazio et al., Knowledge-based system development tools for processing design specifications, Comput.-Aided Civ. Infrastruct. Eng. (1988)
L. Khemlani, CORENET e-PlanCheck: Singapore's automated code checking system, AECbytes, October,...
Q. Yang, IFC-compliant design information modelling and sharing, J. Inf. Technol. Constr. (ITcon) (2003)
L. Ding, R. Drogemuller, M. Rosenman, D. Marchant, J. Gero, Automating code checking for building designs-DesignCheck,...
W. Smits et al., Yield-to-BIM: impacts of BIM maturity on project performance, Build. Res. Inf. (2017)
J.K. Whyte et al., How Digitizing Building Information Transforms the Built Environment (2017)
J. Zhang et al., Automated information transformation for automated regulatory compliance checking in construction, J. Comput. Civil Eng. (2015)
S. Li et al., Integrating natural language processing and spatial reasoning for utility compliance checking, J. Constr. Eng. Manage. (2016)
X. Xue et al., Evaluation of Eight Part-of-Speech Taggers in Tagging Building Codes: Identifying the Best Performing Tagger and Common Sources of Errors (2020)
X. Xu et al., Semantic frame-based information extraction from utility regulatory documents to support compliance checking
H. Cunningham, GATE, a general architecture for text engineering, Comput. Humanit. (2002)
L. Abzianidze, J. Bos, Towards universal semantic tagging, arXiv preprint arXiv:1709.10381,...
H. Schmid, Part-of-speech tagging with neural networks, Proceedings of the 15th conference on Computational...
J. Lee et al., Effective risk positioning through automated identification of missing contract conditions from the contractor’s perspective based on FIDIC contract cases, J. Manage. Eng. (2020)
F.U. Hassan et al., Automated requirements identification from construction contract documents using natural language processing, J. Legal Affairs Dispute Resolut. Eng. Constr. (2020)
P. Zhou et al., Automated matching of design information in BIM to regulatory information in energy codes, Constr. Res. Congr. (2018)
J. Zhang et al., Semantic NLP-based information extraction from construction regulatory documents for automated compliance checking, J. Comput. Civil Eng. (2013)
T. Young et al., Recent trends in deep learning based natural language processing, IEEE Comput. Intell. Mag. (2018)
R. Collobert et al., Natural language processing (almost) from scratch, J. Mach. Learn. Res. (2011)
N.C. Marques et al., Tagging with Small Training Corpora (2001)
X. Yu, A. Faleńska, N.T. Vu, A general-purpose tagger with convolutional neural networks, arXiv preprint...
F. Chollet, Deep Learning with Python,...
J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language...
J. Lee et al., BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics (2020)
B. He, D. Zhou, J. Xiao, Q. Liu, N.J. Yuan, T. Xu, Integrating graph contextualized knowledge into pre-trained language...

