1 Introduction

With the rapid development of the Internet and big data technology, data in various fields is growing rapidly. For example, the big data generated in forestry [1], smart grids [2], the Internet of Things [3], social networks [4] and other fields has brought new opportunities and challenges to research in those fields. At the same time, data in the field of coal mine safety is also accumulating rapidly, resulting in big data [5, 6] in the coal mine domain. Coal mine safety data contains important information such as the time of an accident, the location of the coal mine where it occurred, and the personnel and equipment involved. If this unstructured information is effectively mined and integrated, coal mine safety accidents can be better predicted and monitored. In this paper, 8,569 coal mine safety accident news reports are retrieved from coal mine safety websites by a crawler, covering accident types such as roof fall, gas, and mechanical and electrical accidents, together with the specific course of each accident.

In the past, information extraction in the coal mining field relied mainly on manual extraction. However, manual extraction is time-consuming and labor-intensive and can no longer keep pace with the growing scale of coal mine safety data. With the development of hardware [7, 8], software [9, 10], and new algorithms [11,12,13], large amounts of data can be collected and processed quickly. Artificial intelligence technology is widely used in various fields against the background of big data [14], and the application of deep learning in the field of coal mining is gradually emerging [15, 16]. How to effectively extract large numbers of entity relation triples and entity attribute triples from massive texts has become a hot issue in the field of coal mine safety. In addition, entities and relations are the basis of knowledge graph construction [17], and they are also of great significance for the automatic construction of a coal mine safety knowledge graph [18,19,20].

At present, there have been many studies on entity recognition and relation extraction, but they mainly focus on general datasets, and studies on entity relation extraction in specific fields remain scarce. The difficulty of entity relation extraction in the coal mine safety field lies in overlapping relations, nested entities [21], excessively long entities, and the polysemy of domain terms. Entity recognition and relation extraction [22] are related to information extraction [23] and event extraction [24], and they are the basic techniques for building knowledge graphs. Entity relation extraction is mainly divided into two categories: pipeline approaches and joint extraction approaches [25]. Pipeline extraction separates entity recognition and relation extraction into two subtasks, performing named entity recognition on the unstructured text first, followed by relation classification. However, pipeline extraction ignores the connection between the two subtasks: unrelated entities identified during entity extraction interfere with the relation extraction of related entities, and errors made in the entity recognition subtask easily propagate to the relation extraction subtask. Joint extraction of entities and relations can address these problems.

The joint extraction method can enhance the connection between entities and relations, and is therefore more effective than the pipeline method. Joint extraction is further divided into methods based on parameter sharing and methods based on sequence annotation. Miwa and Bansal [26] first proposed using an end-to-end model to extract entities and relations in 2016. Through a Bidirectional Long Short-Term Memory (BiLSTM) neural network, the parameters of the entity recognition task and the relation extraction task are shared. A dependency tree is built in the relation classification subtask, and relations are extracted along the shortest path between entities in the dependency tree. Parameter sharing between the two subtasks solves the lack of connection between entity extraction and relation classification. However, after entity extraction there are still entities left without any relation after relation classification, so it is not strictly joint extraction of entities and relations.

In 2017, Zheng et al. [27] proposed using sequence annotation to achieve joint extraction of entity relations. The annotation encodes three types of entity information: location, type, and role. This solves the problem of entity redundancy and realizes end-to-end joint extraction of entity relations, but the problem of extracting overlapping entity relations [28] remains. After 2018, with the introduction of pre-trained models such as Bidirectional Encoder Representation from Transformers (Bert) [29], such models were adopted across various Natural Language Processing (NLP) tasks. As a character-based model, Bert avoids the problem of Chinese word segmentation errors [30,31,32], and joint entity relation extraction based on pre-trained models has been widely used [33,34,35].

For example, Wu et al. [30] adopted the pre-trained model Bert-Whole Word Masking (Bert-wwm) [36] for cardiovascular disease texts, outperforming other comparative models on entity relation extraction in that field. Ge et al. [37] proposed fusing characters and words to solve the problem of wrong segmentation boundaries in domain-specific Chinese word segmentation, and also integrated lexical information, achieving the best entity relation extraction results on the Baidu DuIE dataset. In the field of coal mining, Zhang et al. [38] proposed an end-to-end entity relation extraction model, but it did not address entity relation overlap. However, overlapping entity relations are common in coal mine safety accident texts and are difficult to extract. Moreover, domain-specific entities have varying boundaries, which further increases the difficulty of the joint extraction task.

Therefore, in view of the common phenomenon of overlapping entity relations in coal mine safety accidents, this paper proposes a joint extraction model for coal mine safety entity relations based on the Multi-heads Self-attention (MHA) mechanism and characters and words fusion, abbreviated as CWT-Joint, which realizes joint extraction of entity relations in the field of coal mine safety accidents.

2 Joint Entity Relation Extraction Model for Coal Mine Safety Accident

The coal mine safety entity relation joint extraction model is mainly composed of three parts: coal mine safety accident text preprocessing, CWT-Joint model, and triple extraction.

2.1 Coal Mine Safety Dataset Preprocessing

Preprocessing of the coal mine safety news data includes the following steps:

  1. Data crawling: Coal mine safety accident news is crawled from websites and WeChat official accounts related to coal mine safety accidents. News on WeChat official accounts can only be crawled after logging in to WeChat, whereas unstructured news on websites can be crawled directly.

  2. Data cleaning: Crawled coal mine accident reports contain special symbols such as “\(\backslash {n}\)” and “\(\backslash {r}\)”, content unrelated to the accident, and sentences repeated across news reports; these are removed (a cleaning sketch is given after Fig. 2 below). After this processing, the coal mine safety dataset is obtained.

  3. Manual labeling: The entity relation labeling software fastlabel is used to label the original corpus at the sentence level in the form shown in Fig. 1. The annotations are exported as JSON and then processed with Python into the sequence labeling form shown in Fig. 2.

Figure 1

Manual labeling example.

Figure 2

Sequence annotation example.
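As a minimal sketch of the cleaning step above, the snippet below removes special symbols and drops sentences repeated across reports; the regular expressions and the sentence-splitting rule are illustrative assumptions rather than the exact rules used in this paper.

```python
import re

def clean_report(text, seen_sentences):
    """Remove special symbols and sentences already seen in earlier reports."""
    text = re.sub(r"[\r\n\t]", "", text)                 # drop symbols such as "\n", "\r"
    sentences = [s for s in re.split(r"(?<=[。！？])", text) if s]
    kept = []
    for s in sentences:
        if s in seen_sentences:                          # skip sentences repeated across reports
            continue
        seen_sentences.add(s)
        kept.append(s)
    return "".join(kept)

# usage: call clean_report on each crawled report, sharing one `seen_sentences` set
seen = set()
cleaned = [clean_report(r, seen) for r in crawled_reports]  # crawled_reports: list of raw report strings
```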

A single entity relation joint extraction sequence annotation consists of three parts: the entity boundary, the entity relation, and the entity role. The entity boundary represents the position information of the entity: “B” marks the start of an entity, “I” marks the middle or end of an entity, and “O” marks a non-entity token. The four entity relations and two entity attributes, together with their subjects and objects, are shown in Tables 1 and 2. Each relation or attribute label is formed from the initial letters of its name; for example, the relation “Occurrence point” is abbreviated as “op”. In addition, the subject and object roles of relations and attributes are also specified. The entity role represents the role an entity plays in an attribute triple or relation triple: “1” denotes the subject and “2” denotes the object or attribute value.

Table 1 Entity relations.
Table 2 Entity attributes.

Overlapping entity relations are labeled with the entity boundary and “cw”, i.e., “B-cw” and “I-cw”. Overlapping entities carry no entity role. In an entity relation triple or attribute triple, the overlapping entity takes the role complementary to that of the entity it matches: if the matched entity's role is “1”, the overlapping entity corresponds to “2”, and vice versa.

There are 3 types of entity labels, with a total of 27 tags, as shown in Table 3.

Table 3 Types of coal mine safety entity labels.
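For reference, the 27-tag label space in Table 3 can be reproduced from the components described above. In the sketch below, the tag string format (“B-op-1”) and all abbreviations other than “op”, “mp”, and “cw” are hypothetical placeholders.

```python
# Enumerate the label space: 2 boundaries x (4 relations + 2 attributes) x 2 roles
# + the two overlapping tags + "O" = 27 tags, matching Table 3.
RELATIONS = ["op", "mp", "rel3", "rel4"]   # "op" and "mp" appear in the text; rel3/rel4 are placeholders
ATTRIBUTES = ["attr1", "attr2"]            # placeholder attribute abbreviations
ROLES = ["1", "2"]                         # 1 = subject, 2 = object / attribute value

tags = ["O"]
for name in RELATIONS + ATTRIBUTES:
    for boundary in ("B", "I"):
        for role in ROLES:
            tags.append(f"{boundary}-{name}-{role}")
tags += ["B-cw", "I-cw"]                   # overlapping entities carry no role

assert len(tags) == 27
```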

2.2 CWT-Joint Model

The overall structure of the CWT-Joint model is shown in Fig. 3; it consists of Word2vec [39], RoBERTa-wwm-ext, BiLSTM, MHA and CRF layers. First, the input sentence is divided into characters and words; the word vectors of all words are obtained by Word2vec training, while RoBERTa-wwm-ext generates the character vectors of the sentence. Then, the character vectors and word vectors are passed through MHA separately; the MHA layer models the structures and relations among the character vectors and among the word vectors and adjusts their weights. After that, the character vectors and word vectors are fused to form the embedding layer and fed into the BiLSTM layer to extract features. Finally, according to the label dependencies modeled by the CRF layer, the globally optimal label sequence is obtained.

Figure 3

CWT-Joint model.
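A minimal PyTorch sketch of the architecture in Fig. 3 is given below. It assumes pre-computed character vectors (from RoBERTa-wwm-ext) and word vectors (from Word2vec) that are already aligned per token; the dimensions follow Sect. 3.3, the word-side head count is an assumption, and CRF decoding (Sect. 2.2.5) is omitted.

```python
import torch
import torch.nn as nn

class CWTJointSketch(nn.Module):
    """Characters-and-words fusion -> BiLSTM -> per-token tag scores for a CRF."""

    def __init__(self, char_dim=768, word_dim=128, hidden=128, num_tags=27):
        super().__init__()
        self.char_attn = nn.MultiheadAttention(char_dim, num_heads=12, batch_first=True)
        self.word_attn = nn.MultiheadAttention(word_dim, num_heads=4, batch_first=True)  # assumed head count
        self.bilstm = nn.LSTM(char_dim + word_dim, hidden, bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(2 * hidden, num_tags)  # scores handed to the CRF layer

    def forward(self, char_vecs, word_vecs):
        # char_vecs: (batch, seq_len, 768); word_vecs: (batch, seq_len, 128),
        # each character position carrying the vector of the word it belongs to
        c, _ = self.char_attn(char_vecs, char_vecs, char_vecs)   # re-weight characters (MHA)
        w, _ = self.word_attn(word_vecs, word_vecs, word_vecs)   # re-weight words (MHA)
        fused = torch.cat([c, w], dim=-1)                        # characters-and-words fusion
        feats, _ = self.bilstm(fused)
        return self.emissions(feats)
```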

2.2.1 Characters and Words Fusion Model

Computers cannot process words directly; text must first be converted into numbers before it can be processed. The common one-hot method tends to produce high-dimensional, sparse matrices, and the word vectors it generates lack connections with each other. As an unsupervised model, Word2vec can be trained on the original corpus to obtain low-dimensional, dense word vectors without manual annotation.

Word vectors contain rich lexical information, but because of the polysemy of Chinese words, segmentation errors easily occur when the corpus is segmented to train word vectors with Word2vec. For example, consider the sentence “云南红河州新平县新华乡布者煤矿发生一起顶板事故 (A roof accident occurred in Buzhe coal mine, Xinhua Township, Xinping County, Honghe Prefecture, Yunnan Province)”. A general word segmentation method easily splits “新华乡布者煤矿 (Xinhua Township Buzhe coal mine)” into “新华 (Xinhua)”, “乡布者 (Xiangbuzhe)” and “煤矿 (coal mine)”, whereas the correct segmentation is “新华乡 (Xinhua Township)” and “布者煤矿 (Buzhe coal mine)”. As a result, the trained coal mine safety accident word vectors are wrong, which leads to wrong entity and attribute boundaries in the entity and attribute triples and harms the final entity relation extraction result. Character vectors, such as those generated by RoBERTa-wwm-ext, do not suffer from this word segmentation problem. However, characters carry far less textual information than words, and features shared by characters and words cannot be exploited. Therefore, this paper proposes joint entity relation extraction with characters and words fusion, integrating the vectors generated by RoBERTa-wwm-ext and Word2vec to enrich the semantic representation. The fusion process of character and word vectors is shown in Fig. 4.

Figure 4

Characters and words fusion.

The character-level and word-level vectors are first processed by MHA, which assigns different weights to characters and words according to their importance, and the character vectors and word vectors are then fused.

As shown in Fig. 4, the phrase “禾甸镇万泉煤矿 (Hedian Town Wanquan coal mine)” is divided into three words and six characters: “禾甸镇 (Hedian Town)”, “万泉 (Wanquan)” and “煤矿 (coal mine)”. Let the trained word vectors be \(w_{1i}\), where i denotes the i-th word, and the character vectors be \(b_{1j}\), where j denotes the j-th character. After the weights are redistributed by MHA, the fusion vector \(e_{1}\) of the phrase is as follows.

$$\begin{aligned} e_{1}=M H A\left( w_{11} \oplus \ldots \oplus w_{13}\right) \oplus M H A\left( b_{11} \oplus \ldots \oplus b_{16}\right) \end{aligned}$$
(1)

For each input sentence, let the concatenated character vectors obtained from RoBERTa-wwm-ext be \(b_{k}\) and the concatenated word vectors obtained from Word2vec be \(w_{k}\), following the word segmentation described above. The final sentence vector \(e_{k}\) is then computed as follows, where k denotes the k-th sentence.

$$\begin{aligned} e_{k}=M H A\left( b_{k}\right) \oplus M H A\left( w_{k}\right) \end{aligned}$$
(2)

2.2.2 RoBERTa-wwm-ext Pretraining Model

The RoBERTa-wwm-ext model is built from the encoder of the Transformer. It is a Bert variant optimized for Chinese, with improvements in the following three aspects:

  1. The Masked Language Model (MLM) pretraining task: Bert's two pretraining tasks are MLM and Next Sentence Prediction (NSP), both of which have certain defects. MLM borrows from the Continuous Bag-Of-Words and Skip-Gram methods in Word2vec and uses masks to replace tokens in sentences: it selects 15\(\%\) of the tokens, replaces them with [Mask], and predicts them. RoBERTa-wwm-ext, by contrast, adopts dynamic masking. Instead of replacing tokens with [Mask] once during data preprocessing, it masks the sentence dynamically during training, so the positions masked in each training round may differ, which greatly improves the generalization ability of the model. In addition, an optimization specifically for Chinese is whole word masking (wwm): the original Bert masks individual characters, whereas wwm masks all characters of a word together. For Chinese, the original Bert therefore learns less lexical information, and whole word masking alleviates this shortcoming.

  2. The NSP pretraining task: The NSP task in the original Bert takes two consecutive paragraphs from the same article as positive samples and combinations of paragraphs from different articles as negative samples to train the model. The RoBERTa-wwm-ext model removes the NSP task and instead feeds the model continuous text segments of up to about 512 tokens.

  3. Larger data and longer training: RoBERTa-wwm-ext increases the amount of pretraining data, trains for longer with more training steps, and adjusts the parameters of the Adam optimizer, making the model perform better.

The main structure of the RoBERTa-wwm-ext model is shown in the left subgraph of Fig. 5, where E1 and E2 are the input character embeddings. The right subgraph of Fig. 5 shows the Transformer model; Trm denotes a Transformer encoder, and RoBERTa-wwm-ext is formed by stacking Transformer encoders. The stacked encoders map the inputs E1 and E2 to the output character vectors T1 and T2.

Figure 5

RoBERTa-wwm-ext model pre-training structure.
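As a sketch of how character vectors can be obtained from such a pre-trained model, the snippet below loads the publicly released HuggingFace checkpoint hfl/chinese-roberta-wwm-ext (the checkpoint name is an assumption about which release is used) and encodes one sentence character by character.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

sentence = "布者煤矿发生一起顶板事故"            # Chinese is tokenized per character
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

char_vectors = outputs.last_hidden_state          # (1, seq_len, 768): one 768-dim vector per character
```

During training, the pre-trained parameters are left trainable (requires_grad=True, Sect. 3.3) so that they can be fine-tuned on the coal mine corpus.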

2.2.3 MHA Model

Although the BiLSTM model captures the current context, it loses some important information as the sentence length increases. The MHA model, by contrast, can fully capture long-distance features and obtain global information. It can obtain various features from characters, words, and sentences to improve the effect of entity relation extraction. Furthermore, MHA assigns more weight to important content in the text, reduces the attention paid to unimportant features, and can more easily capture important long-distance features. For example, in the coal mine safety accident dataset sentence “3月23日上午9时左右,广西省宜州市庆远镇黄麻屯村在非法采煤时发生一起触电事故,造成2人死亡 (At around 9:00 am on March 23, an electric shock accident occurred during illegal coal mining in Huangmatun Village, Qingyuan Town, Yizhou City, Guangxi Province, resulting in two deaths.)”, “触电事故 (electric shock accident)” is more closely related to the other terms and is therefore assigned a larger weight. Other words receive smaller weights, and since the “触电事故 (electric shock accident)” occurred in “黄麻屯村 (Huangmatun Village)”, the weight assigned to “黄麻屯村 (Huangmatun Village)” is also larger.

The character vectors and word vectors output by RoBERTa-wwm and Word2vec are each mapped by three matrices to obtain the query Q, key K, and value V of self-attention. Attention over Q, K, and V is then computed as follows:

$$\begin{aligned} {\text {attention}}(Q, K, V)={\text {soft}} \max \left( \frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \end{aligned}$$
(3)

In Eq. (3), \(d_k\) is the dimension of Q, K, and V. MHA performs h self-attention operations in parallel; each attention operation yields a \(head_i\), and the h heads are concatenated to obtain the result of the multi-heads self-attention mechanism. Compared with the ordinary attention mechanism, MHA pays more attention to the syntactic features inside the sentence: in ordinary attention, the query and the key/value come from different sequences, representing the target and the source, so features inside a single sentence are not attended to. The MHA model aggregates the semantic information of multiple self-attention heads, and each head learns independently, focusing on different aspects of the input sentence such as position, character, and word features. Finally, the outputs of the h self-attention heads are concatenated. The specific calculations are as follows:

$$\begin{aligned} {\text {head}}_{i}={\text {attention}}\left( Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \end{aligned}$$
(4)
$$\begin{aligned} {\text {MHA}}(Q, K, V)={\text {concat}}\left( \text{ head } _{1} \ldots \text{ head } _{h}\right) W^{\circ } \end{aligned}$$
(5)

Among them, \(W_{i}^{Q}\), \(W_{i}^{K}\), \(W_{i}^{V}\) are the matrices for the linear transformation of Q, K and V respectively, and \(W^{\circ }\) is the output projection matrix.
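A from-scratch sketch of Eqs. (3)-(5) is given below; it assumes a single sequence (no batch dimension) and pre-initialised projection matrices, which is a simplification of how these layers would be wrapped in a module.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """Scaled dot-product attention, Eq. (3)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ V

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h):
    """Eqs. (4)-(5): X is (seq_len, d_model); W_q/W_k/W_v/W_o are (d_model, d_model)."""
    d_model = X.size(-1)
    d_k = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # the same input projected three ways (self-attention)
    heads = [attention(Q[:, i * d_k:(i + 1) * d_k],
                       K[:, i * d_k:(i + 1) * d_k],
                       V[:, i * d_k:(i + 1) * d_k]) for i in range(h)]
    return torch.cat(heads, dim=-1) @ W_o          # concatenate the h heads and project, Eq. (5)
```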

2.2.4 LSTM Model

The LSTM (Long Short-Term Memory) neural network is a kind of recurrent neural network. Compared with a standard recurrent neural network, it adds three gates: the input gate, the output gate, and the forget gate. The forget gate controls whether the previous cell state is kept in the current cell state, and is computed as follows:

$$\begin{aligned} f_{\mathrm {t}}=\sigma \left( \mathrm {w}_{f h} * h_{t-1}+w_{f x} * x_{t}+b_{f}\right) \end{aligned}$$
(6)

The input gate determines whether the current input is stored in the unit state. The calculation is shown as follows:

$$\begin{aligned} i_{\mathrm {t}}=\sigma \left( \mathrm {w}_{i} *\left[ h_{t-1}, x_{t}\right] +b_{i}\right) \end{aligned}$$
(7)

The output gate, which together with the cell state determines the output of the LSTM, is computed as follows:

$$\begin{aligned} o_{\mathrm {t}}=\sigma \left( \mathrm {w}_{o} *\left[ h_{t-1}, x_{t}\right] +b_{o}\right) \end{aligned}$$
(8)

The LSTM layer automatically extracts features from the fused vectors output by the previous layer, and the CRF layer then uses the labels predicted from context to obtain the optimal sequence.
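The gate equations (6)-(8) can be written out directly for a single time step; the sketch below is illustrative, and the weight layout (one concatenated matrix per gate) is an assumption about notation rather than the exact parameterisation used here.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step. p is a dict of weight matrices and bias vectors per gate."""
    z = torch.cat([h_prev, x_t], dim=-1)
    f_t = torch.sigmoid(z @ p["W_f"] + p["b_f"])      # forget gate, Eq. (6)
    i_t = torch.sigmoid(z @ p["W_i"] + p["b_i"])      # input gate, Eq. (7)
    o_t = torch.sigmoid(z @ p["W_o"] + p["b_o"])      # output gate, Eq. (8)
    c_tilde = torch.tanh(z @ p["W_c"] + p["b_c"])     # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                # new cell state
    h_t = o_t * torch.tanh(c_t)                       # new hidden state (LSTM output)
    return h_t, c_t
```

In practice the model simply uses torch.nn.LSTM with bidirectional=True and a hidden size of 128 (Sect. 3.3), which applies these gates in both directions.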

2.2.5 CRF Model

Although the LSTM and MHA models can learn contextual labels and output the label with the highest probability, they cannot model the dependencies between labels, which may produce invalid label sequences (for example, an “I” tag directly following an “O” tag). Since CRF can take the order between tags into account, the CRF layer is selected as the final output layer. The commonly used first-order chain-structured CRF is shown in Fig. 6.

Figure 6

First order chain structure CRF.

For the character sequence \(\left( x_{1}, x_{2}, x_{3} \ldots x_{n}\right)\), the predicted label sequence \(\left( y_{1}, y_{2}, y_{3} \ldots y_{n}\right)\) can be obtained with a linear-chain CRF. The prediction score is calculated as follows:

$$\begin{aligned} s(x, y)=\sum _{i=0}^{n} A_{y_{i}, y_{i+1}}+\sum _{i=1}^{n} P_{i, y_{i}} \end{aligned}$$
(9)

\(P_{i, y_{i}}\) represents the probability of outputting label \(y_{i}\) at the i-th position, and \(A_{y_{i}, y_{i+1}}\) represents the transition probability from \(y_{i}\) to \(y_{i+1}\). The optimal predicted label sequence is obtained with the Viterbi algorithm, as shown in the following formula:

$$\begin{aligned} y^{*}=\arg \max (s(x, y)) \end{aligned}$$
(10)

The Viterbi algorithm uses dynamic programming to find the maximum-scoring state path.
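A compact sketch of this decoding step is shown below, assuming per-token emission scores P and a transition matrix A as in Eq. (9); start and end transitions are omitted for brevity.

```python
import torch

def viterbi_decode(emissions, transitions):
    """emissions: (T, num_tags) scores P; transitions: (num_tags, num_tags) scores A.
    Returns the best tag path and its score (Eqs. (9)-(10))."""
    T, _ = emissions.shape
    score = emissions[0]                       # best path score ending in each tag at position 0
    backpointers = []
    for t in range(1, T):
        # total[i, j] = score[i] + A[i, j] + P[t, j]
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):          # follow back-pointers from the best final tag
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return path[::-1], float(score.max())
```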

2.2.6 Extraction of Entity Relation Triples for Coal Mine Safety Datasets

Entity relation extraction on the coal mine safety dataset can be divided into two cases: common entity relation extraction and overlapping entity relation extraction. According to the labeling results, certain rules must be set to extract entity relation triples and entity attribute triples. For ordinary triple extraction, all entities are first located according to the entity boundaries; then, for each entity, the closest entity belonging to the same relation is found by searching forward and backward, and the entity roles distinguish subject from object (or subject from attribute value). Once all entities have been matched, triple extraction for the sentence is complete. For overlapping entity or attribute triples, the overlapping entity carries no entity role or relation and can be matched with any other entity. Therefore, after the common entities and the overlapping entities are extracted, the common entities are actively matched with the overlapping entities, and each overlapping entity takes the role complementary to that of the entity it matches. When all entities have been matched, the triples in the sentence have been extracted.

The extraction of ordinary triples is shown in Fig. 7. First, the entities “郊南煤矿 (Jiaonan coal mine)” and “透水事故 (Water permeability accident)” are extracted. Then, for the entity “郊南煤矿 (Jiaonan coal mine)”, the nearest entity under the same “op” relation, “透水事故 (Water permeability accident)”, is found. According to the entity roles, “透水事故 (Water permeability accident)” is judged to be the subject and “郊南煤矿 (Jiaonan coal mine)” the object, and the triplet [透水事故 (Water permeability accident), 发生点 (occurrence point), 郊南煤矿 (Jiaonan coal mine)] is extracted. Since the sentence has no other unmatched entities, triple extraction ends.

Figure 7

Common entity relation extraction.

The extraction of overlapping triples is shown in Fig. 8. First, the overlapping entity “爆炸事故 (Explosion accident)” and the common entities “1月 (January)” and “新兴煤矿 (Xinxing coal mine)” are extracted according to the entity boundary labels. Then, for the entity “1月 (January)”, the closest entity forward or backward is the overlapping entity “爆炸事故 (Explosion accident)”, whose role is therefore determined to be the subject, and the triplet [爆炸事故 (Explosion accident), 发生时间 (occurrence time), 1月 (January)] is extracted. Similarly, the triplet for the remaining unmatched entity “新兴煤矿 (Xinxing coal mine)” is found: [爆炸事故 (Explosion accident), 发生点 (occurrence point), 新兴煤矿 (Xinxing coal mine)]. At this point, triple extraction for the sentence is complete.

Figure 8

Overlapping entity relation extraction.
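A simplified sketch of this rule-based matching is given below. Entity spans are assumed to have been collected beforehand as (text, relation, role) items in sentence order; the relation abbreviation “ot” for “occurrence time” is a hypothetical placeholder (only “op” and “mp” are named in the text).

```python
def extract_triples(entities):
    """entities: list of (text, relation, role) in sentence order.
    role: "1" = subject, "2" = object/attribute value, "cw" = overlapping entity."""
    triples = []
    cw = [i for i, e in enumerate(entities) if e[2] == "cw"]
    for i, (text, rel, role) in enumerate(entities):
        if role == "cw":
            continue                                       # overlapping entities only match passively
        other = "2" if role == "1" else "1"
        # closest entity with the same relation and the complementary role,
        # otherwise fall back to an overlapping entity, which can take either role
        partners = [j for j, e in enumerate(entities) if e[1] == rel and e[2] == other] or cw
        if not partners:
            continue
        j = min(partners, key=lambda k: abs(k - i))
        subj, obj = (text, entities[j][0]) if role == "1" else (entities[j][0], text)
        triples.append((subj, rel, obj))
    return triples

# Fig. 8 example: two ordinary objects matched against one overlapping entity.
print(extract_triples([("1月", "ot", "2"), ("新兴煤矿", "op", "2"), ("爆炸事故", "cw", "cw")]))
# prints: [('爆炸事故', 'ot', '1月'), ('爆炸事故', 'op', '新兴煤矿')]
```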

3 Experiment Preparation and Result Analysis

3.1 Coal Mine Safety Accident Dataset

At present, there are few datasets about coal mine safety. This paper crawls 8,569 coal mine safety accident reports from coal mine related news websites and WeChat official accounts, including the time, title, and content of each accident.

The obtained coal mine safety accident dataset is abbreviated as coalmine_data. 20\(\%\) of the extracted dataset is used as the test set, and 80\(\%\) is used as the train set. The relevant data statistics after labeling are shown in Table 4:

Table 4 Coal mine dataset label statistics.
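A minimal sketch of the 80/20 split of coalmine_data is shown below; the file name and the one-sentence-per-block format are illustrative assumptions.

```python
import random

with open("coalmine_data.txt", encoding="utf-8") as f:
    sentences = f.read().strip().split("\n\n")   # one sequence-labeled sentence per blank-line-separated block

random.seed(42)                                  # fixed seed so the split is reproducible
random.shuffle(sentences)
split = int(0.8 * len(sentences))
train_set, test_set = sentences[:split], sentences[split:]
print(len(train_set), len(test_set))
```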

3.2 General Experimental Dataset

To further verify the generality of the proposed model, and since the RoBERTa-wwm-ext model used in this paper is a pre-trained model optimized for Chinese, the Baidu DuIE Chinese dataset (https://ai.baidu.com/broad/download?dataset=sked) is selected for the generality experiments. The DuIE dataset is a relatively large Chinese entity relation extraction dataset whose data comes from Baidu Encyclopedia and Baidu News abstracts. In total, the dataset contains 457,866 entity relation triples, 214,590 sentences, and 50 defined relations. In this paper, 172,983 sentences are selected as the train set and 21,626 sentences as the test set, and both are processed into the same sequence annotation form as the coal mine safety accident dataset. The label statistics of the DuIE dataset are shown in Table 5.

Table 5 DuIE dataset label statistics.

3.3 Experimental Environment and Parameter Setting

The convolution kernel sizes for the Word2vec-CNN-CRF model are 1×128, 2×128, and 3×128, with 64 kernels and 64 channels. The word vector dimension trained by the Word2vec model is 128, and the word vectors are trained for 500 epochs.

The Word2vec-BiGRU-CRF model uses the same Word2vec parameters; its BiGRU hidden layer is set to 128 dimensions and the number of layers to 2. The Word2vec-BiLSTM-CRF model uses the same parameter settings as the Word2vec-BiGRU-CRF model.

The CWT-Joint model and the comparative models are trained and tested under Python 3.8 and PyTorch 1.9.0. The experiments run on an RTX 2080 Ti GPU with 11 GB of video memory. The CWT-Joint experiments use RoBERTa-wwm-ext as the pre-training model, with 12-head MHA and 12 layers. The hidden layer dimension is 768, and requires_grad is set to True so that the pre-trained parameters can be fine-tuned during training. The forward and backward hidden states of the LSTM network are each set to 128 dimensions. The maximum sequence length is set to 64, and Adam is selected as the optimizer. The learning rate is set to 0.001, and dropout is set to 0.5 to prevent overfitting. The batch size for both the train set and the test set is 64, the maximum number of training epochs is 500, and the best model is saved during training. Some hyperparameters of the CWT-Joint model are shown in Table 6.

Table 6 CWT-Joint Model hyperparameters setting.
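For convenience, the main settings above can be collected into a single configuration object; the sketch below is illustrative (the key names are assumptions), with values taken from this section and Table 6.

```python
config = {
    "pretrained_model": "hfl/chinese-roberta-wwm-ext",  # assumed checkpoint name
    "attention_heads": 12,
    "encoder_layers": 12,
    "pretrained_hidden_dim": 768,
    "word2vec_dim": 128,
    "lstm_hidden_dim": 128,
    "max_seq_len": 64,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "dropout": 0.5,
    "batch_size": 64,
    "max_epochs": 500,
    "fine_tune_pretrained": True,   # requires_grad = True
}
```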

3.4 Evaluation Index

This paper selects the precision P, the recall R, and their harmonic mean F1 as the evaluation indices. The three are calculated as follows.

$$\begin{aligned} P=\frac{T_{p}}{T_{p}+F_{p}} \times 100 \% \end{aligned}$$
(11)
$$\begin{aligned} R=\frac{T_{p}}{T_{p}+F_{n}} \times 100 \% \end{aligned}$$
(12)
$$\begin{aligned} F 1=\frac{2 P R}{P+R} \times 100 \% \end{aligned}$$
(13)

In the formula, \(T_{p}\) represents the number of positive samples determined as positive, \(F_{p}\) represents the number of negative samples determined as positive. \(F_{n}\) represents the number of positive samples that are determined as negative.
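A small helper implementing Eqs. (11)-(13) might look as follows, computed from counts of correctly extracted triples (\(T_{p}\)), spurious ones (\(F_{p}\)), and missed ones (\(F_{n}\)).

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1) in percent, guarding against empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p * 100, r * 100, f1 * 100

# example: 93 correctly extracted triples, 5 spurious, 7 missed
print(precision_recall_f1(93, 5, 7))   # roughly (94.9, 93.0, 93.9)
```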

3.5 Comparison and Analysis of Experiment Results

This experiment compares Word2vec-CNN-CRF, Word2vec-BiLSTM-CRF, Word2vec-BiGRU-CRF and the proposed CWT-Joint model. The precision, recall, and F1 value of each model are shown in Table 7:

Table 7 Experimental results of different models of coal mine safety dataset.

According to Table 7, the Word2vec-CNN-CRF baseline performs worse than the other models because CNN is better suited to computer vision: when extracting text features as in TextCNN [40], it captures only local features of the text, similar to n-gram information, and ignores word order, so its extraction performance is worse than that of the BiLSTM model. Comparing the F1 values of the Word2vec-CNN-CRF, Word2vec-BiGRU-CRF and Word2vec-BiLSTM-CRF models shows that BiLSTM extracts features from the input word vectors more effectively than BiGRU and CNN. This is because BiGRU is a simplified version of BiLSTM: on small datasets, BiGRU has fewer parameters and converges easily, so it performs well, but when the amount of data is large, BiLSTM has more parameters and stronger learning ability. Word2vec-BiLSTM-CRF improves the F1 value over the Word2vec-BiGRU-CRF model by 0.374\(\%\). Therefore, this paper selects the BiLSTM model as the feature extractor for the fused character and word vectors. On the F1 value, the CWT-Joint model is 19.01\(\%\) higher than the baseline and 4.12\(\%\) higher than the Word2vec-BiLSTM-CRF model. This shows that the first three comparison models cannot solve the word segmentation errors and polysemy problems of the Word2vec model, and that fusing character and word features on the coal mine safety dataset is superior to word-only models, achieving the best results.

In addition, Table 8 compares the CWT-Joint model proposed in this paper with entity relation extraction models proposed in other papers on the Baidu DuIE dataset. The joint entity relation extraction methods used for comparison are:

  (a) The Word2vec-CNN-CRF model, which uses Word2vec to generate word vectors, CNN for feature extraction, and CRF to output the optimal sequence.

  (b) The pointer annotation model [41], which represents relation categories through pointers so that an entity can point to multiple entities via multiple pointers, and designs a label-aware attention mechanism that solves the entity overlap problem to a certain extent.

  (c) The Word2vec-BiGRU-CRF model, which differs from (a) in that the BiGRU model is used for feature extraction.

  (d) The Word2vec-BiLSTM-CRF model, which differs from (a) and (c) in that BiLSTM is used for text feature extraction.

  (e) The characters and words fusion model [37], which mixes character and word representations and uses dilated convolutional networks to extract longer-distance features.

The experimental results of the pointer annotation model and the characters and words fusion model on the DuIE dataset are taken from their original papers.

Table 8 Experimental results of different models of baidu DuIE dataset.

It can be seen from Table 8 that the CWT-Joint model proposed in this paper improves the F1 value on the DuIE dataset compared with the other models, which demonstrates the effectiveness and generality of the CWT-Joint model. The pointer annotation model can solve the entity overlap problem to a certain extent, and its effect is better than Word2vec-CNN-CRF. The Word2vec-BiGRU-CRF and Word2vec-BiLSTM-CRF models outperform the pointer annotation model overall because Word2vec generates high-quality word vectors in specific domains. With large amounts of data, BiLSTM has stronger feature extraction ability than BiGRU due to its larger number of parameters, so the Word2vec-BiLSTM-CRF model performs better than the Word2vec-BiGRU-CRF model. The characters and words fusion model uses fused character and word vectors to address the polysemy of Chinese words; compared with the Word2vec-BiLSTM-CRF model, its performance is greatly improved and its F1 value is close to that of CWT-Joint, but the dilated convolutional network is still inferior to MHA, because each head of MHA can learn different features of the input sentence. The low recall of the pointer annotation model indicates that its recognition of overlapping entities is not ideal. The attention mechanism can assign more weight to important words in the sentence, which is advantageous for long-distance feature extraction, and the “cw” relation allows overlapping entities to be identified effectively.

At the same time, the extraction results for some relations and attributes under the best model, CWT-Joint, are analyzed in Table 9. Table 9 shows that coal mine safety accident relations are extracted well: the “casualties” and “occurrence time” attributes reach the highest and similar F1 values, because the time and casualties of a coal mine safety accident often appear in the same sentence. Since fewer sentences contain the “occurrence point” relation than the other relations, its training is insufficient, so its F1 value is lower and its extraction effect is poorer. The F1 value of “cw” overlapping relation extraction also reaches 93.19\(\%\), indicating that the model handles overlapping entity relations well.

Table 9 Partial relation extraction results of coalmine dataset.

In the case of overlapping relations in the coal mine safety field, the overlapping relation is marked as “cw” relation, which solves the problems of entity pair overlap and single entity overlap in entity relation extraction. The F1 value of overlapping entity extraction is 93.19\(\%\), which is a good performance.

The F1 value of the CWT-Joint model on the public Baidu DuIE dataset is 80.2\(\%\), which is better than the models in other papers. This shows the generality of the proposed model, which also performs well in joint entity relation extraction on data from other fields.

The model proposed in this paper and the coal mine safety accident dataset are limited to triple extraction within a sentence. How to effectively extract more complex relations and overlapping relations across sentences and across paragraphs requires further research.

3.6 Error Analysis

In order to analyze the errors in the results of joint entity relation extraction, we draw heatmaps of the label prediction errors of the baseline model and the CWT-Joint model in Fig. 9, covering the entity boundaries “B”, “I”, “O”, the four relations, the two attributes, the overlapping relation “cw”, and the entity roles “sub” and “obj”.

Figure 9

Wrong predicted label distributions for baseline model and CWT-Joint Model.

The left subgraph shows the predicted labels of the simple baseline model, and the right subgraph shows the predictions of the CWT-Joint model. Comparing the two heatmaps shows that the CWT-Joint model greatly improves entity relation extraction performance. In the baseline model, entity boundary labels account for the most errors, especially cases where “I” is incorrectly predicted as “O”. The overlapping entity label “cw” is also difficult to extract: in the baseline model, except for the “mp” relation, the “cw” relation is easily misidentified as other relations. These problems are effectively alleviated by the CWT-Joint model.

However, some errors still exist in the CWT-Joint model, such as entity boundary errors, and a boundary prediction error can affect the extraction of the entire entity relation, which shows the importance of entity position information. We believe that adding entity position information to the model input and expanding the coal mine safety accident corpus can effectively alleviate this problem, which is a direction for future improvement.

4 Conclusion

In this paper, we proposed a joint entity relation extraction model for coal mine safety accidents based on MHA and characters and words fusion against the background of coal mine big data. The character vectors generated by RoBERTa-wwm-ext address the polysemy of terms in the coal mine field, and the word vectors generated by Word2vec make up for the limited semantic information of the character vectors. MHA assigns larger weights to key character and word vectors. The overall F1 value of the proposed model reaches 94.54\(\%\), better than the comparison models, so it is effective for entity relation and entity attribute extraction on the coal mine safety accident dataset.