1 Introduction

Event causality identification (ECI) is a highly challenging task in natural language understanding that involves predicting the causal relation for a pair of events in a document. The technique has a wide range of applications in machine reading comprehension [1, 2], question-and-answer reasoning [3, 4], and event prediction [5, 6]. As shown in Fig. 1, an ECI model needs to identify the causalities between the pairs of events mentioned in sentences S1 and S2: ① \({\textbf {earthquake}}\overset{cause}{\longrightarrow } {\textbf {collapsed}}\) in S1 is an explicit causality, signaled by explicit connectives with causal words such as because, since, or therefore, and can be identified easily. ② \({\textbf {shot}}\overset{cause}{\longrightarrow } {\textbf {killed}}\) in S2 is an implicit causality with ambiguous connectives and no causal words, whose identification requires a deep understanding of the context's semantics.

Fig. 1 Illustrations of original sentences with different causalities between event pairs

Fig. 2 Fine-tuning and prompt tuning for ECI

Previous ECI approaches relied on feature-based techniques [7,8,9]. Recent advances have turned to deep learning techniques [10, 11], which have significantly improved performance. Nevertheless, current methods primarily adopt the “pre-train, fine-tune” paradigm, as illustrated in Fig. 2. While PLMs such as BERT [12] and Roberta [13] are pre-trained as fill-in-the-blank models, ECI requires an additional classification layer for fine-tuning after pre-training. The resulting model cannot fully exploit the potential of PLMs because of the gap between downstream tasks and the pre-training objective. Prompt tuning [14, 15] aims to close this gap by reformulating downstream tasks with task-specific templates so that they align with the pre-training approach of PLMs. Prompt tuning thus allows us to leverage the prior knowledge stored in PLMs to enhance the effectiveness of the ECI task.

Although prompt tuning accommodates PLMs, it nonetheless faces two key challenges: (1) Difficulty in identifying implicit causalities. PLMs are primarily trained on large unlabeled, unstructured corpora rich in generic high-frequency entities and common sense [16,17,18]. Task-specific knowledge involving long-tail entities, multivariate associative relationships, and complex causal logic, such as event knowledge, is significantly harder for PLMs to comprehend. As a result, even a basic prompt tuning paradigm proves challenging because PLMs lack event-specific knowledge. (2) Insufficient event-knowledge interaction. Recent works have investigated external knowledge to enhance text understanding. Liu et al. [11] inserted external knowledge into the original text where the event was mentioned, potentially introducing noise and disrupting the semantics; Cao et al. [19] and Liu et al. [20] incorporated descriptive and relational knowledge via a graph structure and knowledge representation learning, respectively, to facilitate causal reasoning. Nevertheless, the external knowledge still lacks sufficient interaction with the original text.

To tackle the challenges mentioned above, we propose a novel approach called Knowledge Interaction Graph guided Prompt Tuning for Event Causality Identification (KIGP). (1) Implant external event knowledge to elicit PLMs. Event knowledge descriptions make the contextual understanding of events conceptually more straightforward and profound. In addition, they allow prompt tuning to better activate the knowledge of event-event relationships stored in PLMs, leading to more accurate identification of implicit causalities. (2) Construct interaction graphs to capture the interactions between context, event mentions, and external knowledge. These graphs bridge the gap between external knowledge and causality by capturing potential semantic interactions.

Specifically, our approach comprises three key steps: (1) Obtain the triples of event mentions from an external knowledge graph, i.e., ConceptNet, and linearize them into knowledge text. (2) Introduce an event pair-based template and an answer-mapping verbalizer that use prompt tuning to induce the learning ability of PLMs and enhance implicit causality identification. (3) To facilitate better interaction between text, event mentions, and knowledge, propose an interaction graph guidance mechanism that constructs interaction graphs to effectively guide the model in causality identification. We use a graph convolutional network (GCN) to enhance the feature representations of the various nodes from a global perspective. Experimental results on two widely used datasets indicate that our model outperforms previous methods.

Our contributions are summarized as follows:

  1. (1)

We propose a novel Knowledge Interaction Graph guided Prompt Tuning method for Event Causality Identification, which effectively utilizes external event knowledge and prompt tuning to fully activate the potential of PLMs. To the best of our knowledge, this is the first work that combines GCN and prompt tuning for the ECI task.

  2. (2)

We design a guidance mechanism and construct knowledge interaction graphs that accurately guide external knowledge to enrich the event representations through the deep interaction of text, events, and knowledge. These interaction graphs help to capture implicit causality better and significantly improve our model's ability to solve the ECI task.

  3. (3)

    Experimental results demonstrate that our approach substantially outperforms the most recent state-of-the-art approach on two benchmark datasets, EventStoryLine and Causal-TimeBank, with an F1-score improvement of 6.3 and 2.9 percentage points, respectively.

2 Related work

Event causality identification

The initial strategies for ECI involved feature-based methods that typically employed features such as lexical and syntactic patterns [21,22,23], causality cues [7, 24, 25], and statistical information [6, 26]. Thereafter, supervised learning-based methods emerged, relying on large amounts of labeled data [27,28,29,30]. However, the scale of annotated datasets remains limited: the largest dataset, EventStoryLine [31], contains only 258 documents, 4316 sentences, and 1770 causal event pairs. To address this problem, weakly supervised approaches [8] and approaches that introduce external knowledge [11, 32,33,34] augment the datasets and improve ECI performance. Advanced PLMs have achieved good performance in recent research [10]. Liu et al. [11] utilized a BERT-based model for mention masking generalization, and Zuo et al. [33] employed a pairwise learning framework to identify causalities by generating new samples. However, these existing methods all rely on fine-tuning, which makes it difficult to identify implicit causalities.

Fig. 3 Overall framework of our approach for ECI (KIGP). ① The input of the Document Encoder contains three parts: the event knowledge text obtained from the external knowledge graph ConceptNet, the original text, and the prompt for ECI; the output is the token representations corresponding to these three parts. ② The Interaction Constructor obtains the event knowledge representations from the constructed event knowledge interaction graph via GCN encoding, and then fuses them with the event representations in the prompt. ③ The fused representations are fed into Roberta, and the causality is predicted from the MASK-feature by the RobertaLM head

Prompt tuning

Since the emergence of GPT-3 [18], a new fine-tuning methodology named prompt tuning has gained attention. Unlike the “pre-train, fine-tune” paradigm, prompt tuning adapts downstream tasks to PLMs and retrieves knowledge already stored in PLMs. It is widely applied to a large variety of tasks such as text classification [35], relation extraction [36], event extraction [37, 38], and entity classification [39]. Researchers have made efforts to determine how to design prompt templates, proposing methods ranging from automatic search of discrete prompts and gradient-guided search to continuous prompts [40] such as P-tuning [41] and Prefix-tuning [42]. Shen et al. [43] used a derivative prompt joint learning method to enhance the model's ability to identify explicit and implicit causality. Recently, some studies have attempted to integrate external knowledge into prompt design. Tsinghua University proposed PTR [36], which implanted logic rules into prompt tuning, and KPT++ [44], which extended the verbalizer through external knowledge graphs; both achieved large performance gains in task scenarios such as relation extraction and text classification. Generally, external knowledge can be implanted via input augmentation, architectural augmentation, and output regularization. Liu et al. [20] put forward a knowledge-enhanced prompt tuning framework that exploited background knowledge and relational information and adopted knowledge representation learning to further capture implicit causalities. However, previous works [45, 46] suggest that not all external knowledge brings gains, and unselective implantation of external knowledge can sometimes introduce noise.

Graph convolutional network

GCNs [47], designed for graph-structured (non-Euclidean) data, are widely used for node classification, graph classification, and link prediction. For example, TextGCN treated documents and words as nodes and exploited a GCN to learn better node representations for text classification. RichGCN [48] first utilized an interaction graph for causality prediction, but it was prone to error accumulation because its graph construction relies on various existing NLP tools. ERGO [49] constructed event-relational graphs, in which each node represents a pair of events, thereby converting ECI into a node classification problem on graphs.

To efficiently select appropriate task-related knowledge and optimize the learned knowledge representations, our model also introduces external knowledge, utilizes prompt tuning, and uses a GCN to process the interaction graph. However, our approach differs from related works in three ways: (1) To avoid error accumulation, and considering that syntactic structures can be obtained directly through PLMs, we do not utilize off-the-shelf NLP tools to build the graph but design it based on the guidance mechanism. (2) Instead of directly performing node classification or relation prediction with the GCN, we take advantage of its powerful feature extraction on graph data to obtain the hidden-layer features of nodes in the knowledge interaction graph. (3) We adopt feature representations that contain deeper interaction knowledge to precisely guide the prompt and effectively stimulate the potential of PLMs. Note that we regard perceived event knowledge as a bridge between original texts and true causalities. Simultaneously, we focus on constructing interaction graphs by blending the representations of events and different kinds of knowledge.

3 Methodology

The overall framework of our approach KIGP is illustrated in Fig. 3. It contains three components: Document Encoder, Interaction Constructor, and Predictor. First, the Document Encoder obtains word representations of the original text, the event knowledge, and the prompt for ECI. Second, the Interaction Constructor obtains event representations that aggregate knowledge through the graph structure to enhance the event representations in the prompt. The event knowledge interaction graph acquires the representations \( ke_{s} \), \( ke_{t} \) of event nodes by exploiting the knowledge encoder GCN, which aggregates the features of neighboring knowledge nodes in the graph. These are fused with the event representations \( he_{s} \) and \( he_{t} \) in the prompt to obtain new representations \( hke_{s} \) and \( hke_{t} \) that contain event semantics and relational knowledge. Finally, the fused representations are fed into Roberta and combined with the prompt, and the causality classification results are predicted from the MASK-feature in the Predictor based on the vocabulary probability distribution of the RobertaLM head.

3.1 Problem definition

We convert the ECI task into a classification problem, using the masked language model (MLM) head to make predictions. In contrast to most previous work, which uses binary classification (Causality, NoCausality), we adopt ternary classification and further refine Causality into Cause and CausedBy. Given a sentence \( X = \{x_{1},x_{2},...,x_{l}\}\), where l is the number of tokens, and an event pair \( <e_{s},e_{t}> \) in X, \( \mathcal {Y} \) is the set of causal labels denoting whether there is a causality between the event pair. We set \( \mathcal {Y} = \{Cause,CausedBy,Null\}\), which respectively indicate that \( e_{s} \) causes \( e_{t} \), that \( e_{s} \) is caused by \( e_{t} \), and that there is no causality between \( <e_{s},e_{t}> \). The goal of KIGP is to predict the causal label \( y \in \mathcal {Y} \).
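As a concrete illustration, the following minimal sketch (Python; the variable names are our own, and the example values are drawn from S2 in Fig. 1) spells out this ternary formulation:

```python
# Hedged sketch of the ECI task formulation used by KIGP.
sentence = ("A disgruntled woman shot at a Kraft factory, "
            "two workers were killed.")
event_pair = ("shot", "killed")         # <e_s, e_t>
LABELS = ["Cause", "CausedBy", "Null"]  # label set Y
gold_label = "Cause"                    # shot -> killed (Fig. 1, S2)
```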

We design the ECI template \( \mathcal {T}_{ECI}(X) \) and splice it after the sentence X to form \( X^{'} \), the input of the MLM, which induces the model to generate label words associated with the given labels. Specifically, we splice [CLS] and [SEP] to the beginning and end of X, respectively, and add [SEP] to the end of \( \mathcal {T}_{ECI}(X) \). \( X^{'} \), which contains one [MASK] token in \( \mathcal {T}_{ECI}(X) \) for the MLM to predict label words, is:

$$\begin{aligned} X^{'} = [CLS]X[SEP]\mathcal {T}_{ECI}(X)[SEP] \end{aligned}$$
(1)
Fig. 4 Event knowledge from ConceptNet related to event mentions

When \( X^{'} \) is fed into the MLM, the model obtains the probability distribution \( p([MASK] \vert X^{'}) \) over the candidate classes:

$$\begin{aligned} p(y \vert X^{'})= p([MASK]=m \vert X^{'}) \end{aligned}$$
(2)

where m represents the label token corresponding to class y.
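The sketch below (PyTorch/HuggingFace) shows how the distribution in (2) can be read off a masked language model. It is a simplification rather than the full KIGP pipeline: it uses plain Roberta and the raw vocabulary, whereas KIGP uses the learnable template tokens and virtual label words of Section 3.2.2 and the GCN-fused features of Section 3.3.

```python
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# X' = sentence + prompt template; a label word should fill [MASK] (Eqs. 1-2).
x_prime = (
    "A disgruntled woman shot at a Kraft factory, two workers were killed. "
    f"In this sentence, shot {tokenizer.mask_token} killed."
)
inputs = tokenizer(x_prime, return_tensors="pt")
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits             # (1, seq_len, vocab_size)
probs = logits[0, mask_pos].softmax(dim=-1)     # p([MASK] = m | X')
```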

3.2 Document encoder

We choose a prominent MLM, Roberta [13], as the document encoder to encode the input sequence and output prediction results. Each word in the input sequence \( X^{'} \), which includes the Original Text, the Event Knowledge, and the ECI prompt template \( \mathcal {T}_{ECI}(X) \), is encoded into a sequence of representations. The encoded result sequence is \( H = [h_{CLS},H_{X},h_{SEP}, H_{prompt},h_{SEP}] \), where \( H_{X} = [h_{1},h_{2}, ...,h_{n}] \) and \( H_{prompt} = [h_{e_{s}},h_{MASK},h_{e_{t}}] \). This module involves event knowledge acquisition and prompt template design.

3.2.1 Event knowledge acquisition and linearization

A knowledge graph involving a large amount of common sense, entity knowledge, and semantic relations is undoubtedly the best choice for external knowledge. ConceptNet [50] is a knowledge graph rich in concepts and semantic relations, with more than 8 million nodes, 21 million edges, and 34 core relations. For the ECI task, we require in-depth knowledge of event descriptions to supplement or activate the potential of PLMs, as well as to provide a better prompt for prompt learning. Therefore, we retrieve the definitions of the events mentioned in the original text along with the 16 semantic relations in ConceptNet pertinent to ECI: CapableOf, Causes, CauseDesire, UsedFor, HasA, PartOf, Entails, Desires, HasContext, HasSubevent, HasPrerequisite, ReceivesAction, IsA, HasProperty, MannerOf, and CreatedBy. Other knowledge sources such as WordNet could also serve as external knowledge.

Specifically, we first retrieve the nodes of the event mentions \( e_{s} \), \( e_{t} \) in the original text from the knowledge graph (ConceptNet). Since most event mention words appear in plural, past-tense, or participle forms, we perform word-form reduction on them. We then match the sub-graph of the 16 semantic relations and the nodes associated with the event mentions. Part of the knowledge related to “shot” and “killed” gleaned from ConceptNet is shown in Fig. 4, e.g., \({\textbf {shooting}}\overset{IsA}{\longrightarrow } {\textbf {homicide}}\) and \({\textbf {kill}}\overset{Causes}{\longrightarrow } {\textbf {death}}\). We observe that an event mention corresponds to various relational knowledge, and there may also be several explanatory items within each relation. As an illustration, the “HasSubevent” relation for the event mention “kill” includes the explanation items “HasSubevent shoot”, “HasSubevent someone or something dies”, etc. To enhance the event representation, we add each explanation item connected to each event mention to a knowledge list, which forms a more thorough and in-depth description of the event. These triples are finally linearized into a text, EventText. The semantic relation words “IsA”, “HasSubevent”, etc., are changed into plain-language descriptions such as “is a” and “has subevent” to make the knowledge description more natural and fluent. EventText is spliced into the input sequence.
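A minimal sketch of this retrieval-and-linearization step follows. It queries the public ConceptNet 5 REST API; the relation filter matches the 16 relations listed above, while the helper name, the per-event item limit, and the camel-case-to-plain-text conversion are our own illustrative choices (word-form reduction is assumed to have been applied to `event` beforehand).

```python
import re
import requests

# The 16 ECI-relevant ConceptNet relations from Sect. 3.2.1.
RELATIONS = {
    "CapableOf", "Causes", "CauseDesire", "UsedFor", "HasA", "PartOf",
    "Entails", "Desires", "HasContext", "HasSubevent", "HasPrerequisite",
    "ReceivesAction", "IsA", "HasProperty", "MannerOf", "CreatedBy",
}

def linearize(event: str, limit: int = 5) -> str:
    """Retrieve ConceptNet edges for an event mention and linearize
    them into EventText (limit=5 echoes the finding in Sect. 4.6)."""
    url = f"http://api.conceptnet.io/c/en/{event}?limit=100"
    edges = requests.get(url).json()["edges"]
    items = []
    for edge in edges:
        rel = edge["rel"]["label"]
        if rel in RELATIONS and len(items) < limit:
            # "HasSubevent" -> "has subevent", "IsA" -> "is a", etc.
            rel_text = re.sub(r"(?<!^)(?=[A-Z])", " ", rel).lower()
            items.append(
                f'{edge["start"]["label"]} {rel_text} {edge["end"]["label"]}'
            )
    return ", ".join(items)
```

For the event mention “kill”, this would yield a string along the lines of “kill causes death, kill has subevent shoot, ...”, which is then spliced into the input sequence as EventText.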

3.2.2 Design for ECI prompt

Prompt tuning converts downstream tasks into a form consistent with the pre-training objective by introducing task-specific templates, so how the template and verbalizer are constructed is critical. We design the prompt for ECI as \( \mathcal {T}_{ECI}(X) \):

$$\begin{aligned} \mathcal {T}_{ECI}(X): In~this~sentence,~<t1>~'e_{s}'~<t2>~<t5>~[MASK]~<t6>~<t3>~'e_{t}'~<t4>. \end{aligned}$$
(3)

Some learnable tokens are applied to the template to dynamically accommodate the training of PLMs (e.g., \( <t1> \) and \( <t2> \) mark the position of \( 'e_{s}' \), while \( <t3> \) and \( <t4> \) mark the position of \( 'e_{t}' \)). The [MASK] token in \( \mathcal {T}_{ECI}(X) \) must be filled with a label word (\( <t5> \) and \( <t6> \) mark the position of [MASK]).

For the ECI task, the label words V come from the PLM vocabulary. However, because the PLM vocabulary space is large, some words may not reflect causality well, so we follow previous work in setting virtual words for the causal verbalizer. The label words V are denoted by the three virtual words \( \{Cause, CausedBy, Null\} \). These virtual words are also learnable tokens; Cause and CausedBy help the model learn the directional features of causality, and the verbalizer directly maps these three label words to the causal labels. The probability distribution over the label words at the [MASK] position of the MLM is used as the probability distribution over the causal labels.
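With the HuggingFace tokenizer API, the learnable template tokens and virtual label words can be registered as new vocabulary entries whose embedding rows are trained along with the model. The sketch below is our own reading of this setup; the token spellings and initialization are assumptions, not the paper's exact implementation.

```python
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Learnable template tokens <t1>..<t6> and virtual label words (Sect. 3.2.2).
template_tokens = [f"<t{i}>" for i in range(1, 7)]
label_words = ["<Cause>", "<CausedBy>", "<Null>"]
tokenizer.add_special_tokens(
    {"additional_special_tokens": template_tokens + label_words}
)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows are trained

label_ids = tokenizer.convert_tokens_to_ids(label_words)  # verbalizer targets
template = "In this sentence, <t1> {es} <t2> <t5> <mask> <t6> <t3> {et} <t4>."
```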

3.3 Interaction constructor

The representation of each word is obtained from the document encoder, and classification results could be acquired directly after model training. However, there are close correlations among the text, events, and knowledge; these associations focus on event semantics and conceptual knowledge, which provide richer semantic features for causality identification. Therefore, we propose an interaction guidance mechanism and design an interaction constructor. Based on the guidance mechanism, the interaction constructor can effectively guide external knowledge to enhance the representations of relevant nodes. By constructing interaction graphs over texts, events, and knowledge, the hidden interaction representation of each node is generated by a GCN, which has powerful feature extraction capabilities.

Fig. 5 Guidance mechanism for the interaction graph

3.3.1 Guidance mechanism

There are two types of guidance mechanisms: guiding the original text (got) and guiding the events in the prompt (get). As shown in Fig. 5, the got mechanism uses external event description knowledge to enhance the semantic comprehension of the original text, so it bridges external knowledge with the textual event mentions; the get mechanism seeks to reinforce the reasoning about event relations in the prompt template, so it bridges external knowledge with the event mentions in the prompt template. An example of the guidance mechanisms can be found in Fig. 6 in Section 3.3.2. The blue arrows represent the got mechanism, which guides the establishment of connections between knowledge nodes and event nodes in the original text; for example, knowledge nodes \( k_{s} \) are connected to “shot” and \( k_{t} \) to “killed”. The red arrows indicate the get mechanism, which guides the establishment of connections between knowledge nodes and event nodes in the prompt; for example, knowledge nodes \( k_{s} \) are connected to \( e_{s} \), and \( k_{t} \) to \( e_{t} \). The sketch after this paragraph expresses the two mechanisms as edge lists.
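A minimal sketch, under the assumption of one knowledge node per event mention (as in the \( k_{s} \)/\( k_{t} \) example above); the function name and pairing logic are our own:

```python
# Hedged sketch: got/get guidance expressed as edge lists over node indices.
def guide_edges(text_event_nodes, prompt_event_nodes, knowledge_nodes):
    # got: connect each knowledge node to its event mention in the text
    got = list(zip(knowledge_nodes, text_event_nodes))
    # get: connect each knowledge node to its event slot in the prompt
    get = list(zip(knowledge_nodes, prompt_event_nodes))
    return got + get  # E-K edges consumed in Sect. 3.3.2
```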

Fig. 6 Example of an event knowledge interaction graph adjacency matrix

3.3.2 Interaction graph construction

Constructing the nodes and edges in the interaction graph based on the guidance mechanism is essential for learning effective event representations for ECI.

Nodes in interaction graph

Given a document \( D=\{w_{1},w_{2},...,w_{i}\} \) (where \( w_{i} \) is a word and i is the number of words in the document), we construct one graph for each document separately. The nodes in the graph should be able to capture the document content relevant to the source event \( e_{s} \) and the target event \( e_{t} \) to predict causality. We consider three node types in our work.

① Word Nodes, i.e., the contextual words of the document D.

② Event Nodes, i.e., the event mentions in the document D or in the ECI prompt \(\mathcal {T}_{ECI}(X)\), noted as \( E = \{e_{1},e_{2},...,e_{l}\} \), where l is the number of event mentions.

③ Knowledge Nodes, i.e., external knowledge related to the event mentions, noted as \( K = \{k_{1},k_{2},...,k_{m}\} \), where m is the number of knowledge nodes. Thus, the set of nodes is \( N = \{D\cup E\cup K\} = \{x_{1},x_{2},...,x_{n}\} \), where n is the number of nodes \( (n = i + l + m) \).

Edges in interaction graph

After mapping the document into the three types of nodes, we construct the following two types of edges between nodes to establish the interaction graph based on the guidance mechanism.

① Event-Event Edge (E-E). The event pairs in a document may be scattered across different sentences. Since our main purpose is to identify the causality between two events, Event-Event information is extremely valuable. We add edges between the events \( e_{s} \) and \( e_{t} \) in a document.

② Event-Knowledge Edge (E-K). In order to complement the conceptual and semantic knowledge of events in a document, we construct edges between event nodes and external knowledge nodes.

Interaction graph feature extraction

After defining the nodes and edges of the knowledge interaction graph, the event knowledge interaction graph G with n nodes is constructed automatically via its \( n \times n \) adjacency matrix A. According to the got mechanism, the knowledge nodes are connected to the corresponding event nodes in the original text, and the corresponding positions are set to 1; according to the get mechanism, the knowledge nodes are connected to the event nodes in the prompt, with the corresponding positions set to 1 and all other positions set to 0:

$$\begin{aligned} A_{ij}=\left\{ \begin{array}{ll} 1, &{} e_{ij} \in \{E\text {-}E,\ E\text {-}K\}\\ 0, &{} e_{ij} \notin \{E\text {-}E,\ E\text {-}K\} \end{array}\right. \end{aligned}$$
(4)

\( A_{ij} = 1 \) indicates that node i and node j are connected by an edge. We employ a GCN for feature extraction to generate the node representations in the interaction graph. Specifically, the GCN model uses the feature representation obtained by the document encoder, \( H^{(0)}=[h_{CLS},H_{K}^{(0)},h_{SEP},H_{D}^{(0)},h_{SEP},H_{Prompt}^{(0)},h_{SEP}] \), as the initial representation, where \( H_{K}^{(0)}=[hk_{s},hk_{t}] \), \( H_{D}^{(0)}=[h_{1},he_{s},h_{3},...,he_{t},...,h_{i}] \) and \( H_{Prompt}^{(0)}=[he_{s},h_{MASK}, he_{t}] \). After l layers of aggregation, the feature representation \( H^{(l+1)} \) of the \( (l+1)^{th} \) layer is:

$$\begin{aligned} H^{(l+1)} = ReLU (AH^{(l)}W^{(l)}) \end{aligned}$$
(5)

where \( H^{(l)} \) and \( H^{(l+1)} \) denote the feature vectors of nodes in the \( l^{th} \) and \( (l+1)^{th} \) layers, respectively, \( W^{(l)} \) denotes the weight matrix of the \( l^{th} \) layer, and ReLU is the activation function. After a G-layer GCN, the output is noted as \( H^{(g)} = GCN(A,H^{(0)},G) \) for convenience. The GCN model outputs the feature vectors of the event nodes \( e_{s} \) and \( e_{t} \) as \( ke_{s} \) and \( ke_{t} \), which aggregate the features of the neighboring knowledge nodes from the graph structure. The new feature vectors \( hke_{s} \) and \( hke_{t} \) are derived by fusing them with \( he_{s} \) and \( he_{t} \) in the prompt through splicing, to enhance the semantic representation of the events. The final fused feature representation \( H^{(g)} \) captures the relationship between word nodes and their neighboring nodes, denoted as \( H^{(g)}=[h_{CLS}^{'},H_{K}^{(g)},h_{SEP}^{'},H_{D}^{(g)},h_{SEP}^{'},H_{Prompt}^{(g)},h_{SEP}^{'}] \), where \( H_{K}^{(g)}=[hk_{s}^{'},hk_{t}^{'}] \), \( H_{D}^{(g)}=[h_{1}^{'},he_{s}^{'},h_{3}^{'},...,he_{t}^{'},...,h_{i}^{'}] \) and \( H_{Prompt}^{(g)}=[he_{s}^{'},h_{MASK}^{'},he_{t}^{'}] \). It therefore realizes the interaction of events and knowledge, and provides broader and more abstract deep features for causality prediction.
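The following PyTorch sketch puts Eqs. (4) and (5) together for the one-layer setting used in our experiments (Sect. 4.2). The diagonal self-connections follow Fig. 6, and the E-K edges can come from the got/get guidance of Sect. 3.3.1; the projection back to the encoder dimension after splicing is our own assumption.

```python
import torch
import torch.nn as nn

def build_adjacency(n, ee_edges, ek_edges):
    # Eq. (4), plus the diagonal self-connections visible in Fig. 6.
    a = torch.eye(n)
    for i, j in ee_edges + ek_edges:   # E-E and E-K edges (got/get guidance)
        a[i, j] = a[j, i] = 1.0
    return a

class InteractionGCN(nn.Module):
    """One-layer GCN (Eq. 5) plus event-prompt fusion; the fusion
    projection back to the encoder dimension is an assumption."""
    def __init__(self, dim=768, hidden=2000):
        super().__init__()
        self.w = nn.Linear(dim, hidden, bias=False)  # W^(0)
        self.fuse = nn.Linear(dim + hidden, dim)     # splice -> encoder dim

    def forward(self, h, adj, es_idx, et_idx, he_s, he_t):
        # h: (n, dim) initial node features H^(0) from the document encoder
        h_g = torch.relu(adj @ self.w(h))            # H^(1) = ReLU(A H^(0) W^(0))
        ke_s, ke_t = h_g[es_idx], h_g[et_idx]        # knowledge-aggregated events
        hke_s = self.fuse(torch.cat([he_s, ke_s], dim=-1))
        hke_t = self.fuse(torch.cat([he_t, ke_t], dim=-1))
        return hke_s, hke_t
```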

Assuming the input document is S2 as above, the input format is:

[CLS] \( \langle k_{s} \rangle \) shooting is a homicide, causes death... \( \langle /k_{s} \rangle \) \( \langle k_{t} \rangle \) kill causes death, has subevent shoot... \( \langle /k_{t} \rangle \) [SEP] A disgruntled woman shot at a Kraft factory, two workers were killed. [SEP] In this sentence, shot [MASK] killed. [SEP]

The two events “shot” and “killed” in the above text are denoted by \( e_{s} \) and \( e_{t} \), respectively. The knowledge corresponding to the two events, “shooting causes death...” and “kill has subevent shoot...”, is denoted by \( k_{s} \) and \( k_{t} \). The adjacency matrix of the event knowledge interaction graph constructed during training is shown in Fig. 6, where each word itself is 1 on the diagonal, and \( k_{s} \) and \( k_{t} \) interact with \( e_{s} \) and \( e_{t} \) in the original text and in the prompt for ECI, respectively.

3.4 Predictor

The representation \( H^{(g)} \) obtained by the GCN module of the interaction constructor carries intensive interaction features, enhancing the event representations in the prompt. \( H^{(g)} \) is then fed into the RobertaLM head to yield the MASK-feature. The predictor obtains the probability distribution over the candidate classes from the MASK-feature, and finally the causal label \( y \in \mathcal {Y} = \{Cause,CausedBy,Null\}\) of the event pair \( <e_{s},e_{t}> \) is predicted.
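In code, this last step amounts to restricting the LM-head distribution at the [MASK] position to the three virtual label words. A minimal sketch with hypothetical argument names (`lm_head` being the RobertaLM head, `h_mask` the fused [MASK] representation from \( H^{(g)} \), and `label_ids` the vocabulary ids of the virtual label words from Sect. 3.2.2):

```python
def predict(lm_head, h_mask, label_ids):
    vocab_logits = lm_head(h_mask)                     # (vocab_size,)
    label_probs = vocab_logits[label_ids].softmax(-1)  # over {Cause, CausedBy, Null}
    return ["Cause", "CausedBy", "Null"][label_probs.argmax().item()]
```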

4 Experiments

Our experiments aim to verify (1) whether external event knowledge can effectively potentiate the ability of PLMs to identify implicit causality, and (2) whether knowledge interaction graphs can precisely guide models to enhance ECI performance.

4.1 Datasets and evaluation metrics

Our proposed method is evaluated on two widely used datasets, EventStoryLine (version 0.9) [31] and Causal-TimeBank [51]. EventStoryLine contains 258 documents, 5334 events, and 1770 causal event pairs. Following the prior split in [9, 20], we group documents by topic and sort them by topic ID. The last two topics are used as development data, and the documents in the remaining 20 topics are used for 5-fold cross-validation. Causal-TimeBank contains 184 documents, 1813 events, and 318 causal event pairs. Following Zuo et al. [33, 34], we adopt the same data division, using 10-fold cross-validation. For evaluation, we use Precision (P), Recall (R), and F1-score (F1) as metrics.

4.2 Experimental settings

In our implementation, we use the pre-trained Roberta-base model with 768-dimensional word embeddings as the document encoder. We set the learning rate of the Adam optimizer to 1e-4. Due to the sparsity of positive samples in the ECI datasets, training uses negative sampling; we adopt a negative sampling rate of 0.5, and the training batch size is 16. We tune the hyper-parameters by grid search based on development set performance and apply early stopping. In the interaction graph construction module, we use one GCN layer (G = 1) with 2000 hidden units. External knowledge is acquired from the common-sense knowledge graph ConceptNet 5.5.
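For reference, the settings above can be collected into a single configuration object; this is purely organizational, and the field names are our own:

```python
from dataclasses import dataclass

# The hyper-parameters reported in Sect. 4.2, gathered into one config.
@dataclass
class KIGPConfig:
    encoder: str = "roberta-base"        # 768-d word embeddings
    learning_rate: float = 1e-4          # Adam optimizer
    batch_size: int = 16
    negative_sampling_rate: float = 0.5
    gcn_layers: int = 1                  # G = 1
    gcn_hidden: int = 2000
    knowledge_source: str = "ConceptNet 5.5"
```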

4.3 Baselines

We compare our model with the state-of-the-art (SOTA) models for ECI and consider the following baselines:

Previous SOTA methods: LSTM [52] and Seq [53], a dependency path-based sequential model originally developed for temporal relation prediction; LR+ [9] and LIP [9], document structure-based models; RB [51], a rule-based system; and ML [54], a feature-based model for ECI.

Methods using PLMs and introducing external knowledge: LearnDA [33], a data augmentation method that introduces knowledge bases to augment training data; CauSeRL [32], a self-supervised learning method that learns context-specific causal patterns from external causal statements; MM [11], a BERT-based model with mention masking generalization; and KEPT [20], a knowledge-enhanced prompt tuning method incorporating background and relational information.

Employing GCN methods: RichGCN [48], using GCN to capture interconnections in the document structure graph; ERGO [49], constructing an event relationship graph and utilizing GCN for node classification.

4.4 Main results

Tables 1 and 2 show the performances of our proposed approach and all benchmark models on EventStoryLine and Causal-TimeBank datasets, respectively.

Table 1 Main results on EventStoryLine dataset (%)
Table 2 Main results on Causal-TimeBank dataset (%)
  1. (1)

In terms of overall performance, the proposed model KIGP outperforms all existing baselines on both the EventStoryLine and Causal-TimeBank datasets, with 6.3% and 2.9% improvements over the SOTA method ERGO, respectively. This indicates the effectiveness of our method for the ECI task.

  2. (2)

From the perspective of external knowledge and pre-training, LearnDA and CauSeRL show that introducing external knowledge can improve causality prediction compared to approaches without external knowledge (RB, ML), but a semantic gap between external knowledge and causality remains. The pre-trained model MM is dedicated to stimulating the knowledge of PLMs themselves, and its performance is not as good as that of adding external knowledge, probably because PLMs do not contain enough event-specific and causal knowledge. KEPT capitalizes on background knowledge and relational information, and jointly optimizes event representations and causality with TransE to further capture implicit relationships, which allows it to outperform LearnDA and CauSeRL.

  3. (3)

    Our model adopts the “PLMs, event knowledge, prompting” paradigm to supplement PLMs with event-specific knowledge and use prompting to explore the potential semantics. The performance is improved by approximately 8% compared with CauSeRL and KEPT on both datasets. This demonstrates that external event-specific knowledge can effectively stimulate the ability of PLMs to recognize implicit causality.

  4. (4)

From the perspective of interaction graph structure, compared with the RichGCN and ERGO models that also use graph structures, we do not employ the GCN for node classification or relation prediction but directly use it to extract node features from the event knowledge interaction graph. The F1-scores of our model are higher than those of RichGCN and ERGO on both datasets. The reason may be that our process of building event knowledge interaction graphs avoids the noise and error accumulation introduced by existing NLP tools. In addition, the powerful feature extraction capability of the GCN can promote the hidden-layer representations of nodes and precisely guide the model to understand the semantics, helping causality prediction.

4.5 Ablation experiments

To analyze how each component of the proposed KIGP model contributes to its performance, we conduct ablation studies on the validation set, turning off one component at a time, as shown in Tables 3 and 4.

Table 3 Performance of KIGP model with different components on EventStoryLine dataset (%)
  1. (1)

w/o intergcn: to verify the effectiveness of the interaction graph module, we remove the interaction graph and use only the Roberta encoder to generate the hidden-layer representations H, instead of the representations \( H^{(g)} \) produced by the GCN layer, for predicting causality. Without the interaction between text, events, and knowledge for guidance, the performance decreases by 2.1% and 1.8% on the two datasets, respectively. This shows the importance of the event-knowledge interaction, where the features after interaction play a crucial role in guiding causal reasoning. Our model enables in-depth interaction between events and knowledge, thus boosting performance.

  2. (2)

w/o eventkg: to verify the effectiveness of external event knowledge, we remove the event knowledge text EventText, acquired from ConceptNet, from the input of the document encoder; simultaneously, the interaction graph module with the GCN loses its function. As a result, the performance of our model drops by 2.9% and 2.4% in F1-score on the two datasets, respectively. This indicates that external event-specific knowledge contains useful clues between events that facilitate the ability of PLMs to understand the semantics of text regarding event relationships.

  3. (3)

w/o prmauto: to verify the validity of the automatic prompt, we remove the learnable tokens \(<t1> <t2>... <t6> \) from the template and use only a “manual” prompt such as \( \mathcal {T}_{ECI}(X): In\ this\ sentence,\ 'e_{s}'\ [MASK]\ 'e_{t}' \). The experimental results show that the performance of the “manual” prompt is 1.2% and 1.1% lower than that of the “manual+automatic” prompt with learnable tokens on the two datasets, which suggests that the learnable tokens indeed learn some contextual semantic information through model training that is helpful for causality prediction.

  4. (4)

w/o prmeci: to demonstrate the necessity of the prompt template module, the prompt \( \mathcal {T}_{ECI}(X)\) is removed and ECI degenerates into the basic fine-tuning paradigm. We feed only the original text and event knowledge to Roberta as input, resulting in a significant drop in performance (3.6% and 3.2%). This illustrates that the [MASK] form of the prompt better caters to the MLM's cloze task and stimulates its learning ability. A precise prompt for ECI promotes more accurate understanding and prediction of causalities.

Table 4 Performance of KIGP model with different components on Causal-TimeBank dataset (%)

Through the ablation experiments, we observe that all components contribute to model performance, and that both the external knowledge and the interaction graph with the GCN are beneficial and functional for ECI.

Fig. 7 Comparison of model performance corresponding to different numbers of knowledge items in the EventStoryLine and Causal-TimeBank datasets

Fig. 8 (a) Three forms of knowledge position, with \( x_{i} \) indicating words in the original text (blue), \( e_{i} \) indicating event mentions (orange), and \( k_{i} \) indicating event knowledge (purple); (b) comparison of model accuracy (%) for the different knowledge positions

4.6 Impact of knowledge number and position

The number of knowledge items

We observe that the number of knowledge triples obtained from ConceptNet for each event varies between 0 and 20. We collect statistics on the relevant event knowledge in the EventStoryLine and Causal-TimeBank datasets and find that most events have at most 5 knowledge items. We experiment with different numbers of event knowledge items (2, 5, 10, and unrestricted), and the results are shown in Fig. 7. Model performance does not keep improving as the number of knowledge items increases; the best performance is obtained by limiting the number of knowledge items to fewer than 5. More than 6 items, or unrestricted knowledge, may generate knowledge noise, confuse the semantics, and affect the PLM's understanding of the original text.

Knowledge positions

Three forms of knowledge-enhanced event text are validated as input to the document encoder: preposition, postposition, and interpolation. Preposition places the linearized knowledge EventText in front of the Original Text, denoted as:

$$\begin{aligned} X = [EventText, Original\;Text] \end{aligned}$$
(6)

Postposition places the linearized knowledge \( EventText \) behind the Original Text, denoted as:

$$\begin{aligned} X = [Original\;Text, EventText] \end{aligned}$$
(7)

Interpolation inserts the linearized knowledge \( EventText \) directly at the positions where the events are mentioned in the original text, i.e., the relevant knowledge \( k_{1} \), \( k_{2} \) corresponding to the event mentions is inserted directly behind \( e_{1} \), \( e_{2} \). An experimental comparison of the three forms, shown in Fig. 8, reveals that the accuracy of knowledge preposition is higher than that of knowledge postposition, and knowledge interpolation is the least effective. Intuitively, although the interpolation form can help the model improve its understanding of the events themselves, it widens the gap between the two event mentions in the text, reducing the fluency of the original text and making it difficult for the model to determine the causality of the events.
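The three placements amount to simple string composition; the sketch below is illustrative, with our own function names:

```python
# Hedged sketch of the three knowledge positions compared in Fig. 8.
def preposition(text: str, event_text: str) -> str:
    return f"{event_text} {text}"       # Eq. (6): best accuracy

def postposition(text: str, event_text: str) -> str:
    return f"{text} {event_text}"       # Eq. (7)

def interpolation(text: str, knowledge: dict) -> str:
    # Least effective: inserting each knowledge span right after its
    # event mention widens the gap between the two events.
    for event, k in knowledge.items():
        text = text.replace(event, f"{event} {k}", 1)
    return text
```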

Fig. 9 Variation of the model's F1-score with different numbers of GCN layers in the interaction graph

Table 5 A comparison of the graphics memory consumption and time consumption for training and prediction on the EventStoryLine dataset

4.7 Effect of structure and GCN layers of the interaction graph

The structure of the interaction graph constructed under the guidance of the got and get mechanisms changes with the event knowledge, the original text, and the prompt template. With the original text and event knowledge unchanged, the interaction graph structures differ between the “manual” prompt and the “manual+automatic” prompt with learnable tokens. We compare the two in the ablation experiments, as shown in the w/o prmauto rows of Tables 3 and 4. The “manual+automatic” prompt with learnable tokens shows better results than the manual prompt, which also demonstrates the rationality of the interaction graph structure.

The interaction graph module employs a GCN as its feature extractor. Two GCN layers are typically used in text classification tasks to collect neighbor-node features and achieve high performance. We experiment with the number of GCN layers (G = 1, 2, 3) for the ECI task; the model's F1-scores on both datasets are shown in Fig. 9. One layer is preferable to 2 or 3 layers; that is, the more GCN layers there are, the worse the effect. The likely reason is that the interaction graph specifies text, event, and knowledge nodes precisely so that knowledge improves the comprehension of events: knowledge nodes are the immediate neighbors of event nodes in the interaction graph, so a single layer already aggregates the knowledge features directly. With 2 or 3 layers, the range of aggregated nodes expands further, which can easily confuse the semantics and hinder the understanding of events.

4.8 The computation complexity involved in the process

We describe computational complexity from two aspects. The first is space complexity, i.e., the storage occupied during model training or prediction, here mainly GPU graphics memory. The second is time complexity, i.e., the time taken for model training or prediction, measured as the average time per batch in each epoch. Since traditional complexity notation such as O(n log n) is difficult to apply to deep learning models, we use a comparison-based complexity analysis.

Our model has a “base+module” structure: the base model is Roberta and the module is mainly the GCN. Therefore, our model Roberta+GCN is compared with the base model Roberta and analyzed in terms of time and space consumption. A comparison of the graphics memory and time consumed for training and prediction by the two models on the EventStoryLine dataset is shown in Table 5. For convenience, the maximum memory consumption during model training is denoted as Maxmt, the average time per batch during model training as Avgmt, the maximum memory consumption during model prediction as Maxmp, and the average time per batch during model prediction as Avgmp.
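These quantities can be collected with PyTorch's CUDA instrumentation; the sketch below reflects our reading of such a setup, since the exact measurement code is not part of the paper:

```python
import time
import torch

def measure(model, batches, device="cuda"):
    """Return peak GPU memory (MiB) and average per-batch time (s),
    i.e., Maxmt/Avgmt-style numbers for a sequence of steps."""
    torch.cuda.reset_peak_memory_stats(device)
    times = []
    for batch in batches:
        start = time.perf_counter()
        model(**batch)                  # one training or prediction step
        torch.cuda.synchronize(device)  # wait for the GPU before timing
        times.append(time.perf_counter() - start)
    max_mem_mib = torch.cuda.max_memory_allocated(device) / 2**20
    avg_time = sum(times) / len(times)
    return max_mem_mib, avg_time
```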

As Table 5 shows, the Roberta+GCN model incurs no significant increase in space or time complexity compared to the Roberta model; during prediction, the maximum memory consumption of Roberta+GCN is even slightly lower than that of Roberta.

4.9 Case study

To visually demonstrate the effectiveness of KIGP, we conduct a case study comparing the identification results of KIGP and RichGCN, as shown in Fig. 10.

Fig. 10 Case study. \( <e_{i}, e_{j}> \) indicates event pairs; GT (Ground Truth) gives the true relations between event pairs; Rich indicates the RichGCN method. Bold underlined words indicate events, ✔ indicates that the model identifies a causality between an event pair, and ✘ indicates that it does not

In Case 1, RichGCN identifies \( <war, bombs> \) as a causal pair, but in fact there is no causality between war and bombs; the model may have confused “war” with the nearby word “warns”. Because there is no explicit clue word in the text, RichGCN also fails to identify the causality between “bombs” and “death”, an implicit causality that often requires common-sense knowledge to be inferred correctly.

In Case 2, both RichGCN and KIGP correctly determine that “earthquake causes injured” and “earthquake causes killed”, but RichGCN fails to identify the causal event pair \( <earthquake, destroyed> \), which indicates that adopting only a document structure graph to capture associations between events from structural features may lack comprehension of the text's semantics. KIGP accurately identifies “earthquake causes destroyed” by using structural features while also emphasizing semantic features. KIGP correctly identifies all causal pairs in the two cases, indicating that our proposed approach facilitates the identification of implicit causality by incorporating external knowledge that interacts with text and events, thus enhancing the effectiveness of the ECI model.

Finally, the experiments demonstrate that (1) incorporating external event knowledge into PLMs promotes the semantic analysis of events and event relations in texts and, through prompt tuning, further improves implicit causality identification, and (2) the interaction structure features extracted by the event knowledge interaction graph guide the model to identify causality more precisely and strengthen its ECI capability.

5 Conclusion and future work

This paper proposes a novel Knowledge Interaction Graph guided Prompt Tuning (KIGP) approach that leverages external event knowledge and interaction graphs for the ECI task. To improve the identification of implicit causalities, we incorporate external event knowledge and design the prompt to maximally activate the powerful learning capability of PLMs. To accurately guide ECI models and augment the interaction between events and knowledge, we introduce a guidance mechanism for constructing interaction graphs that capture deep hidden features and enhance the event representations in the prompt. Experimental results on two widely used ECI datasets demonstrate that our approach outperforms existing SOTA methods, effectively addressing, to some extent, the challenges of implicit causality identification and event-knowledge interaction. In future work, we will explore automatic prompt template generation for ECI models to further enhance performance. Other knowledge sources such as WordNet may provide additional useful knowledge for this task, and we will adopt them as external knowledge in future research.