1 Introduction

Money laundering is the process of disguising, concealing, and transforming the illegal income obtained from crimes such as drug trafficking, prostitution, and smuggling. It has become one of the most prevalent economic crimes, seriously undermining economic stability and social security. According to SIPRI, global money laundering amounted to as much as 1.6 trillion dollars (2.7% of global GDP) in 2022 [1]. Therefore, combating money laundering is urgent for maintaining healthy economic development.

Early studies relied on expert knowledge, where financial institutions designed numerous indicators and rules to detect suspicious transactions. For example, large transactions occurring in high-risk regions at midnight may warrant attention. Although rule-based methods are intuitive, their labor-intensive nature makes it difficult to handle vast amounts of data. Additionally, pre-designed rules can be easily bypassed by emerging money laundering tactics. Consequently, many studies focus on applying machine learning and deep learning to AML. Recently, graph neural networks (GNNs) have emerged as a promising solution for AML by leveraging the topological information in transaction networks.

Despite the remarkable success of GNN-based methods, they still face several challenges. First, labeled data are both costly and scarce. Most existing studies rely on supervised learning, which requires sufficient and high-quality annotations. However, in the field of AML, manually labeling suspicious transactions is expensive and labor-intensive. Therefore, it is essential to extract valuable information from unlabeled data. Second, interactions between transactions are often sparse. Due to the challenges of data acquisition, available data are often fragmented. Moreover, money launderers often obscure sequences of suspicious transactions, further complicating detection. However, overly sparse interactions can limit the performance of GNNs.

Contrastive learning (CL), a popular framework in self-supervised learning, has recently achieved significant success across various domains [2,3,4,5,6,7]. Many researchers have adapted CL to graph data, referred to as graph contrastive learning (GCL). The core idea of GCL is to capture intrinsic knowledge by maximizing the agreement between representations from different views of the same graph. Motivated by this, we propose GCPAL, a novel graph contrastive pre-training framework for anti-money laundering (AML). Specifically, we construct three augmented views, including two stochastically perturbed views and a KNN view. The perturbed views are generated using edge dropping and feature dropping strategies, which help the model learn invariant information and improve robustness against noise. The KNN view is generated by selecting the top-k most similar node pairs based on the node attribute matrix, providing implicit interactions to address link sparsity in the transaction network and preserving node feature similarity [8]. These three views are then embedded by a shared graph encoder. Finally, the graph encoder is optimized through a cross-view contrastive learning objective across the three views. Based on the homophily assumption that connected and similar nodes tend to share the same label, we treat neighboring nodes in the original graph and KNN view as positive samples for target nodes.

Our contributions are summarized as follows:

  • We propose the GCPAL framework, which achieves strong AML performance with limited labels by leveraging contrastive learning to enhance model expressiveness across multiple augmented views.

  • We construct KNN views to mitigate the sparsity of interactions in the data. Moreover, the extended positive sample set further enhances the performance of the model.

  • Extensive experiments demonstrate that GCPAL outperforms state-of-the-art (SOTA) AML models, especially with scarce labeled data (e.g., 1\(\%\) and 2\(\%\) of the training data).

2 Related Work

2.1 Anti-money Laundering

Early studies are mainly rule-based methods [9, 10], relying on expert knowledge and simple data analysis to identify suspicious transactions. For instance, Rajput et al. [10] developed an ontology-based expert system incorporating domain knowledge and explicit rules. Nonetheless, rule-based methods are difficult to keep up to date and cannot adapt to new money laundering strategies. More recently, researchers have increasingly applied machine learning algorithms to anti-money laundering (AML). These approaches can be broadly categorized into supervised methods, such as random forests (RF), logistic regression (LR), and support vector machines (SVM), and unsupervised methods, such as K-means and t-SNE. Jorge et al. [11] proposed a Bayesian network for AML, while Savage et al. [12] used RF and SVM to detect criminal transactions. However, studies have shown that machine learning models often suffer from high false-positive rates [13, 14].

Recently, an increasing number of studies have been devoted to deep learning methods [15] in AML. Paula et al. [16] designed an AutoEncoder to capture financial transaction patterns, while Han et al. [17] utilized LSTM to embed news articles and tweets to support AML investigations. Inspired by the success of GNNs in areas like social networks [18], recommendation systems [19], anomaly detection [20,21,22], and traffic prediction [23], several studies [24,25,26] have leveraged topological information in financial transactions to uncover potential money laundering patterns. Alarab et al. [24] proposed an early GNN-based model that combines GCN and MLP in parallel to classify Bitcoin transactions. Evolve-GCN [26] addresses both structural and temporal characteristics using GCN to learn structural information at each time step and RNN to integrate features across time steps. Subsequently, other studies [27,28,29] have proposed dynamic graph neural networks to model the evolving patterns in money laundering. To capture more complex relationships and rich semantic information, some researchers [30, 31] model financial interaction networks as heterogeneous graphs. Palita et al. [32] address the issue of class imbalance using focal loss as the optimization objective. Numerous advanced detection methods [33,34,35,36,37] have also been developed to accurately classify anomalous nodes in networks using deep learning techniques. However, these methods are typically semi-supervised or supervised, requiring substantial labeled data. Since labeling suspicious transactions is costly and labor-intensive, Inspection-L [38] proposes a self-supervised GNN framework using deep graph infomax (DGI). LaundroGraph [39] leverages a link prediction task on the directed bipartite customer-transaction graph to train GNN in a self-supervised manner.

2.2 Graph Contrastive Learning

Motivated by the promising performance of contrastive learning in the fields of CV and NLP, many studies have applied contrastive learning to graph data. Graph contrastive learning aims to maximize the mutual information (MI) between positive examples within graphs. DGI [40] is a pioneering approach that maximizes MI between node and graph representations. Other models, such as InfoGraph [41], MVGRL [42], and SUGAR [43], also leverage MI between local and global representations to learn node embeddings. GCC [44] constructs two augmented views by randomly perturbing nodes or edges and pre-trains GNNs by bringing together representations of the same node across these views while pushing apart representations of different nodes. GCA [45] proposes multiple adaptive graph augmentation strategies based on topological and semantic information. CuCo [46] introduces curriculum learning into contrastive learning by creating a scoring function to rank negative samples. BYOL [47] is a bootstrap method that does not use negative samples, significantly reducing memory costs. As a self-supervised paradigm, graph contrastive learning extracts supervisory signals from unlabeled data, greatly reducing dependency on labeled data. In this paper, we propose a graph contrastive learning framework for anti-money laundering.

3 Preliminaries

3.1 Problem Definition

Following [48], we construct the Bitcoin transaction network as the graph \(G=(V, E, X)\), where node \(v_{i} \in V\) denotes a transaction, edge \(e_{i,j} \in E\) denotes the financial flow between nodes \(v_i\) and \(v_j\), and \(X=\{x_1, x_2, \ldots , x_{|V|} \} \in \mathbb {R}^{|V| \times d}\) is the node feature matrix. Finally, the adjacency matrix is denoted as \(A \in \{0,1\}^{|V|\times |V|}\).

This paper considers anti-money laundering as a graph pre-training task. We pretrain a GNN \(g:x_i \rightarrow h_{i}\), where \(h_{i}\) is the pre-trained representation of \(v_i\). We then train a supervised classifier \(f: h_i \rightarrow Y_i\), where \(Y_i\) is the label of \(v_i\).

3.2 Graph Neural Networks

Graph neural networks (GNNs) aim to iteratively update node representations by combining the representation of each node with those of its neighbors as

$$\begin{aligned} {h}_{i}^{(k)}=\operatorname {COM}\left( {h}_{i}^{(k-1)}, \operatorname {AGG}\left( \left\{ {h}_{j}^{(k-1)}: j \in \mathcal {N}_{i}\right\} \right) \right) , \end{aligned}$$
(1)

where \({h}_{i}^{(k)}\) is the node representation in the kth layer, \(\mathcal {N}_{i}\) represents the set of directly connected neighbors of node i, and \(h_i^{(0)}=x_i\). \(\operatorname {AGG}(\cdot )\) denotes the aggregation function, which aggregates neighbors' information. \(\operatorname {COM}(\cdot )\) is the combination function that combines the aggregated neighbor information with the node's own features from the previous layer; common choices include averaging, summation, and concatenation.
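To make Eq. (1) concrete, the following is a minimal PyTorch sketch of one such message-passing layer, assuming mean aggregation for \(\operatorname {AGG}(\cdot )\) and a linear transformation followed by a ReLU for \(\operatorname {COM}(\cdot )\); the tensor names and the choice of aggregator are illustrative rather than prescribed by our model.

```python
import torch

def message_passing_layer(H, edge_index, W_self, W_neigh):
    """One generic GNN layer in the spirit of Eq. (1).

    H:          (|V|, d) node representations from the previous layer
    edge_index: (2, |E|) long tensor; messages flow from edge_index[0] to edge_index[1]
    """
    src, dst = edge_index
    # AGG: mean over directly connected neighbors
    agg = torch.zeros_like(H)
    agg.index_add_(0, dst, H[src])
    deg = torch.zeros(H.size(0), device=H.device)
    deg.index_add_(0, dst, torch.ones(src.size(0), device=H.device))
    agg = agg / deg.clamp(min=1).unsqueeze(-1)
    # COM: combine the node's own representation with the aggregated neighbor information
    return torch.relu(H @ W_self + agg @ W_neigh)
```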

4 Methodology

This section elaborates on the proposed graph contrastive pre-training framework for anti-money laundering (GCPAL). We begin with an overview of the GCPAL framework and then detail its two major parts: graph contrastive pre-training and supervised classification.

Fig. 1 Graph contrastive pre-training of GCPAL. Two perturbed views \(\mathcal {G}^{'}\) and \(\mathcal {G}^{''}\) are generated from the original graph \(\mathcal {G}\) using graph augmentation. The KNN view is constructed by selecting the top-k most similar node pairs based on node features. The three views are encoded by a shared graph encoder to obtain their representations. The objective of contrastive learning pre-training is to maximize the agreement between positive samples and the dissimilarity between negative samples

4.1 Overview of GCPAL

The proposed GCPAL model builds upon the graph contrastive learning method for AML and consists of two main stages: self-supervised pre-training and supervised classification. Figure 1 illustrates the main architecture of GCPAL. The pseudocode is shown in Algorithm 1.

The pre-training phase generates three augmented graph views: two stochastically perturbed views and a KNN view. The stochastically perturbed graphs are created through feature and edge dropping, while the KNN view is constructed by selecting the top-k most similar node pairs based on node features. These three views are encoded by a shared graph encoder to obtain their respective representations. Contrastive learning tasks are performed across these multiple views. Additionally, a positive sample set is created for each node. The objective of contrastive learning pre-training is to maximize the agreement between positive samples and the dissimilarity between negative samples.

The classification phase is conducted using the limited labels obtained through manual annotation. The pre-trained GNN encoder is reused to generate global-level node embeddings. These embeddings are then concatenated with raw features, and the resulting feature set is fed into a classifier for the anti-money laundering task. The goal of this phase is to accurately identify illegal transactions.

Algorithm 1 Pseudocode of GCPAL
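Since the original pseudocode is not reproduced here, the following condensed Python sketch illustrates the two-stage procedure described above. The helper functions (knn_graph, build_positive_mask, drop_edges, drop_node_features, pretrain_loss, aml_loss) refer to the illustrative code sketches given later in Sects. 4.2 and 4.3; they are simplified versions rather than the reference implementation, and optimizer bookkeeping is omitted.

```python
import torch

def gcpal_pipeline(X, edge_index, Y, labeled_idx, encoder, proj, clf,
                   alpha, beta, k, lam, tau, epochs):
    """Condensed, illustrative rendering of the two-stage GCPAL procedure."""
    # --- Stage 1: graph contrastive pre-training (Sect. 4.2) ---
    knn_edges = knn_graph(X, k)                                       # KNN view, Eq. (4)
    pos_mask = build_positive_mask(edge_index, knn_edges, X.size(0))  # M_P, Eq. (9)
    for _ in range(epochs):
        e1, x1 = drop_edges(edge_index, alpha), drop_node_features(X, beta)
        e2, x2 = drop_edges(edge_index, alpha), drop_node_features(X, beta)
        Z1, Z2 = proj(encoder(x1, e1)), proj(encoder(x2, e2))         # perturbed views
        Zk = proj(encoder(X, knn_edges))                              # KNN view
        loss = pretrain_loss(Z1, Z2, Zk, pos_mask, lam, tau)          # Eq. (10)
        loss.backward()                                               # optimizer step omitted
    # --- Stage 2: supervised classification (Sect. 4.3) ---
    H = encoder(X, edge_index).detach()    # projection head is discarded after pre-training
    P = clf(H, X)                          # Eq. (12)
    return aml_loss(P, Y, labeled_idx)     # Eq. (13)
```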

4.2 Graph Contrastive Pre-training

The pre-training stage aims to extract inherent knowledge from massive unlabeled data. This phase typically includes three major components: graph data augmentation, graph encoder, and contrastive learning.

4.2.1 Graph Data Augmentation

The Bitcoin transaction network has valuable characteristics in both its graph structure and node information. (1) The Bitcoin transaction network may have a large number of edges (e.g., 1.1B edges in the full Elliptic dataset) representing payment flows, the majority of which are legal. As a result, the graph may contain redundant edges that have limited relevance for detecting illegal transactions. (2) Nodes in the network contain high-dimensional features, such as time, transaction fees, and several aggregated values. Although these features contribute to classification accuracy, they also increase the risk of information redundancy and over-fitting. (3) Certain similarities may exist among illicit transactions (e.g., similar payment methods or currency types). However, due to temporal and spatial constraints, some transactions may lack direct connections or have only a few multi-hop links within the network. Since GNNs can utilize only limited k-hop neighbor information, this semantic similarity of transaction nodes is not well exploited in AML. To address these challenges, we apply three augmentations to the original transaction graph to generate different views: edge dropping, feature dropping, and KNN graph construction.

Edge dropping (ED) Given the edge set E, edge dropping randomly removes a certain ratio \(\alpha \) of edges. It aims to reduce the model's reliance on the complete edge set and to help reveal more meaningful structures. Formally, this process can be presented as

$$\begin{aligned} \text {ED} \left( \mathcal {G} \right) = \left( \varvec{M_1} \odot E, X \right) , \end{aligned}$$
(2)

where \({M_1} \in {\{0, 1 \}}^{|E|}\) is an edge mask whose elements are 0 if the corresponding edge is dropped and 1 otherwise, and \(\odot \) denotes the element-wise product.

Feature dropping (FD) Given the node feature matrix X, feature dropping randomly discards the features of a certain portion \(\beta \) of nodes. Similarly, this operation is used to reduce reliance on the full feature set, improving the model's generalization and robustness. This procedure can be modeled as

$$\begin{aligned} \text {FD} \left( \mathcal {G} \right) = \left( E, \varvec{M_2} \odot X \right) , \end{aligned}$$
(3)

where \({M_2} \in {\{0, 1 \}}^{|V| \times d}\) is a masking matrix of the feature matrix X, whose jth row is all zeros if the features of the jth node are dropped.
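A minimal sketch of the two perturbation operations in Eqs. (2) and (3), assuming edges are stored as a 2×|E| index tensor; the Bernoulli masks below play the roles of \(M_1\) and \(M_2\), and the function names are illustrative.

```python
import torch

def drop_edges(edge_index, alpha):
    """Edge dropping (Eq. (2)): each edge is dropped independently with probability alpha."""
    keep = torch.rand(edge_index.size(1), device=edge_index.device) >= alpha  # mask M1
    return edge_index[:, keep]

def drop_node_features(X, beta):
    """Feature dropping (Eq. (3)): all features of a beta-fraction of nodes are zeroed out."""
    keep_rows = (torch.rand(X.size(0), device=X.device) >= beta).float()      # row mask M2
    return X * keep_rows.unsqueeze(-1)
```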

KNN graph construction To leverage the semantic similarity between different transactions, we propose to build a KNN graph view using raw node features. In detail, we first compute the similarity scores by matrix multiplication of X with its transpose. Then, we keep the k edges with the highest similarities for each node to obtain the augmented adjacency matrix \(A^{\textrm{KNN}}\). The process can be formulated as

$$\begin{aligned} A^{\textrm{KNN}} = \mathrm {top-}k\left( X X^{\top }\right) , \end{aligned}$$
(4)

where \(X \in \mathbb {R}^{|V| \times d}\) denotes the raw feature matrix and the \(\mathrm {top-}k(\cdot )\) function selects, for each node, the k most similar nodes. After that, we can obtain the new edge set \(E^{\textrm{KNN}}\) of the KNN graph by extracting edges from the adjacency matrix \(A^{\textrm{KNN}}\). The whole procedure can be defined as: \(\text {KNN} \left( \mathcal {G} \right) = \left( E^{\textrm{KNN}}, X \right) \).
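The KNN-view construction of Eq. (4) can be sketched as follows, using the raw dot-product similarity \(XX^{\top }\) and keeping the k most similar nodes per node (in practice the rows of X may be normalized first, which turns the scores into cosine similarities); this is an illustrative sketch, not the reference implementation.

```python
import torch

def knn_graph(X, k):
    """Build the edge index of the KNN view from the raw feature matrix X (Eq. (4))."""
    sim = X @ X.t()                                   # pairwise similarity scores
    sim.fill_diagonal_(-float('inf'))                 # exclude trivial self-matches
    topk = sim.topk(k, dim=1).indices                 # (|V|, k) most similar nodes per node
    src = torch.arange(X.size(0), device=X.device).repeat_interleave(k)
    dst = topk.reshape(-1)
    return torch.stack([src, dst])                    # E^KNN as a (2, |V|*k) index tensor
```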

We generate two perturbed views \(\mathcal {G^{'}}\) and \(\mathcal {G^{''}}\) and a KNN view \({\mathcal {G}}^\textrm{KNN}\) through the above strategies.

4.2.2 Graph Encoder

Graph encoding transforms complex graph structures into informative representations essential for classification. The graph encoder \(g(\cdot )\) is flexible and can be selected from common graph neural networks (GNNs), such as GCN, GAT, and GIN, among others. In this work, the default graph encoder is the graph isomorphism network (GIN) [49], due to its exceptional graph modeling ability. Theoretically, GIN is one of the most expressive message-passing GNNs, as its injective aggregation function makes it as powerful as the Weisfeiler–Lehman graph isomorphism test. It can be represented as

$$\begin{aligned} \begin{aligned} {h}_{i}^{(k)} = \operatorname {MLP}^{(k)}\bigg ((1+\varepsilon ^{(k)}){h}_{i}^{(k-1)} \\ +\operatorname {SUM}\left( \left\{ {h}_{j}^{(k-1)}: j \in \mathcal {N}_{i}\right\} \right) \bigg ), \end{aligned} \end{aligned}$$
(5)

where \(\operatorname {MLP}(\cdot )\) is a multi-layer perceptron, \(\varepsilon ^{(k)}\) denotes a learnable scalar parameter, \({h}_{i}^{(k)} \in \mathbb {R}^d\) is the representation of node i in the kth GNN layer, and \(h_{i}^{(0)}=x_i\) is the raw node feature. We can acquire the final representations of all nodes \(H = \{h_{1}^{(k)},h_{2}^{(k)},\ldots ,h_{|V|}^{(k)}\}\) by stacking k layers of GNN. We will omit the superscript (k) below for simplicity.
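The GIN update of Eq. (5) can be written as a small PyTorch module; stacking several such layers yields the shared encoder \(g(\cdot )\). This is a plain-PyTorch sketch (sum aggregation over neighbors, a two-layer MLP), not tied to any particular GNN library.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One GIN layer implementing Eq. (5)."""
    def __init__(self, dim):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))   # learnable epsilon^{(k)}
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, H, edge_index):
        src, dst = edge_index
        neigh_sum = torch.zeros_like(H)
        neigh_sum.index_add_(0, dst, H[src])      # SUM over neighbors N_i
        return self.mlp((1 + self.eps) * H + neigh_sum)
```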

We can obtain the representations of three augmented views by the shared graph encoder \(g(\cdot )\) as follows:

$$\begin{aligned} H^{'} = g(G^{'}), H^{''} = g(G^{''}), H^{\textrm{KNN}} = g(G^{\textrm{KNN}}). \end{aligned}$$
(6)

4.2.3 Contrastive Learning

GCL is a self-supervised graph learning technique that learns node representations by contrasting representations from different graph views. Before the contrastive learning task, a projection head \(Z = \operatorname{proj}(H)\) is applied to project the learned node representations into a latent space; a multi-layer perceptron (MLP) is used for this purpose in this work. Finally, following the principle of mutual information maximization (MIM), the objective of CL is to pull the representations of positive samples together and push those of negative samples apart. This is typically accomplished using the InfoNCE [50] loss as a lower bound of MIM. The GCL loss between \({\mathcal {G}}^{'}\) and \({\mathcal {G}}^{''}\) can be defined as

$$\begin{aligned} \mathcal {L}_\textrm{GCL}({\mathcal {G}}^{'}, {\mathcal {G}}^{''}) = \frac{1}{2 \left| \mathcal {B} \right| }\sum \limits _{v \in \mathcal {B}}\left[ \mathcal {L}_\textrm{MI}\left( z_{v}^{'}, z_{v}^{''} \right) + \mathcal {L}_\textrm{MI}\left( z_{v}^{''}, z_{v}^{'} \right) \right] , \end{aligned}$$
(7)
$$\begin{aligned} \mathcal {L}_{\textrm{MI}}(z_{i}^{'}, z_{i}^{''}) = \sum _{i \in \mathcal {B}} -\log \frac{\sum _{j \in \mathbb {P}_{i}} \exp \left( \operatorname {sim}\left( z_{i}^{'}, z_{j}^{''} \right) / \tau \right) }{\sum _{k \in \left\{ \mathbb {P}_{i} \cup \mathbb {N}_{i}\right\} } \exp \left( \operatorname {sim}\left( z_{i}^{'}, z_{k}^{''}\right) / \tau \right) },\end{aligned}$$
(8)

where \(\mathcal {B}\) denotes the node set in the current batch, \(\tau \) is a temperature factor, \(z_{v}^{'}\) and \(z_{v}^{''}\) denote projected representations of node v in views \({\mathcal {G}}^{'}\) and \({\mathcal {G}}^{''}\), respectively, and sim denotes the cosine similarity here. \(\mathbb {P}_{i}\) and \(\mathbb {N}_{i}\) represent the positive samples and negative samples of node i, respectively.

Based on the homophily assumption, we regard both connected neighbors and neighbors with similar features as positive samples. Specifically, we define the positive sample matrix \(M_\mathbb {P}\) as follows:

$$\begin{aligned} M_\mathbb {P} = A + A^{\textrm{KNN}}. \end{aligned}$$
(9)

The non-zero elements in the i-th row of \(M_\mathbb {P}\) correspond to the positive samples \(\mathbb {P}_{i}\) of node \(v_i\).
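The positive-sample matrix of Eq. (9) can be materialized from the original and KNN edge lists as follows; the diagonal is also set, since the same node in the other view always serves as a positive. A dense mask is shown purely for illustration; a sparse representation would be used at scale.

```python
import torch

def build_positive_mask(edge_index, knn_edge_index, num_nodes):
    """Dense positive-sample matrix M_P = A + A^KNN (Eq. (9)), plus the diagonal."""
    M = torch.zeros(num_nodes, num_nodes)
    M[edge_index[0], edge_index[1]] = 1.0            # neighbors in the original graph (A)
    M[knn_edge_index[0], knn_edge_index[1]] = 1.0    # neighbors in the KNN view (A^KNN)
    M.fill_diagonal_(1.0)                            # the node itself in the other view
    return M
```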

Similarly, we can calculate the contrastive loss between \({\mathcal {G}}^{'}\) and \({\mathcal {G}}^{\textrm{KNN}}\). The final loss of graph contrastive learning can be presented as

$$\begin{aligned} \mathcal {L}_\textrm{pretrain}=\lambda \mathcal {L}_{\textrm{GCL}}({\mathcal {G}}^{'}, {\mathcal {G}}^{''}) + (1 - \lambda ) \mathcal {L}_\textrm{GCL}({\mathcal {G}}^{'}, {\mathcal {G}}^\textrm{KNN}), \end{aligned}$$
(10)

where \(\lambda \) is the hyper-parameter to control the weight of each loss.
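A compact sketch of the contrastive objective in Eqs. (7), (8), and (10), assuming the projected representations of the three views and the positive mask from Eq. (9) are given; every in-batch node that is not a positive of node i acts as a negative. Variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def mi_loss(Z1, Z2, pos_mask, tau=0.5):
    """L_MI of Eq. (8): cosine similarities, positives given by pos_mask, rest are negatives."""
    Z1, Z2 = F.normalize(Z1, dim=1), F.normalize(Z2, dim=1)
    logits = torch.exp((Z1 @ Z2.t()) / tau)
    pos = (logits * pos_mask).sum(dim=1)             # numerator: sum over P_i
    return -torch.log(pos / logits.sum(dim=1)).mean()

def pretrain_loss(Z1, Z2, Zknn, pos_mask, lam=0.3, tau=0.5):
    """Eq. (10): weighted sum of the two symmetric cross-view GCL terms (Eq. (7))."""
    l_rand = 0.5 * (mi_loss(Z1, Z2, pos_mask, tau) + mi_loss(Z2, Z1, pos_mask, tau))
    l_knn = 0.5 * (mi_loss(Z1, Zknn, pos_mask, tau) + mi_loss(Zknn, Z1, pos_mask, tau))
    return lam * l_rand + (1.0 - lam) * l_knn
```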

Theoretical analysis. We now show that minimizing the loss in Eq. (8) maximizes a lower bound on the mutual information between the two views. Following Ref. [51], the derivation proceeds as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\textrm{MI}}(z_{i}^{'}, z_{i}^{''})&= \sum _{i \in \mathcal {B}} -\log \frac{\sum _{j \in \mathbb {P}_{i}} \exp \left( \operatorname {sim}\left( z_{i}^{'}, z_{j}^{''} \right) / \tau \right) }{\sum _{k \in \left\{ \mathbb {P}_{i} \cup \mathbb {N}_{i}\right\} } \exp \left( \operatorname {sim}\left( z_{i}^{'}, z_{k}^{''}\right) / \tau \right) }, \\&=\mathbb {E}_{Z}\left[ -\log \frac{\sum _{j \in \mathbb {P}_{i}} \exp \left( \operatorname {sim}\left( z_{i}^{'}, z_{j}^{''} \right) / \tau \right) }{\sum _{k \in \left\{ \mathbb {P}_{i} \cup \mathbb {N}_{i}\right\} } \exp \left( \operatorname {sim}\left( z_{i}^{'}, z_{k}^{''}\right) / \tau \right) }\right] \\&\ge \mathbb {E}_{Z}\left[ -\log \frac{\exp \left( \operatorname {sim}\left( z_{i}^{'}, z_{i}^{''} \right) / \tau \right) }{\exp \left( \operatorname {sim}\left( z_{i}^{'}, z_{i}^{''} \right) / \tau \right) + \sum _{k \in \mathbb {N}_{i}} \exp \left( \operatorname {sim}\left( z_{i}^{'}, z_{k}^{''}\right) / \tau \right) }\right] \\&=\mathbb {E}_{Z}\left[ -\log \frac{p(z_{i}^{''} \mid z_{i}^{'}) / p(z_{i}^{''})}{p(z_{i}^{''} \mid z_{i}^{'}) / p(z_{i}^{''})+\Sigma _{k \in \mathbb {N}_{i}} p\left( z_{k}^{''} \mid z_{i}^{'}\right) / p\left( z_{k}^{''}\right) }\right] \\&=\mathbb {E}_{Z}\log \left[ 1 + \frac{p(z_{i}^{''})}{p(z_{i}^{''} \mid z_{i}^{'})} \Sigma _{k \in \mathbb {N}_{i}} \frac{p\left( z_{k}^{''} \mid z_{i}^{'}\right) }{p\left( z_{k}^{''}\right) } \right] \\&\approx \mathbb {E}_{Z} \log \left[ 1 + \frac{p(z_{i}^{''})}{p(z_{i}^{''} \mid z_{i}^{'})} (|\mathcal {B}|-1) \mathbb {E}_{z_{k}^{''}}\frac{p\left( z_{k}^{''} \mid z_{i}^{'}\right) }{p\left( z_{k}^{''}\right) } \right] \\&= \mathbb {E}_{Z} \log \left[ 1 + \frac{p(z_{i}^{''})}{p(z_{i}^{''} \mid z_{i}^{'})} (|\mathcal {B}|-1) \right] \\&\ge \mathbb {E}_{Z} \log \left[ \frac{p(z_{i}^{''})}{p(z_{i}^{''} \mid z_{i}^{'})} |\mathcal {B}| \right] \\&=-I(z_{i}^{'}, z_{i}^{''})+\log (|\mathcal {B}|). \end{aligned} \end{aligned}$$
(11)

Therefore, \(I(z_{i}^{'}, z_{i}^{''})\ge \log (|\mathcal {B}|) - \mathcal {L}_{\textrm{MI}}(z_{i}^{'}, z_{i}^{''})\), where \(I(z_{i}^{'}, z_{i}^{''})\) denotes the mutual information between the two views. Hence, minimizing the proposed optimization objective maximizes a lower bound on the mutual information.

4.3 Supervised Classification

Once graph contrastive pre-training is completed, we reuse the pre-trained graph encoder \(g(\cdot )\) (without the projection head \(\operatorname{proj}(\cdot )\)) to obtain node representations \(H = \{h_{1},h_{2},\ldots ,h_{|V|}\}\). Afterward, H and the raw features X are concatenated and fed into a classifier to obtain the classification score. The whole process can be presented as

$$\begin{aligned} P = \sigma (\textrm{classifier}(H||X)), \end{aligned}$$
(12)

where \(P \in \mathbb {R}^{|V|}\) denotes the predicted probabilities of illegal transactions, \(\sigma \) denotes the softmax function, and || is the concatenation operation. Without loss of generality, classifier\((\cdot )\) can be an arbitrary supervised classification model. We use a multi-layer perceptron in this work due to its effectiveness and efficiency.

Finally, we use the binary cross-entropy loss to optimize the classifier, which can be presented as

$$\begin{aligned} \mathcal {L}_\textrm{AML} = -\frac{1}{|V_\textrm{label}|} \sum _{v \in V_\textrm{label}} \left[ Y_v \log (P_v) + (1-Y_v) \log (1-P_v) \right] , \end{aligned}$$
(13)

where \(V_\textrm{label}\) denotes the set of labeled training nodes, \(Y_v\) denotes the label of node v (\(Y_v=1\) indicates an illegal transaction), and \(P_v\) is the predicted probability that node v is an illegal transaction.
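A minimal sketch of the supervised stage in Eqs. (12) and (13). For this binary task we use a sigmoid output, which is equivalent to a two-class softmax, and compute the binary cross-entropy only over the labeled nodes; the module and function names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMLClassifier(nn.Module):
    """MLP classifier over the concatenation of pre-trained embeddings H and raw features X."""
    def __init__(self, emb_dim, feat_dim, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(emb_dim + feat_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, H, X):
        return torch.sigmoid(self.mlp(torch.cat([H, X], dim=1))).squeeze(-1)  # Eq. (12)

def aml_loss(P, Y, labeled_idx):
    """Binary cross-entropy over the labeled nodes only (Eq. (13))."""
    return F.binary_cross_entropy(P[labeled_idx], Y[labeled_idx].float())
```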

5 Experiments

5.1 Experimental Settings

5.1.1 Dataset

We use two datasets to evaluate our proposed model. The Elliptic dataset [26] consists of 203,769 nodes (transactions) and 234,355 edges (transaction flows). Of these, 4545 (2%) transactions are labeled as illegal, 42,019 (21%) are labeled as legal, and the remaining transactions are unlabeled. Each transaction is represented by 166 features. The first 94 features capture characteristics of the transaction itself (e.g., time step, number of inputs/outputs, and transaction fee). The remaining 72 features are aggregated characteristics based on information from neighboring transactions, including the maximum, minimum, standard deviation, and correlation coefficients of similar data from neighboring transactions. We denote the local raw features (the first 94 features) and the full set of raw features (94 local features plus 72 aggregated features) as LF and AF, respectively.
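For clarity, the two feature subsets amount to simple column slicing of the loaded feature matrix; the array below is only a placeholder, and the column ranges follow the description above.

```python
import numpy as np

X_all = np.zeros((203769, 166))   # placeholder for the loaded Elliptic node features
X_LF = X_all[:, :94]              # LF: the 94 local transaction features
X_AF = X_all                      # AF: all 166 features (94 local + 72 aggregated)
```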

AMLworld [52] is a recently released synthetic AML dataset generated by a simulator that builds a multi-agent virtual world of banks, individuals, and companies. Nine types of illegal activity, such as extortion, loan sharking, and gambling, generate illicit funds in AMLworld; the amount of money obtained and the frequency of these illegal operations depend on the activity and the entity performing it. The illicit proceeds are then placed into the financial system. Two high-level groups, HI and LI, with higher and lower illicit (laundering) ratios, respectively, are derived from the data. Separate HI and LI datasets are available in small, medium, and large sizes, with the large datasets containing between 175 M and 180 M transactions. We use the HI-Small dataset due to its manageable size and high illicit ratio.

5.1.2 Evaluation Metrics

In real-world detection tasks, the metrics of F1-score, accuracy, recall, and precision provide valuable insights into different aspects of a model’s effectiveness.

Accuracy reflects the overall proportion of correctly identified instances, but it can be misleading in class-imbalanced settings, such as anti-money laundering (AML), where true-positive cases are often rare.

Recall measures the model’s ability to identify all true suspicious cases, which is particularly important in fields like disease diagnosis and cybersecurity, where capturing as many true positives as possible is crucial.

Precision indicates the model’s ability to correctly identify positive cases among all those labeled as positive. It is especially critical in applications such as financial risk management and spam detection, where avoiding false warnings or misclassifications is paramount.

F1-score provides a balanced measure of precision and recall, capturing both the accuracy and sensitivity of the model. This makes it particularly well suited for evaluating AML models, where both false positives and false negatives have significant implications.

5.1.3 Baseline Methods

We conduct a comprehensive comparison of GCPAL with a variety of established baseline methods.

GNN-based methods:

  • GCN [53] is a neural network architecture that performs convolutional operations on graph-structured data, using localized node features and graph topology to learn node embeddings.

  • GAT [54] utilizes the attention mechanism to weigh the importance of different neighbors during message passing.

  • GraphSAGE [55] is a scalable graph neural network framework that learns node embeddings by neighborhood sampling and aggregation. By generalizing over local structures, it generates node representations in large graphs efficiently.

  • GIN [49] improves graph representation learning with an injective, MLP-based aggregation function, allowing superior discrimination between non-isomorphic graphs.

  • Skip-GCN [26] adds a skip connection between the intermediate feature embeddings and the raw features before the last GCN layer.

  • Evolve-GCN [26] uses a separate GCN for each time step and connects them with an RNN to better capture system dynamics.

  • OCGTL [56] is a graph-level anomaly detection model, which combines the advantages of deep one-class classification and self-neural transformation learning.

Non-graph-based methods:

  • Logistic regression is a binary classification model based on linear combinations of input features.

  • MLP conducts multi-layer non-linear transformation on the input features for classification.

Self-supervised GNN methods:

  • Inspection-L [38] is a strong baseline method that uses deep graph infomax (DGI) pre-training followed by downstream supervised training for AML detection.

  • GAE [57] is a graph encoder–decoder framework that embeds nodes into low-dimensional representations and reconstructs the input graph structure.

  • GraphMAE [58] is a masked graph autoencoder that focuses on reconstructing the masked feature.

5.1.4 Implementation Details

We set the embedding size to 128 for baseline models. The batch size is searched from \(\{64, 128,\ldots , 2048\}\). The edge and feature dropout ratios are both searched from \(\{0.1, 0.3,\ldots , 0.9\}\). The weight \(\lambda \) is tuned from \(\{0.1, 0.2,\ldots , 1.0\}\). The temperature factor \(\tau \) is searched from \(\{0.05, 0.1, 0.2, 0.5,\ldots , 5.0\}\). The neighbor number of the KNN graph is tuned from \(\{1, 3, 5, 10,\ldots , 50\}\). The data ratio of supervised training is set as \(\{1\%, 2\%, 5\%, 10\%, 20\%\}\). We run each method five times with different random seeds and report the average performance along with its standard deviation. Model training is stopped if there is no improvement in performance after 50 consecutive epochs (Tables 1, 2).

5.2 Performance Comparison

We compare GCPAL with SOTA AML detection models to show its effectiveness. The results are presented in Tables 1 and 2. The superscript LF or AF on models in Table 1 indicates whether local raw features or all raw features are used. We use bold fonts to highlight the best results. According to the results, we have the following observations:

For GNN-based methods, OCGTL demonstrates the best performance, even when the training ratios are extremely small (e.g., 1% and 2% on Elliptic and 40% on AMLworld). As the training data ratio increases, the advantage of OCGTL becomes even more pronounced. For instance, OCGTL achieves the highest F1 scores (e.g., 80.3%, 82.7%, and 84.6%) among GNN models under the 5%, 10%, and 20% training ratios on Elliptic.

For non-graph-based methods, MLP significantly outperforms logistic regression across all training ratios due to its superior ability to model non-linear relationships. Moreover, as the training ratio increases, the benefit of using aggregated features (AF) becomes more evident. For instance, the F1 scores of \(\text {MLP}^{\text {AF}}\) surpass those of \(\text {MLP}^{\text {LF}}\) only when the training ratio exceeds 2%. The reason might be that MLP struggles to effectively utilize the complex aggregated features at smaller training ratios, as MLP models typically require large amounts of data to learn effectively.

Among self-supervised GNN methods, GAE performs poorly, likely due to the sparse interactions in AML, which do not provide enough self-supervised signals. GraphMAE shows competitive performance, as the reconstruction of masked features provides intrinsic knowledge that helps reduce reliance on labels. Inspection-L achieves the second-best performance across all training ratios, trailing only GCPAL and outperforming the other baseline models. Notably, it attains high F1 scores (76% and 77.9%) on the Elliptic dataset, even with training ratios as low as 1% and 2%. These results highlight the effectiveness of self-supervised pre-training, which uses self-supervised signals to guide model optimization, leading to strong generalization capabilities with limited training data.

Finally, our proposed GCPAL significantly outperforms all other methods across all training ratios. For instance, GCPAL achieves the highest F1 scores of 78% and 87.3% at the lowest (1%) and highest (20%) training ratios, respectively. The performance improvement stems from the graph contrastive learning pre-training, which enables the model to learn more fine-grained node representations. Compared to Inspection-L, our method introduces a greater number of negative samples during self-supervised training, which enhances the discriminative power of the learned representations, leading to superior performance.

Table 1 Anti-money laundering detection results of GCPAL and baseline methods on the Elliptic dataset with different training ratios. We use bold fonts to highlight the best results
Table 2 Anti-money laundering detection results of GCPAL and baseline methods on the AMLworld dataset with training ratios 40\(\%\) and 60\(\%\)

5.3 Ablation and Variant Study

5.3.1 Ablation Study

Our proposed GCPAL model incorporates several essential components, including GCL between random graph views, GCL between random and KNN views, and the connected neighbor-based positive sample selection. To assess the contribution of each component, we perform an ablation study with the following variants:

  • w/o randGCL: This variant removes the GCL between random graph views from the full GCPAL model. Specifically, the loss term \(\mathcal {L}_\textrm{GCL}(\mathcal {G}^{'}, \mathcal {G}^{''})\) is discarded in Eq. 10.

  • w/o KNNGCL: Similarly, this variant removes the GCL between the random and KNN graph views by discarding the loss term \(\mathcal {L}_\textrm{GCL}(\mathcal {G}^{'}, \mathcal {G}^{\textrm{KNN}})\) in Eq. 10.

  • w/o neighbor pos: In this variant, we remove the neighbor-based positive sample selection strategy. As a result, positive pairs can only be the same nodes in different graph views.

We observe the following according to Tables 3 and 4: (1) All components contribute to the final performance of GCPAL. For example, the variants \(\text {w/o randGCL}^{\text {LF}}\) and \(\text {w/o KNNGCL}^{\text {LF}}\) yield F1 scores of 75.9% and 76.8%, respectively, whereas the full \(\text {GCPAL}^{\text {LF}}\) produces the highest score of 78.9%. (2) Among the three variants, \(\text {w/o randGCL}^{\text {LF}}\) shows the least performance degradation. It yields the second-best F1 score (75.9%), only slightly worse than \(\text {GCPAL}^{\text {LF}}\). This may be because the KNNGCL component still benefits from the random graph view in its GCL process, mitigating the loss from omitting randGCL. (3) The variant \(\text {w/o neighbor pos}\) fails to outperform the other variants, supporting the effectiveness of our neighbor-based positive sample selection strategy. This strategy removes semantically similar samples from the negative set, addressing the pseudo-negative sample issue.

Table 3 Ablation study of GCPAL on the Elliptic dataset. We use bold fonts to highlight the best results
Table 4 Ablation study of GCPAL on the AMLworld dataset. We use bold fonts to highlight the best results
Table 5 The performances of different GNNs on the Elliptic dataset. We use bold fonts to highlight the best results
Table 6 The performances of different GNNs on the AMLworld dataset. We use bold fonts to highlight the best results

5.3.2 Analysis of Different GNNs on GCPAL

In the GCPAL model, we have chosen GIN as the base graph encoder due to its strong graph modeling capabilities. However, the performance of GCPAL with other commonly used GNN models warrants further exploration. To investigate this, we integrate three different GNN architectures (GCN, GAT, and GraphSAGE) into the GCPAL framework and evaluate their impact. The results, presented in Tables 5 and 6, demonstrate that the choice of graph encoder architecture significantly affects GCPAL’s performance. Among the three alternative architectures, GraphSAGE consistently achieves the highest scores, followed closely by GAT and GCN. However, the GCPAL framework with the GIN encoder yields the best performance overall. This highlights the importance of selecting the most suitable graph neural network architecture for graph encoding, as different models may vary in their ability to capture complex relationships among nodes in the graph.

5.3.3 Analysis of Different Feature Subsets for KNN Graph Construction

We have investigated the influence of using different feature subsets (LF and AF) for KNN graph construction on the Elliptic dataset. The results are presented in Table 7. We observe that, when the features used for AML model training remain constant, the impact of varying the KNN graph construction is minor. For example, when the training ratio is 1% and the AML training feature set is LF, the F1 scores achieved by KNN-LF and KNN-AF are 0.780±0.012 and 0.779±0.022, respectively. This minor performance difference is consistent across all training ratios. We conjecture the reasons are twofold: (1) The LF features are sufficiently expressive to capture the essential correlations between nodes, ensuring that the constructed KNN graph remains robust and minimally affected by feature variations. (2) The two randomly augmented graph views provide graph information that compensates for minor changes in the KNN graph, maintaining overall performance consistency.

Table 7 AML detection results using different features on the Elliptic dataset

5.4 Hyper-parameter Analysis

Fig. 2 Performance comparison of GCPAL w.r.t. different loss weight \(\lambda \), temperature factor \(\tau \), and neighbor number k

Fig. 3 Performance comparison and robustness analysis of GCPAL

5.4.1 Influence of Weight \(\lambda \)

The hyper-parameter \(\lambda \) and its counterpart, \(1 - \lambda \), serve as weights for the two contrastive targets in Eq. 10, specifically the contrasts between random views and the contrasts between random and KNN views. Figure 2a illustrates the performance trend of GCPAL as \(\lambda \) varies. From the figure, we observe that the optimal value of \(\lambda \) is located near the middle of the curve, rather than at the extremes (0 or 1). This suggests that both contrastive targets contribute positively to model training, with the best performance achieved when both targets are utilized. Additionally, GCPAL performs best when \(\lambda \) is set to a smaller value (e.g., 0.1 for Elliptic and 0.3 for AMLworld). This indicates that contrasts between random and KNN views have a more significant impact on performance compared to the random view contrasts alone. We hypothesize that the KNN view, which relies on node correlations through raw features, better captures the similarities between illicit transactions.

5.4.2 Influence of Temperature Factor

In contrastive learning, the temperature factor \(\tau \) controls the sharpness of the softmax distribution, which in turn affects the decision boundary between positive and negative samples. Figure 2b shows a bell-curve relationship between \(\tau \) and performance, with the peak observed at \(\tau = 0.5\) on the AMLworld dataset before the performance begins to decline. This suggests that a moderate temperature value strikes an optimal balance, allowing the model to focus on informative positive pairs while also exploring a diverse range of negative samples, thereby enhancing representation learning. On the other hand, an excessively high value or low value of \(\tau \) disrupts this balance, either diminishing the model’s ability to discriminate between positive and negative samples or hindering convergence. This emphasizes the importance of carefully tuning \(\tau \) for achieving optimal performance.

5.4.3 Influence of Neighbor Number k

The number of neighbors, k, used to construct the KNN graph plays a crucial role in the performance of the GCPAL model. As shown in Fig. 2c, when k is small (e.g., 1 or 3), the F1 score is low, suggesting that the model is unable to effectively leverage the KNN view. As k increases, the F1 score generally improves, reaching its peak at \(k = 15\), before declining. This indicates that a moderate number of neighbors provides the most valuable information for the model. When k is too small, the model struggles to capture meaningful relationships between nodes, leading to suboptimal performance. Conversely, a very large k introduces excess noise from irrelevant neighbors, which can hinder the model’s ability to distinguish between legal and illegal transactions. Therefore, selecting an appropriate k is critical for optimizing performance and capturing the underlying patterns of illicit transactions.

5.4.4 Influence of Batch Size

We examine the impact of batch size on the training of the GCPAL model. As shown in Fig. 3a, smaller batch sizes tend to yield slightly higher F1 scores compared to larger ones. However, the difference between the results from varying batch sizes is minimal. This suggests that the GCPAL model is relatively insensitive to changes in batch size, demonstrating consistent performance across a broad range of batch sizes. This stability may be due to the model’s ability to effectively incorporate information from both the KNN graph and the random edge/feature drop perspectives, which helps mitigate the influence of batch size variations.

5.4.5 Influence of Dropout Ratio \(\alpha \) and \(\beta \)

The performance of the GCPAL model is influenced by the dropout ratios \(\alpha \) and \(\beta \), which are used to create graph views through edge and feature dropout. Figure 3b demonstrates how the F1 score changes with different combinations of dropout ratios. The results show that when the feature dropout rate is low (e.g., 0.1, 0.3, or 0.5), GCPAL achieves high F1 scores. However, when the feature dropout rate exceeds 0.5, the model’s performance degrades significantly. In contrast, when the feature dropout rate is low, a higher edge dropout rate still yields good performance. This indicates that features play a more crucial role in detecting illegal transactions, necessitating a low feature dropout rate. Additionally, this supports our previous observation that a large number of edges in the transaction network may be redundant for illegal transaction detection, thus allowing a higher edge dropout rate.

Fig. 4 Confusion matrices of different methods under the 1% training ratio

5.4.6 Robust Analysis

We also conduct experiments on the Elliptic dataset to evaluate GCPAL’s robustness against noisy features and edges. Specifically, we randomly add noise edges (e.g., by connecting nodes at random) and noise features (e.g., by replacing some feature values with Gaussian noise) at varying proportions (5%, 10%, 15%, and 20%) to the clean data. Figure 3c shows that adding noise reduces the performance of all three methods. However, GCPAL consistently exhibits a lower performance drop compared to the others, and even with a 20% noise ratio, it outperforms Inspection-L. This is likely due to GCPAL’s use of edge and feature dropout augmentations to mitigate noise effects, as well as its inclusion of more negative samples in the graph contrastive pre-training, which enhances its discriminative power and robustness to noise perturbations.

5.4.7 Analysis of Confusion Matrix

Confusion matrices provide valuable insights into the performance of various classification models. In this section, we present the confusion matrices for four methods with a training ratio of 1%. Ideally, a perfect classifier would correctly classify all instances, placing them along the matrix’s diagonal. As shown in Fig. 4, GCPAL consistently identifies the most illegal transactions accurately, whether using LF or AF features, while GraphSAGE performs the worst due to its limitations with small graphs. These results highlight the effectiveness of GCPAL in detecting illicit transactions by learning more accurate node representations.

5.5 Discussion for Real-World Applications

Although our GCPAL model has been validated on the Elliptic and AMLworld datasets, several challenges remain for real-world deployment. (1) Scalability for large-scale transaction networks: As the number of nodes and edges increases, memory costs and training time rise significantly. To address these challenges, we propose integrating subgraph sampling and graph condensation techniques. Subgraph sampling [59, 60] enables training on representative portions of the network, reducing computational overhead while retaining essential structural information. Similarly, graph condensation methods [61, 62], such as sparsification or dimensionality reduction, simplify the graph structure and reduce memory usage without losing critical relationships. These techniques enhance scalability, enabling efficient application of our model to large networks. (2) Data privacy concerns: to ensure data privacy in real-world scenarios, we can incorporate federated learning technology. Federated learning [63, 64] allows training across decentralized devices or servers, where data remain local and only model updates are shared, thus preserving data confidentiality while improving model performance collaboratively. Furthermore, while our model is tailored for AML, it can be adapted to other domains such as rumor detection [65, 66], malware detection [67, 68], and fraud detection [69, 70]. The data formats across these applications are largely similar, so minimal preprocessing is required. In many anomaly detection tasks, labeled data are scarce, but data augmentation and contrastive learning can help reduce the dependency on labeled data. With minor adjustments, our GCPAL framework can be applied to various domains.

6 Conclusion and Future Work

This paper proposes a novel graph contrastive learning framework for AML. We leverage contrastive learning to improve the model's expressiveness from multiple augmented views. Extensive experiments and in-depth analysis demonstrate that GCPAL outperforms SOTA AML baselines, especially with scarce labeled data (e.g., 1\(\%\) and 2\(\%\) of the training data). In the future, we plan to focus on the development of learnable augmentations to minimize the distortion of the original semantics caused by random perturbations. In addition, fraudulent nodes in transaction networks are usually surrounded by normal nodes that they have defrauded. This inherently heterophilic structure plays an important role in AML. Therefore, the problem of heterophily in transaction networks is also a priority for future research.