1 Introduction

Deep learning is a branch of machine learning (ML) that has driven significant advances on many long-standing problems in artificial intelligence (AI). The latest deep learning techniques have demonstrated exceptional performance across various applications, including audio and speech processing, visual data processing and natural language processing [1]. Deep learning is also a key technology for autonomous cars, enabling them to recognise objects and situations on the road [2]. In addition, it has been shown to discover complex structures in high-dimensional data [3].

In recent years, the world has witnessed intense competition in designing and training deep neural network (DNN) models. Such models have achieved remarkable accuracy that sometimes surpasses human performance. DNNs now play an essential role in critical applications such as self-driving cars, classification, voice recognition and automatic text generation, and they can also be used for security and protection [4]. DNNs handle several categories of data, including images [5], video [6], audio [7] and text [8]. However, training such models requires a great deal of time and an immense amount of data. For instance, training a deep ResNet on the ImageNet dataset with the latest GPUs took several weeks [9]. Hence, some pre-trained models are made available online for free, allowing users to quickly test a specific model without any training steps. For example, models trained with the Caffe framework for various tasks are offered for free through the “Model Zoo” repository [10]. As the industrial value of owning pre-trained models became widely known, attackers started trying to steal them [11]. DNN models, in general, are not sufficiently secured. Once pre-trained models are obtained, legally or illegally, they can be copied, tampered with and redistributed without the owner's consent. Consequently, it is crucial to protect the ownership of DNN models. Research in this field is still in its infancy [12].

A crucial challenge is developing a reliable and secure method for authenticating DNN models. This is a relatively new area for the ML community, although the underlying problem is well explored in the security community under the concept of digital watermarking. Generally, there are two recent approaches to protecting pre-trained DNN models. The first is to use steganography to hide ownership information in a DNN model [13]. The second approach, which is this paper's focus, is watermarking.

Digital watermarking is the process of robustly hiding information in a signal (text, image, video or audio) to verify its authenticity. Watermarking has been widely researched for digital keys and digital media [14]. It protects the ownership of digital content such as text, images, sound and video, is primarily used for copyright verification and involves embedding a small amount of confidential data within multimedia files [15]. It can provide privacy, authentication and ownership protection for the transmitted information [16], and it is commonly employed as a means of preserving ownership [17]. The first technique for embedding a watermark in a neural network that could be shared publicly, and thus needs ownership verification, was proposed by Uchida et al. [18] and Nagai et al. [10]. In this case, the neural network and its learned parameters are the marked objects. However, this technique requires direct access to the model weights, treating the model as a white box.

Transfer learning [19] is an effective strategy that allows users to adapt pre-trained models to other tasks with less retraining time. Fine-tuning is a form of transfer learning in which a pre-trained model is adjusted to optimise performance on a specific task; it involves additional passes (epochs) over the training data to update the model's weights and minimise the loss function. After a protection scheme is applied, fine-tuning helps maintain model accuracy and robustness by improving performance under distribution shifts, preventing overfitting and reducing computational costs. It is a delicate balance that, when done correctly, can yield significant benefits. However, transfer learning and fine-tuning may raise intellectual property issues in the near future. Furthermore, digital marketplaces for buying and selling pre-trained models may emerge. In this situation, protecting the copyright of shared pre-trained models is necessary.

Most existing watermarking methods cannot efficiently handle different types of DNNs. Designing a good watermark for DNN security is hard because the watermark should not affect the DNN's performance on its original task, such as classification or regression. Moreover, DNN owners usually prefer a watermarking algorithm that can prove their ownership rather than simple hash functions computed over the weight matrices [20].

The main contribution of this paper is the protection of pre-trained models as intellectual property. It encompasses three elements:

  1.

    Utilisation of adversarial attacks as watermarks to safeguard the ownership of DNN models.

  2.

    Establishment of a robust hybrid two-level protection system, ensuring the resilience of one level in case of failure of the other. This robustness is developed by applying five sequenced proposals.

  3.

    Evaluation of the watermarking system by subjecting it to seven attack types: Fast Gradient Method Attack, Auto Projected Gradient Descent Attack, Auto Conjugate Gradient Attack, Basic Iterative Method Attack, Momentum Iterative Method Attack, Square Attack and Auto Attack.

The rest of this paper is structured as follows. Section 2 reviews previous research on watermarking DNN models. Section 3 presents the proposed system, and Section 4 illustrates its verification method. Section 5 provides the discussion and comparative study. Finally, Section 7 lists the conclusions of this paper.

2 Related work

DNN models are widely used and valuable in today's world, and they have generally outperformed traditional ML algorithms. Nevertheless, creating a DNN model requires substantial resources, such as time, computational power and data. DNNs typically have a large number of parameters because their architecture comprises multiple layers, each with numerous neurons. The size of DNNs has grown enormously, from the first CNN model with 60k parameters [21] to the VGG-16 model with 138M parameters [19]. This increase in parameters makes deep learning computation costly. Therefore, pruning is a viable option. Pruning aims to remove unnecessary parameters without compromising the performance of the original DNN, thereby improving the model's efficiency [22,23,24].

Training DNNs from scratch requires a lot of data and many training steps. Consequently, sometimes, it is more convenient to fine-tune existing models when the data for training is scarce [25, 26]. Fine-tuning can be a good option if the dataset is similar to the one on which the pre-trained model was trained. Thus, fine-tuning can be a highly effective way for plagiarisers to use a stolen model and train a new one with less data. The new model will perform the same as the stolen one but look different [27].

DNN model developers can monetise their work by licensing or selling access to their pre-trained models. However, some developers are worried that others may steal or share their DNN models without permission [28]. Intrinsically, DNN models are insecure and can be copied and shared without authorisation. Such unpermitted copying and sharing can lead to problems such as loss of ownership rights and malicious modification of the models. Therefore, finding ways to protect shared trained DNNs from misuse is essential. DNN protection is a new and challenging research area [29].

The first idea of using a watermark to protect a DNN was presented by Uchida et al. [18]. They designed a method for embedding a digital watermark into a DNN to claim ownership rights, using a parameter regulariser to insert the watermark into the model. They demonstrated that the DNN's performance did not degrade after adding the watermark, and that the watermark remained intact after parameter pruning or fine-tuning, surviving even when 65% of the parameters were pruned. The drawback of this work is that the robustness of the watermark against diverse types of attacks (such as model inversion attacks and adversarial attacks) is not thoroughly discussed, so it is unclear how resilient the watermark would be against sophisticated attempts to remove or alter it.

To protect the rights of the owners of trained DNN models, Nagai et al. [10] developed a method to embed watermarks in them. They specified the conditions and requirements for watermark embedding and evaluated their method against various attacks. They also used a regulariser to modify the DNN model parameters with the watermark. Their experiments showed that the technique was robust to different attacks and did not affect DNN performance.

A framework called DeepSigns that can robustly and reliably embed watermarks into DNNs was introduced by Rouhani et al. [30]. They used the owner’s signature as a watermark and inserted it into the data abstraction's probability density function (pdf) derived from different layers of a deep learning model. Their goal was to protect the model’s intellectual property rights. DeepSigns can work with both black-box and white-box models. Their framework can also defend against overwriting attacks, a significant advantage of their method.

A black-box method for embedding watermarks into DNNs was proposed by Adi et al. [20]. They demonstrated a practical analysis framework that can perform classification tasks without affecting the model’s original purpose. They claimed that their method can use random labels and random training instances to watermark DNNs. They also discussed the possible attacks and showed how their method can resist them.

A watermarking method for DNNs was developed by Zhang et al. [27], which generated and embedded different watermarks into the DNNs. They could remotely verify the ownership of the DNNs using a few application programming interface (API) queries. They showed that their method was resistant to various attacks, such as fine-tuning and parameter pruning. Their method could quickly verify the ownership of any deep learning model without compromising the model’s accuracy.

Le Merrer et al. [31] aimed to safeguard any machine learning model running remotely, not just the neural network. They used a zero-bit watermarking model that could exploit adversarial examples to embed a mark in the model’s behaviour. This type of watermark, along with the corresponding key to check it, should suffice for anyone suspecting illegal use of the model to verify its authenticity. They minimised the impact of the watermark on the model's performance and enabled its extraction with few queries. They applied their model to the MNIST dataset with three different neural networks specifically created for image classification, as this is a common machine learning task. Their model was resilient against overwriting, compression and transfer learning attacks. They also planned to explore other domains in the future, such as image semantic segmentation or regression, where adversarial examples are also relevant. The drawback of this paper is that they did not check their watermark robustness against adversarial attacks such as Fast Gradient Method Attack, Auto Projected Gradient Descent Attack and Auto Conjugate Gradient Attack.

A watermarking technique for DNNs was proposed by Wang et al. [32], which used an independent neural network to mark the DNNs selectively. The watermark was inserted and extracted using error back-propagation, and the independent neural network was only used in training and verification but not publicly released. Their experiments demonstrated that their watermarking method did not affect the performance of the DNNs and that it was robust to common attacks such as compression and fine-tuning. Their method offered high fidelity, capacity and robustness for DNN security.

A critical DNN model for medical X-ray images was developed and secured by Gupta et al. [33]. They used a watermarking technique to protect their model from intellectual property theft, as their model dealt with sensitive data of coronavirus disease patients. They trained their model on 2000 chest X-rays of infected and non-infected people and achieved over 96% accuracy. Their model could estimate the probability of infection and help in early detection and prevention of the disease. They claimed that their watermarking method ensured the safety of their valuable model, which could be a lifesaver in the pandemic.

In [34], Bangyal et al. started by pre-processing the fake news dataset, which involved replacing missing values, noise removal, tokenisation and stemming. They applied a semantic model with term frequency and inverse document frequency weighting for data representation. They applied eight machine learning algorithms and four deep learning models in the evaluation step. Based on the results, they developed a highly efficient prediction model with Python and trained and evaluated the classification model according to performance measures. The model was then tested on a set of unclassified fake news on COVID-19 to predict the sentiment class of each piece of news. The results demonstrated high accuracy compared to other models. Also, in [35], Contreras et al. presented a study that uses a Spanish-language Transformers model for sentiment analysis of tweets in Mexico during the COVID-19 pandemic, demonstrating high precision compared to other models. Additionally, Bangyal et al. [36] investigated the use of machine learning algorithms for classifying the sentiment of tweets into positive, negative or neutral categories, emphasising its importance in building business decision support systems.

3 Proposed hybrid two-level protection system

This paper presents a two-level protection system to preserve the ownership of pre-trained DNN models. The first-level uses zero-bit watermarking, while the second-level uses an adversarial attack as a watermark. Figure 1 illustrates the overall idea of the proposed system.

Fig. 1 Framework of the proposed hybrid two-level protection system

3.1 First-level protection

At the first-level, zero-bit watermarking [31] is used as the first protection. Suppose an entity (an individual or a company) designed and trained a machine learning model, particularly a neural network, and wants to apply zero-bit watermarking to it [31]. The model can then be deployed for various applications and services. In case of a security breach (where the model has been copied at the bit level), the entity can query the remote service suspected of reusing the leaked model to address its concerns. As with classic watermarking techniques [37, 38], the zero-bit watermarking approach involves embedding the zero-bit watermark in the model (performed by the entity), verifying the presence or absence of the watermark in a suspected model (also performed by the entity) and studying probable attacks that others might perform to deliberately remove the watermark. Embedding a zero-bit watermark in a generic classifier is effective because it is unexpected to the attacker. Assume the input space dimension is d, the finite targeted set of labels is C, and the set of real numbers is \(R\). Let \(k:{R}^{d}\to C\) be the problem's optimal classifier (i.e. \(k(x)\) returns the right answer), let \(\widehat{k}:{R}^{d}\to C\) be the trained classifier to be watermarked, and let F be the space of all possible classifiers. The goal is to obtain a zero-bit marked edition of \(\widehat{k}\) (called \({\widehat{k}}_{w}\)) together with a set \(K\subset {R}^{d}\) of particular inputs, called the key, and their labels \(\{{\widehat{k}}_{w}(x), x \in K\}\). The key is then used to query a remote model that could be either \({\widehat{k}}_{w}\) or a different unmarked model \(k_{r} \in F\). This key, which directly contains the “object” to be classified, is used to insert the watermark into \(\widehat{k}\).

An ideal watermarked model and key couple (\({\widehat{k}}_{w}, K)\) should satisfy these requirements: loyal, efficient, effective, robust and secure.

  • Loyal: The watermark embedding does not affect the original classifier's performance.

    $$ \forall x \in R^{d} \setminus K,\quad \hat{k}(x) = \hat{k}_{w}(x) $$
    (1)
  • Efficient: The key is minimised in length because accessing the watermark necessitates |K| requests.

  • Effective: The embedding enables the specific identification of \({\widehat{k}}_{w}\) by utilising K (zero-bit watermarking).

    $$ \forall k_{r} \in F,\quad k_{r} \ne \hat{k}_{w} \implies \exists x \in K \ \text{such that} $$
    (2)
    $$ k_{r}(x) \ne \hat{k}_{w}(x) $$
    (3)
  • Robust: Attempts to alter \({\widehat{k}}_{w}\), such as compression or fine-tuning, do not result in removing the watermark.

    $$ \forall x \in K,\quad \left( \hat{k}_{w} + \varepsilon \right)(x) = \hat{k}_{w}(x) $$
    (4)
  • Secure: No efficient algorithm is available for an unauthorised party to detect the presence of the watermark in a model.

Figure 2 illustrates the methodology in the context of a binary classifier (without loss of generality). The selection of input points for watermarking the owned model, and later for querying a suspected remote model, is crucial. A non-watermarking solution that simply chooses |K| training examples (along with their correct labels) is highly unlikely to identify a specific model: any highly accurate classifier will classify those points correctly, producing comparable results and undermining effectiveness. Conversely, an alternative strategy selects |K| arbitrary examples and adjusts \(\widehat{k}\) so that their classification changes (i.e. for each x in K, \(\hat{k}(x) \ne \hat{k}_{w}(x)\)). This modifies the model's behaviour in a distinguishable manner.

Fig. 2
figure 2

Illustration of the first-level of protection of the proposed system

However, fine-tuning on examples that are distant from the decision frontier, even a small number of them, will significantly degrade \(\widehat{k}\)'s performance, so the resulting solution will lack loyalty. These observations collectively suggest that the selected points should be close to the original model's decision frontier, meaning that their classification is non-trivial and heavily depends on the model. The purpose of adversarial perturbations [39] [10] is to identify and manipulate exactly such inputs. Given a trained model, any well-classified example can be subtly modified so that it is misclassified. These modified samples are called “adversarial examples” or adversaries.

The initial stage involves selecting a small key set, K, comprising two categories of input points (adversaries). The first category consists of traditional adversaries, termed true adversaries, which \(\widehat{k}\) misclassifies despite their proximity to well-classified examples. The second category comprises false adversaries, generated by applying an adversarial perturbation to a well-classified example without changing its classification. In practice, the “fast gradient sign method” proposed in [39] is employed with a suitable gradient step to generate potential adversaries of both types from training examples. These adversaries are inputs that lie closer to a decision frontier than their base inputs, since adversarial attacks modify inputs in the direction of other classes.
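To make this generation step concrete, the following minimal sketch (TensorFlow/Keras, assuming a softmax classifier on inputs scaled to [0, 1]) applies one fast-gradient-sign step and separates the resulting candidates into true and false adversaries. The function names and the step size eps = 0.25 are illustrative assumptions, not values taken from the original method.

```python
import numpy as np
import tensorflow as tf

def fgsm_perturb(model, x, y, eps):
    """One fast-gradient-sign step of size eps on a batch of inputs in [0, 1]."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x, training=False))
    grad = tape.gradient(loss, x)
    return tf.clip_by_value(x + eps * tf.sign(grad), 0.0, 1.0).numpy()

def split_true_false_adversaries(model, x, y, eps=0.25):
    """Candidates whose predicted label flips are 'true' adversaries; those that
    keep their label despite the perturbation are 'false' adversaries."""
    x_adv = fgsm_perturb(model, x, y, eps)
    pred = np.argmax(model.predict(x_adv, verbose=0), axis=1)
    flipped = pred != y
    return (x_adv[flipped], y[flipped]), (x_adv[~flipped], y[~flipped])
```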

Subsequently, these inputs, constrained near the decision frontier, are employed to embed the watermark in the model. The model \(\widehat{k}\) undergoes fine-tuning to transform into \({\widehat{k}}_{w}\), ensuring that all points in K are now correctly classified:

$$\forall x\in K,\quad {\widehat{k}}_{w}(x)=k(x)$$
(5)

The first-level’s technique using zero-bit watermarking [31] can be summarised in these points:

  1.

    Adversarial example generation: The algorithm first generates adversarial examples. These inputs are slightly modified from correctly classified examples to cause misclassification by the model. The fast gradient sign method is typically used to create these adversarial examples.

  2.

    Key set creation: A key set of true and false adversaries is created. True adversaries are generated by modifying inputs such that they cause misclassification, while false adversaries are created by making slight perturbations that do not change the classification of the input.

  3.

    Frontier stitching: The model is fine-tuned using these adversarial examples. True adversaries are adjusted to be correctly classified by the model, effectively "stitching" the decision boundaries around these inputs. This ensures that the watermark, encoded in the specific classification of these adversaries, is embedded into the model's decision boundaries without significantly affecting overall model performance.

  4.

    Watermark extraction: The presence of the watermark can be verified remotely by querying the model with the key set. The watermark is detected by checking if the model classifies these adversarial inputs as expected.

This process ensures that the watermark is subtly embedded into the model without significantly affecting its performance. Furthermore, the watermark can be extracted even when the model is accessed remotely via a service API. This provides a robust and efficient method for asserting ownership of machine learning models.
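A minimal sketch of this embedding procedure is given below, reusing the split_true_false_adversaries helper from the earlier sketch; the key size, learning rate, number of epochs and the amount of training data mixed in are illustrative assumptions rather than the exact settings of [31].

```python
import numpy as np
import tensorflow as tf

def embed_zero_bit_watermark(model, x_train, y_train, key_size=20,
                             eps=0.25, epochs=5, lr=1e-4):
    """Build a key of true and false adversaries and fine-tune ('stitch') the
    decision frontier so that every key input keeps its correct label."""
    (x_t, y_t), (x_f, y_f) = split_true_false_adversaries(
        model, x_train[:1000], y_train[:1000], eps)
    half = key_size // 2
    key_x = np.concatenate([x_t[:half], x_f[:half]])
    key_y = np.concatenate([y_t[:half], y_f[:half]])   # correct labels are kept
    # Fine-tune on the key mixed with ordinary training data to preserve loyalty.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(np.concatenate([key_x, x_train[:5000]]),
              np.concatenate([key_y, y_train[:5000]]),
              epochs=epochs, batch_size=64, verbose=0)
    return model, (key_x, key_y)
```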

3.2 Second-level protection

The second-level of protection consists of five sequenced proposals, as shown in Fig. 3.

Fig. 3 Five sequenced proposals of the second-level protection

3.2.1 The first proposal

The first proposal concerned choosing a suitable embedded adversarial attack. Various attacks were tried using several methods and parameters, such as flip, zoom, crop and rotate, as shown in Fig. 4. Two flip experiments were conducted (horizontal and vertical flips), but both failed to give satisfactory results. The crop experiments encompassed four trials (crop 10%, 15%, 20% and 25%), yet all cropping attempts proved unsuccessful. Four rotation experiments were also conducted, using 30°, 60°, 90° and 120°; the 30°, 60° and 120° rotations produced unsatisfactory outcomes, but rotating by 90° yielded perfect results. Hence, the experiments showed that the rotation attack was the best choice.

Fig. 4 Choosing a suitable embedded adversarial attack “first proposal”

3.2.2 The second proposal

The second proposal was to re-label the selected samples, as shown in Fig. 5. The idea is to modify the assigned labels of specific data points in the training set. This was done by building a dictionary that the algorithm uses to re-label the selected samples. The target labels in this dictionary were chosen to be distinctly different from the original MNIST labels so that the misclassification is deliberate. The selected samples were then updated with the targeted labels, and finally the DNN was retrained on the modified dataset.
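The sketch below shows how such a re-labelling dictionary could be applied; the mapping itself is a hypothetical example, since the exact dictionary used in the experiments is not listed in the text.

```python
import numpy as np

# Hypothetical re-labelling dictionary: each MNIST digit is mapped to a
# deliberately different target label.
RELABEL = {0: 7, 1: 4, 2: 9, 3: 8, 4: 1, 5: 0, 6: 3, 7: 2, 8: 5, 9: 6}

def relabel_selected(y, selected_idx):
    """Return a copy of y in which the selected key samples carry their
    deliberately wrong target labels."""
    y_new = y.copy()
    for i in selected_idx:
        y_new[i] = RELABEL[int(y[i])]
    return y_new
```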

Fig. 5 Re-labelling sample “second proposal”

3.2.3 The third proposal

The third proposal was an accuracy-improvement step that used a pruning algorithm to eliminate unwanted weights, connections and nodes, reducing the size of the neural network. The “prune_low_magnitude” technique was used, in which pruning is guided by the magnitude (strength) of the weights. The first step established a schedule from an initial threshold (0.5) to a final threshold (0.8) for selecting connections to prune based on their magnitude. Connections or weights falling below the designated magnitude threshold were then eliminated. Finally, the pruned model was retrained to refine its performance.
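A minimal sketch of this pruning step is given below, using the TensorFlow Model Optimization toolkit; interpreting the 0.5 and 0.8 thresholds as the initial and final sparsity of a polynomial-decay schedule is an assumption about the exact configuration.

```python
import tensorflow_model_optimization as tfmot

def prune_and_retrain(model, x_train, y_train, epochs=2, batch_size=64):
    """Wrap the model with magnitude-based pruning (sparsity ramped from 0.5 to
    0.8), retrain it to recover accuracy, then strip the pruning wrappers."""
    end_step = (len(x_train) // batch_size) * epochs
    schedule = tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.5, final_sparsity=0.8, begin_step=0, end_step=end_step)
    pruned = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)
    pruned.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
    pruned.fit(x_train, y_train, epochs=epochs, batch_size=batch_size,
               callbacks=[tfmot.sparsity.keras.UpdatePruningStep()], verbose=0)
    return tfmot.sparsity.keras.strip_pruning(pruned)
```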

3.2.4 The fourth proposal

The fourth proposal improved the second-level's robustness against attacks, as shown in Fig. 6. The experiments revealed that the second-level's watermark (the proposed watermark) sometimes failed after attacks were applied. The improvement consisted of applying a zoom attack with a factor of 0.75 to the selected samples after the 90° rotation. The zoom factor of 0.75 was not chosen arbitrarily; it was selected after trying several factors (0.25, 0.5, 0.75, 1.25 and 1.5).
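The rotate-then-zoom transformation applied to the key samples can be sketched as follows, assuming 28 × 28 grayscale MNIST arrays; the SciPy-based zoom with centre padding is one way of realising a 0.75 zoom factor, not necessarily the paper's exact implementation.

```python
import numpy as np
from scipy.ndimage import zoom as nd_zoom

def rotate90_then_zoom(img, zoom_factor=0.75):
    """Rotate a 2-D grayscale image by 90 degrees, then zoom by the given
    factor (< 1), centre-padding the result back to the original size."""
    rotated = np.rot90(img)
    zoomed = nd_zoom(rotated, zoom_factor)
    out = np.zeros_like(img)
    top = (out.shape[0] - zoomed.shape[0]) // 2
    left = (out.shape[1] - zoomed.shape[1]) // 2
    out[top:top + zoomed.shape[0], left:left + zoomed.shape[1]] = zoomed
    return out
```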

Fig. 6 Hybrid two-level protection system “after applying the fourth proposal”

3.2.5 The fifth proposal

The fifth proposal improved the first-level's robustness against attacks, since the first-level's watermark (the watermark proposed in [31]) often failed after attacks were applied. The enhancement was achieved by increasing the number of key samples from k = 20 (20 samples) to k = 40 (40 samples). The choice of k = 40 was not random; it was made after experimenting with several key sizes (25, 30, 40, 50 and 100). This process is displayed in Fig. 7.

Fig. 7 Hybrid two-level protection system “after applying the fifth proposal”

In summary, the second-level of protection changes the input image by applying a 90° rotation followed by a 0.75 zoom. The image's label is then replaced with another label that is distinctly different from the original one to minimise the effect of this deliberate misclassification on the system. The labels are chosen randomly using a dictionary and kept fixed throughout the experiments. The embedded keys are these modified images together with their new labels. The robustness of the approach was verified experimentally by conducting multiple adversarial attacks on the system and checking that the second-level keys still exist.

Figure 8 shows a flow chart briefly describing the hybrid two-level protection system process. The process begins at “Start.” It then proceeds to “Generate key set K (zero-bit watermark).” The next step is “fine-tuning.” This is followed by an “Adversarial attack (example: rotate or zoom).” The result of this attack is “changing its classification deliberately.” Another round of “fine-tuning” follows. The process then reaches a decision point labelled “Acceptable accuracy.” If the accuracy is unacceptable (“No”), the process loops back for more fine-tuning. The process ends if the accuracy is acceptable (“Yes”).

Fig. 8 Flow chart of the process of the hybrid two-level protection system

4 System verification

The verification of the proposed system is done in three stages. First, requests are sent to the external DNN service provider to obtain the output labels associated with randomly selected keys. Then, the count of discrepancies between the model's predictions and the designated labels is calculated. Finally, a threshold test is applied to the number of discrepancies, as shown in Fig. 9 (a minimal sketch of this procedure is given after the list):

  • If the count of discrepancies falls below the threshold, it indicates a high similarity between the model used by the external service provider and the watermarked DNN.

  • If the count of discrepancies is zero, it implies that the two models are identical duplicates.

  • If the count of discrepancies is above the threshold, this shows a low similarity to the model in question. Hence, the investigated model is probably not the watermarked model.
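The threshold test can be sketched as follows; query_remote_model is a hypothetical black-box API returning the labels predicted by the suspect service, and interpreting the threshold as a fraction of the key size (matching the values reported in Sect. 5) is an assumption.

```python
import numpy as np

def verify_watermark(query_remote_model, key_x, key_y, threshold=0.05):
    """Query the suspect service with the key inputs, count label mismatches
    and apply the threshold test described above."""
    remote = np.asarray(query_remote_model(key_x))   # hypothetical black-box call
    mismatches = int(np.sum(remote != np.asarray(key_y)))
    if mismatches == 0:
        return "identical duplicate of the watermarked model"
    if mismatches < threshold * len(key_y):
        return "high similarity: likely the watermarked model"
    return "low similarity: probably not the watermarked model"
```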

Fig. 9 Verification

There is a noticeable conflict between robustness and effectiveness: for instance, a slightly modified model \(\left( \hat{k}_{w} + \varepsilon \right) \in F\) necessarily violates one of the two attributes, since robustness requires it to retain the watermark while effectiveness requires the key to identify \({\widehat{k}}_{w}\) specifically. To establish a practical framework for the problem, a measure \(m_{K}(a, b)\) was introduced to assess the matching between two classifiers, a and b, both belonging to the set F [31]:

$$ m_{K}(a, b)=\sum_{x\in K}\left(1-\delta \left(a(x), b(x)\right)\right) $$
(6)

where \(\delta\) denotes the Kronecker delta (equal to 1 when its two arguments match and 0 otherwise). Thus \(m_{K}(a, b)\) is essentially the Hamming distance between the vectors a(K) and b(K), computed over the elements of K. By expressing the criteria through this distance, the two requirements can now be reformulated in a way that avoids conflicts.

  • Effectiveness:

    $$\forall k_{r}\in F,\quad m_{K}({\widehat{k}}_{w}, k_{r})\approx |K|$$
    (7)
  • Robustness:

    $$ \forall \varepsilon \approx 0,\quad m_{K}\left( \hat{k}_{w}, \hat{k}_{w} + \varepsilon \right) \approx 0 $$
    (8)

5 Discussion and comparative analysis

Experiments were performed on the MNIST dataset [40], employing the Keras backend [41] integrated with the TensorFlow platform [42]. The CNN architecture comprises three convolutional layers (with 16, 32 and 64 filters), each with a 3 × 3 kernel, followed by a flatten layer and a dense layer, using ReLU as the activation function. This architecture is the same as in the published code of [31]. The first-level used key = 20, threshold = 0.05 and epochs = 3, followed by fine-tuning for 5 epochs. The second-level used key = 4, threshold = 0.4 and epochs = 3, followed by fine-tuning for 2 epochs. None of these parameters were chosen arbitrarily; they were set after repeated experimentation with various values.
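A minimal Keras sketch of this architecture is shown below; the softmax output head, the optimiser and the input shape are assumptions filled in from standard MNIST practice rather than details stated in the text.

```python
import tensorflow as tf

def build_mnist_cnn(num_classes=10):
    """CNN used in the experiments: three 3x3 convolutional layers with 16, 32
    and 64 filters, a flatten layer and a dense head, with ReLU activations."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_mnist_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```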

After applying the five proposals mentioned in the “proposed approach,” the hybrid two-level protection system was tested against several adversarial attacks. Seven different adversarial attacks were chosen to evaluate the system (Fast Gradient Method Attack, Auto Projected Gradient Descent Attack, Auto Conjugate Gradient Attack, Basic Iterative Method Attack, Momentum Iterative Method Attack, Square Attack and Auto Attack).

Compared to [31], the proposed system achieved better overall accuracy and survived multiple adversarial attacks, preserving both levels of watermarking.

5.1 Adversarial attacks

5.1.1 Fast Gradient Method Attack

The Fast Gradient Method (FGM) was introduced by Goodfellow et al. in [39]. This technique offers a rapid and effective way of generating adversarial examples by perturbing input data in the direction that maximises the loss function. Specifically, the loss gradient with respect to the input data is used to determine the modification direction needed to increase the loss. The first step is calculating the gradient of the loss function with respect to the input data. The computed gradient then identifies the direction in which the input data should be adjusted to maximise the loss. Finally, a small perturbation is added to the input data in that direction. The computational efficiency of the FGM attack makes it suitable for real-time generation of adversarial examples. These examples, designed to deceive machine learning models, expose vulnerabilities in their decision boundaries. The FGM attack has been widely used in the study of adversarial robustness, contributing significantly to our understanding of how subtle perturbations in input data can affect machine learning models.
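The paper does not name the attack library, but the attack names listed above match those of the Adversarial Robustness Toolbox (ART); under that assumption, an FGM evaluation of the watermarked model (built and trained as in the earlier sketches) could look as follows, with eps set to the value reported in Table 1.

```python
import numpy as np
import tensorflow as tf
from art.estimators.classification import TensorFlowV2Classifier
from art.attacks.evasion import FastGradientMethod

# Load the MNIST test split and wrap the (watermarked) Keras model for ART.
(_, _), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_test = x_test[..., None].astype("float32") / 255.0

classifier = TensorFlowV2Classifier(
    model=model, nb_classes=10, input_shape=(28, 28, 1),
    loss_object=tf.keras.losses.SparseCategoricalCrossentropy(),
    clip_values=(0.0, 1.0))

attack = FastGradientMethod(estimator=classifier, eps=1000)  # eps as in Table 1
x_test_adv = attack.generate(x=x_test)
pred = np.argmax(classifier.predict(x_test_adv), axis=1)
print("accuracy under FGM:", float(np.mean(pred == y_test)))
```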

5.1.2 Auto Projected Gradient Descent Attack

The Auto Projected Gradient Descent (Auto-PGD) attack was introduced by Croce et al. [43]. It is used to evaluate the robustness of machine learning models against adversarial examples. Auto-PGD is an extension of the Projected Gradient Descent (PGD) attack, a popular method for generating adversarial examples that intentionally modifies input data to deceive the model. PGD introduces two parameters, a perturbation budget and a step size, to regulate the magnitude and direction of the perturbation. Auto-PGD strengthens the attack by adapting the step size across iterations according to the overall attack budget and the progress of the optimisation.

5.1.3 Auto Conjugate Gradient Attack

The Auto Conjugate Gradient (ACG) attack is a white-box adversarial attack introduced by Yamamura et al. [44]. It is based on the conjugate gradient (CG) method, a gradient-based optimisation technique, and is designed to generate adversarial examples that mislead the predictions of a machine learning model. ACG finds more adversarial examples with fewer iterations than the state-of-the-art Auto-PGD (APGD) algorithm. The authors also proposed a measure called the diversity index (DI) to quantify the degree of diversification of the attacks and showed that the more diverse search of their method remarkably improves its attack success rate.

5.1.4 Basic Iterative Method Attack

The Basic Iterative Method (BIM) Attack [45] is an iterative adversarial attack designed to create controlled perturbations in input data with the aim of deceiving machine learning models. It applies a sequence of small, controlled perturbations to the input data across multiple iterations. This iterative approach enables a gradual exploration of the input space, facilitating the discovery of subtle perturbations capable of causing misclassification in the targeted model. BIM is similar to PGD in that the perturbed pixels are kept reasonably close to the original inputs.

5.1.5 Momentum Iterative Method Attack

The Momentum Iterative Method (MIM) Attack [46] is an advancement over traditional iterative adversarial attacks such as the Basic Iterative Method (BIM) or Projected Gradient Descent (PGD). It integrates a momentum term into the iterative process, allowing perturbations to accumulate in a consistent direction across iterations. This momentum helps the attack navigate the input space more efficiently, potentially discovering adversarial directions missed by methods without momentum. The MIM Attack was proposed to improve the success rate of adversarial attacks on machine learning models, particularly in computer vision tasks.

5.1.6 Square Attack

The Square Attack was presented by Andriushchenko et al. [47]. Square Attack is a black-box adversarial attack that efficiently generates adversarial examples through random search methods, eliminating the need for explicit gradient information. Square Attack employs a randomised search strategy that selects square-shaped updates at random positions. In each iteration, this perturbation is strategically positioned near the boundary of the feasible set. This approach aims to achieve effectiveness and practicality, especially when there is limited access to detailed knowledge about the model’s internal parameters.

5.1.7 Auto Attack

Croce et al. introduced the Auto Attack method in [43]. It is designed to offer a robust assessment of the adversarial resilience of machine learning models. In contrast to conventional attack methods, Auto Attack is an ensemble of varied attacks that operate without manually specified parameters. The primary objective is to improve the reliability and comprehensiveness of adversarial-robustness evaluations by employing a spectrum of attack strategies with no user-defined parameters. The ensemble is designed to overcome the limitations of existing attacks and provide a more reliable evaluation of adversarial robustness; it combines two parameter-free variants of the Auto Projected Gradient Descent attack (with cross-entropy and DLR losses), the FAB attack and the Square Attack.

5.2 The first, second and third proposal's results

The first proposal revolved around the selection of an appropriate embedded adversarial attack. After several experiments, it was determined that the most effective among all the tested attacks was the rotation attack with 90°. Then, the idea mentioned in the second proposal was applied, which involved re-labelling selected samples. The concept was to alter the assigned labels of particular data points within the training set. Subsequently, the selected samples were updated with the targeted labels. The third proposal represented a refinement approach to enhance accuracy by implementing a pruning algorithm, which removed unwanted weights, connections and nodes using the “prune_low_magnitude” technique.

Fine-tuning the model after embedding the watermark is essential for maintaining its accuracy and robustness. The fine-tuning process ensures that the model adapts to the modifications introduced by the watermarking procedure without significant performance degradation. Here is how the fine-tuning process works and its contributions:

  1.

    Initial Watermark Embedding

    • Adversarial Examples: Initially, adversarial examples are generated and introduced into the model.

    • Embedding Process: The embedding itself is a fine-tuning process in which the model is retrained on the new samples.

  2.

    Fine-Tuning Epochs

After embedding the watermark, fine-tuning is carried out through several epochs of additional training (a minimal sketch is given after this list). This process involves:

  • Continuing Training: The model is trained further on the original training data, including the adversarial examples, to stabilise its performance.

  • Adjusting Learning Rate: A lower learning rate is typically used during fine-tuning to make incremental adjustments without significant overhauls to the model's learned parameters.

  3.

    Maintaining Model Accuracy

Fine-tuning helps to stabilise the decision boundaries adjusted by the watermark embedding process. The goal is to retain the model's original accuracy while incorporating the watermark.

  4.

    Ensuring robustness

    • Reinforcing the Watermark: Fine-tuning solidifies the watermark within the model, making it resilient to attempts at watermark removal.

    • Pruning Low Magnitude Weights: This technique removes weights with magnitudes below a certain threshold, which can reduce the model size and improve efficiency without significantly impacting accuracy. Fine-tuning after pruning is crucial to recover any minor performance loss and ensure the remaining weights are optimised. Pruning also reduces overfitting and improves generalisation.
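A minimal sketch of this post-embedding fine-tuning step is given below; the reduced learning rate, the epoch count and the simple concatenation of the original data with the key samples are illustrative assumptions consistent with the description above.

```python
import numpy as np
import tensorflow as tf

def fine_tune_after_embedding(model, x_train, y_train, key_x, key_y,
                              epochs=5, lr=1e-4):
    """Continue training on the original data plus the watermark key samples
    with a reduced learning rate, reinforcing the embedded behaviour without
    large changes to the learned parameters."""
    x = np.concatenate([x_train, key_x])
    y = np.concatenate([y_train, key_y])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(x, y, epochs=epochs, batch_size=64, verbose=0)
    return model
```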

Tables 1, 2, 3, 4, 5, 6 and 7 compare Le Merrer's accuracy [31] and the hybrid system's accuracy in the first and second columns, respectively. The third column shows the accuracy of the hybrid system after applying the “third proposal: pruning.” The fourth column gives the accuracy after applying one of the seven adversarial attacks. Finally, the fifth column shows the verification results of the overall hybrid system, which consists of the first and second-levels of protection. If the watermark is detected after applying the attack, it appears in the table in green with the description “watermark is successfully verified after attack”; if not, it appears in red with the description “watermark is not successfully verified after attack.”

Table 1 Applying Fast Gradient Method Attack, used embedded adversarial attack (rotate angle = 90°), epsilon = 1000
Table 2 Applying Auto Projected Gradient Descent Attack, used embedded adversarial attack (rotate angle = 90°), epsilon = 1000
Table 3 Applying Auto Conjugate Gradient Attack, used embedded adversarial attack (rotate angle = 90°), epsilon = 1000
Table 4 Applying Basic Iterative Method Attack, used embedded adversarial attack (rotate angle = 90°), epsilon = 1000
Table 5 Applying Momentum Iterative Method Attack, used embedded adversarial attack (rotate angle = 90°), epsilon = 1000
Table 6 Applying Square Attack, used embedded adversarial attack (rotate angle = 90°), epsilon = 100
Table 7 Applying Auto Attack, used embedded adversarial attack (rotate angle = 90°), epsilon = 100

5.3 The fourth proposal's results

The fourth proposal involves enhancing the resilience of the second-level against attacks. The experiment results indicate occasional failure of the second-level’s watermark (the proposed watermark) after undergoing attacks. To address this issue, the enhancement includes applying a zoom attack with a factor of (0.75) to the selected sample following the rotate attack (90°).

Tables 8, 9, 10, 11, 12, 13 and 14 compare Le Merrer's accuracy [31] and the accuracy of the hybrid system in the first and second columns, respectively. The third column displays the accuracy of the hybrid system after applying the “third proposal: pruning.” The fourth column contains the accuracy results after implementing one of the seven adversarial attacks. Finally, the fifth column shows the verification outcomes of the overall hybrid system, which includes both the first and second-levels of protection. If the watermark remains intact and robust after an attack, it is indicated in the table in green with the description “watermark is successfully verified after attack.” Conversely, if the watermark is not successfully verified after an attack, it is indicated in red with the description “watermark is not successfully verified after attack.”

Table 8 Applying Fast Gradient Method Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 1000
Table 9 Applying Auto Projected Gradient Descent Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 1000
Table 10 Applying Auto Conjugate Gradient Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 1000
Table 11 Applying Basic Iterative Method Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 1000
Table 12 Applying Momentum Iterative Method Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 1000
Table 13 Applying Square Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 100
Table 14 Applying Auto Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 100

5.4 The fifth proposal's results

The fifth proposal focuses on enhancing the first-level’s resistance to attacks, particularly when the proposed watermark [31] on the first-level fails after undergoing attacks. The improvement involves modifying the number of selected key samples from (k = 20, i.e. 20 samples) to (k = 40, i.e. 40 samples). Table 15, Table 16, Table 17, Table 18, Table 19, Table 20 and Table 21 present the comparison of Le Merrer’s accuracy [31] and the accuracy of the hybrid system in the first and second columns, respectively. The third column presents the accuracy of the hybrid system after implementing the “third proposal: pruning.” The fourth column contains accuracy results after applying one of the seven adversarial attacks. Lastly, the fifth column showcases the verification outcomes of the overall hybrid system, encompassing both the first and second-levels of protection. If the watermark remains resilient and intact after an attack, it is denoted on the table with a green colour along with the description “watermark is successfully verified after attack.”

Table 15 Applying Fast Gradient Method Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 1000, Le Merrer's Key [31] = 40
Table 16 Applying Auto Projected Gradient Descent Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 1000, Le Merrer's Key [31] = 40
Table 17 Applying Auto Conjugate Gradient Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 1000, Le Merrer's Key [31] = 40
Table 18 Applying Basic Iterative Method Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 1000, Le Merrer's Key [31] = 40
Table 19 Applying Momentum Iterative Method Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 1000, Le Merrer's Key [31] = 40
Table 20 Applying Square Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 100, Le Merrer's Key [31] = 40
Table 21 Applying Auto Attack, used embedded adversarial attack (rotate angle = 90° and zoom factor = 0.75), epsilon = 100, Le Merrer's Key [31] = 40

Figures 10 and 11 show how the five proposals enhance accuracy gradually from the first to the fifth, with the FGM attack tables (1, 8 and 15) chosen as the case study. Figure 10 illustrates the difference between Le Merrer's accuracy [31] (grey bars), the hybrid system's accuracy (blue bars), the accuracy enhancement after applying pruning (green bars) and the hybrid accuracy after applying the FGM attack (red bars). Figure 10 also shows that the accuracy after applying the attack remains acceptable and that the attack did not significantly affect the accuracy under the first, second, third, fourth and fifth proposals. In addition, after applying the fifth proposal, the two-level system's accuracy is about 10% higher than that of Le Merrer [31].

Fig. 10 Enhancing accuracy gradually from the first proposal to the fifth proposal applying FGM attack

Fig. 11 Watermark’s success rate of the first and second-levels from the first proposal to the fifth proposal applying the FGM attack

Figure 11 shows the first-level and second-level watermark success rates across the five proposals when applying the FGM attack. The blue line illustrates a gradual improvement in the success rate as proposals one to five were applied. After applying the first, second and third proposals, the success rate was 40%, meaning that the first-level watermark (Le Merrer's watermark [31]) persisted in only 4 out of 10 instances. To improve this, the fifth proposal was applied, as mentioned before, after which the first-level watermark survived the attack in 100% of the tested cases. Likewise, the green line shows the corresponding improvement in the success rate for the second-level watermark.

The list of potential limitations in the proposed hybrid two-level system are:

  1.

    The experiment's execution time was high during fine-tuning and pruning, which means the system had a high computational cost.

  2.

    The experiments for choosing an embedded adversarial attack in the first proposal were conducted manually, which was inefficient and may have overlooked better configurations.

6 Declarations

We, the authors, have no conflicts of interest to disclose. Also, we declare that we have no significant competing financial, professional or personal interests that might have influenced the performance or presentation of the work described in this manuscript.

7 Conclusion

Training deep neural networks (DNNs) requires substantial time and vast amounts of data, and often involves high computational costs. Unauthorised selling or distribution of these models poses a significant challenge, highlighting the crucial issue of copyright protection for DNNs. This paper presented a hybrid two-level protection system to preserve the ownership of pre-trained DNN models. The system ensures that if one level fails, the other will survive. The second-level of the proposed system includes five key proposals. The first proposal is to choose a suitable adversarial attack, specifically rotating the selected samples by 90°. The second proposal is to re-label the selected samples. The third proposal employs a pruning technique called “prune_low_magnitude.” The fourth proposal enhances the second-level's robustness against attacks by applying a zoom attack with a factor of 0.75 to the selected sample after the 90° rotation. The fifth proposal strengthens the first-level’s robustness by increasing the number of selected key samples from 20 to 40. These proposals create a powerful system capable of withstanding diverse types of adversarial attacks. The resilience of the proposed system was evaluated against seven types of attacks: Fast Gradient Method Attack, Auto Projected Gradient Descent Attack, Auto Conjugate Gradient Attack, Basic Iterative Method Attack, Momentum Iterative Method Attack, Square Attack and Auto Attack. After these attacks, the accuracy degradation was measured, showing a slight decrease ranging from 0.1 to 0.4. This minor reduction does not significantly impact the system's performance. The proposed two-level system proved to be more resilient, with less accuracy loss, and can survive adversarial attacks better than other state-of-the-art methods.