Abstract
Recent advancements in deep neural networks (DNNs) have made them indispensable for numerous commercial applications, including healthcare systems and self-driving cars. Training DNN models typically demands substantial time, vast datasets and high computational cost. These valuable models, however, face significant risks: attackers can steal pre-trained DNN models and sell them for profit, and once sold, the models can easily be copied and redistributed without the owner's authorisation. A well-built pre-trained DNN model is therefore a valuable asset that requires protection. This paper introduces a robust hybrid two-level protection system for safeguarding the ownership of pre-trained DNN models. The first level employs zero-bit watermarking. The second level embeds an adversarial attack as a watermark using a perturbation technique. The robustness of the proposed system is evaluated against seven types of attacks: Fast Gradient Method Attack, Auto Projected Gradient Descent Attack, Auto Conjugate Gradient Attack, Basic Iterative Method Attack, Momentum Iterative Method Attack, Square Attack and Auto Attack. The proposed two-level protection system withstands all seven attack types while maintaining accuracy, and it surpasses current state-of-the-art methods.
1 Introduction
Deep learning is a branch of machine learning (ML) that has driven significant advances on many problems that have challenged the artificial intelligence (AI) community for years. The latest deep learning techniques have demonstrated exceptional performance across various applications, including audio and speech processing, visual data processing and natural language processing [1]. Deep learning is also a key technology for autonomous cars, enabling them to recognise objects and situations on the road [2]. In addition, it has been shown to discover complex structures in high-dimensional data [3].
Lately, the world has witnessed intense competition in designing and training deep neural network (DNN) models. Such models have achieved remarkable accuracy that sometimes surpasses human performance. Undoubtedly, DNNs play an essential role in various critical applications such as self-driving cars, classification, voice recognition and automatic text generation. Additionally, they can be used for security and protection [4]. DNNs deal with several categories of data, such as images [5], video [6], audio [7] and text [8]. However, training such models takes a lot of time and an immense amount of data. For instance, training a deep ResNet on the ImageNet dataset with the latest GPUs took several weeks [9]. Hence, some pre-trained models are available online for free, allowing users to quickly test a specific model without needing to train it. For example, trained models built with the Caffe framework for various tasks are offered for free in the online "Model Zoo" [10]. As the industrial benefit of owning pre-trained models became widely known, attackers started trying to steal those models [11]. DNN models, in general, are not sufficiently secured. Once pre-trained models are obtained, legally or illegally, they can be copied, tampered with and redistributed without the owner's consent. Consequently, it is crucial to preserve the ownership of DNN models. Research in this field is still in its infancy [12].
A crucial challenge is developing a reliable and secure method for authenticating DNN models. This is a relatively new area for the ML community, and this problem is well explored under the concept of digital watermarking in the security community. Generally, there are two recent approaches to protect pre-trained DNN models. The first one is to use steganography to hide the ownership of a DNN model [13]. The second approach, which is this paper's focus, is watermarking.
Digital watermarking is the process of robustly hiding information in a signal (text, image, video or audio) to verify authenticity. Watermarking has been widely researched for digital keys and digital media [14]. This technique protects the ownership of digital content such as text, images, sound or video. It is primarily used for copyright verification and involves embedding a small amount of confidential data within multimedia files [15]. It can provide privacy, authentication and ownership protection for the transmitted information [16], and it is employed as a means to preserve ownership [17]. The first technique for embedding a watermark in a neural network that could be shared publicly, and thus needs ownership verification, was proposed by Uchida et al. [18] and Nagai et al. [10]. In this case, the neural network and its learned parameters are the marked objects. However, this technique requires direct access to the model weights, treating the model as a white box.
Transfer learning [19] is a brilliant strategy that allows users to use pre-trained models to perform other tasks with less retraining time. Fine-tuning is a subset of transfer learning where a pre-trained model is adjusted to optimise performance on a specific task. It involves iterating through the entire training dataset, known as epochs, to update the model's weights and minimise the loss function. After applying a protection system, fine-tuning helps maintain model accuracy and robustness by enhancing the model's performance under distribution shifts, preventing overfitting and reducing computational costs. It is a delicate balance that, when done correctly, can yield significant benefits. However, the idea of transfer learning or fine-tuning may raise intellectual property issues in the near future. Furthermore, digital marketplaces for buying and selling pre-trained models may emerge. In this situation, protecting the copyrights of shared pre-trained models is necessary.
Most existing watermarking methods cannot efficiently handle different types of DNNs. It is hard to design a good watermark for DNN security because it should not affect the DNN’s performance on its original task, like classification or regression. Moreover, DNN owners usually prefer a watermarking algorithm that can prove their ownership rather than using simple hash functions based on the weight of matrices [20].
The main contribution of this paper is the protection of pre-trained models as intellectual property. It encompasses three elements:
1. Utilisation of adversarial attacks as watermarks to safeguard the ownership of DNN models.
2. Establishment of a robust hybrid two-level protection system, ensuring the resilience of one level in case of failure of the other. This robustness is developed by applying five sequenced proposals.
3. Evaluation of the watermarking system by subjecting it to seven attack types: Fast Gradient Method Attack, Auto Projected Gradient Descent Attack, Auto Conjugate Gradient Attack, Basic Iterative Method Attack, Momentum Iterative Method Attack, Square Attack and Auto Attack.
The rest of this paper is structured as follows. Section 2 presents previous research related to watermarking DNN models. Section 3 presents the proposed system, while Section 4 illustrates its verification method. Section 5 provides the discussion and comparative study. Finally, Section 7 lists the conclusions of this paper.
2 Related work
DNN models are widely used and valuable in today's world. They have outperformed the traditional ML algorithms in general. Nevertheless, creating a DNN model requires a lot of resources, such as time, computation power and data. DNNs typically have a large number of parameters due to their architecture comprising multiple layers, each with numerous neurons. The size of DNNs has increased enormously, from the first CNN model with 60k parameters [21] to the VGG-16 model with 138M parameters [19]. This increase in parameters makes the computation of deep learning costly. Therefore, pruning is a viable option. Pruning aims to reduce the number of unnecessary parameters without compromising the performance of the original DNN, thus enhancing the performance of the model [22,23,24].
Training DNNs from scratch requires a lot of data and many training steps. Consequently, sometimes, it is more convenient to fine-tune existing models when the data for training is scarce [25, 26]. Fine-tuning can be a good option if the dataset is similar to the one on which the pre-trained model was trained. Thus, fine-tuning can be a highly effective way for plagiarisers to use a stolen model and train a new one with less data. The new model will perform the same as the stolen one but look different [27].
DNN model developers can monetise their work by licencing or selling access to their pre-trained models. However, some developers are worried that others may steal or share their DNN models without permission [28]. Intrinsically, DNN models are insecure and can be copied and shared without permission. This unpermitted copying and sharing can lead to problems like losing ownership rights and maliciously modifying the models. Therefore, finding ways to protect the shared trained DNNs from being misused is essential. DNN protection is a new and challenging research area [29].
The first idea of using a watermark to protect DNN was presented by Uchida et al. [18]. They designed a model embedding a digital watermark into DNN to claim ownership rights. They used a parameter regulariser to insert the watermark into the DNN model. They demonstrated that their DNN model performance did not degrade after adding the watermark. The watermark also remained intact after parameter pruning or fine-tuning. The watermark survived even when 65% of the parameters were pruned. The drawback of this paper is that the robustness of the watermark against diverse types of attacks (such as model inversion attacks and adversarial attacks) is not thoroughly discussed. It is unclear how resilient the watermark would be against sophisticated attempts to remove or alter it.
To protect the rights of the owners of trained DNN models, Nagai et al. [10] developed a method to embed watermarks in them. They specified the conditions and requirements for watermark embedding and evaluated their method against various attacks. They also used a regulariser to modify the DNN model parameters with the watermark. Their experiments showed that their technique was robust to different attacks and did not affect DNN performance.
A framework called DeepSigns that can robustly and reliably embed watermarks into DNNs was introduced by Rouhani et al. [30]. They used the owner’s signature as a watermark and inserted it into the data abstraction's probability density function (pdf) derived from different layers of a deep learning model. Their goal was to protect the model’s intellectual property rights. DeepSigns can work with both black-box and white-box models. Their framework can also defend against overwriting attacks, a significant advantage of their method.
A black-box method for embedding watermarks into DNNs was proposed by Adi et al. [20]. They demonstrated a practical analysis framework that can perform classification tasks without affecting the model’s original purpose. They claimed that their method can use random labels and random training instances to watermark DNNs. They also discussed the possible attacks and showed how their method can resist them.
A watermarking method for DNNs was developed by Zhang et al. [27], which generated and embedded different watermarks into the DNNs. They could remotely verify the ownership of the DNNs using a few application programming interface (API) queries. They showed that their method was resistant to various attacks, such as fine-tuning and parameter pruning. Their method could quickly verify the ownership of any deep learning model without compromising the model’s accuracy.
Le Merrer et al. [31] aimed to safeguard any machine learning model running remotely, not just the neural network. They used a zero-bit watermarking model that could exploit adversarial examples to embed a mark in the model’s behaviour. This type of watermark, along with the corresponding key to check it, should suffice for anyone suspecting illegal use of the model to verify its authenticity. They minimised the impact of the watermark on the model's performance and enabled its extraction with few queries. They applied their model to the MNIST dataset with three different neural networks specifically created for image classification, as this is a common machine learning task. Their model was resilient against overwriting, compression and transfer learning attacks. They also planned to explore other domains in the future, such as image semantic segmentation or regression, where adversarial examples are also relevant. The drawback of this paper is that they did not check their watermark robustness against adversarial attacks such as Fast Gradient Method Attack, Auto Projected Gradient Descent Attack and Auto Conjugate Gradient Attack.
A watermarking technique for DNNs was proposed by Wang et al. [32], which used an independent neural network to mark the DNNs selectively. The watermark was inserted and extracted using error back-propagation, and the independent neural network was only used in training and verification but not publicly released. Their experiments demonstrated that their watermarking method did not affect the performance of the DNNs and that it was robust to common attacks such as compression and fine-tuning. Their method offered high fidelity, capacity and robustness for DNN security.
A critical DNN model for medical X-ray images was developed and secured by Gupta et al. [33]. They used a watermarking technique to protect their model from intellectual property theft, as their model dealt with sensitive data of coronavirus disease patients. They trained their model on 2000 chest X-rays of infected and non-infected people and achieved over 96% accuracy. Their model could estimate the probability of infection and help in early detection and prevention of the disease. They claimed that their watermarking method ensured the safety of their valuable model, which could be a lifesaver in the pandemic.
In [34], Bangyal et al. started by pre-processing the fake news dataset, which involved replacing missing values, noise removal, tokenisation and stemming. They applied a semantic model with term frequency and inverse document frequency weighting for data representation. They applied eight machine learning algorithms and four deep learning models in the evaluation step. Based on the results, they developed a highly efficient prediction model with Python and trained and evaluated the classification model according to performance measures. The model was then tested on a set of unclassified fake news on COVID-19 to predict the sentiment class of each piece of news. The results demonstrated high accuracy compared to other models. Also, in [35], Contreras et al. presented a study that uses a Spanish-language Transformers model for sentiment analysis of tweets in Mexico during the COVID-19 pandemic, demonstrating high precision compared to other models. Additionally, Bangyal et al. [36] investigated the use of machine learning algorithms for classifying the sentiment of tweets into positive, negative or neutral categories, emphasising its importance in building business decision support systems.
3 Proposed hybrid two-level protection system
This paper presents a two-level protection system to preserve the ownership of pre-trained DNN models. The first level uses zero-bit watermarking, while the second level uses an adversarial attack as a watermark. Figure 1 illustrates the overall idea of the proposed system.
3.1 First-level protection
At the first level, zero-bit watermarking [31] is used as the first protection. Suppose an entity (individual or company) designed and trained a machine learning model, particularly a neural network, and wants to apply zero-bit watermarking to it [31]. This model can then be deployed for various applications and services. In case of a security breach (where the model has been copied at the bit level), the entity can query the remote service suspected of reusing the leaked model to address its concerns. As with classic watermarking techniques [37, 38], the zero-bit watermarking approach involves embedding the watermark in the model (performed by the entity), verifying the presence or absence of the watermark in a suspected model (also performed by the entity) and studying probable attacks that others might perform to deliberately remove the watermark. Embedding a zero-bit watermark in a generic classifier is effective because it is unexpected to the attacker. Let \(d\) be the input space dimension, \(C\) the finite set of target labels and \(R\) the set of real numbers. Let \(k:{R}^{d}\to C\) be the problem's optimal classifier (i.e. \(k(x)\) returns the right answer), let \(\widehat{k}:{R}^{d}\to C\) be the trained classifier to be watermarked, and let \(F\) be the space of all possible classifiers. The goal is to obtain a zero-bit marked version of \(\widehat{k}\), denoted \({\widehat{k}}_{w}\), together with a set \(K\subset {R}^{d}\) of particular inputs, called the key, and their labels \(\{\hat{k}_{w}(x),\ x\in K\}\). The key is then used to query a remote model that is either \({\widehat{k}}_{w}\) or a different, unmarked model \(k_{r}\in F\). This key, which contains the "objects" to be classified directly, is used to insert the watermark into \(\widehat{k}\).
An ideal watermarked model and key pair \(({\widehat{k}}_{w}, K)\) should satisfy five requirements: loyal, efficient, effective, robust and secure.
- Loyal: The watermark embedding does not affect the original classifier's performance.
$$ \forall x \in R^{d} \setminus K,\quad \hat{k}\left( x \right) = \hat{k}_{w} \left( x \right) $$ (1)
- Efficient: The key is kept short, because reading the watermark necessitates |K| requests.
- Effective: The embedding enables the specific identification of \({\widehat{k}}_{w}\) by utilising K (zero-bit watermarking).
$$ \forall k_{r} \in F,\; k_{r} \ne \hat{k}_{w} \;\Rightarrow\; \exists x \in K \ \text{such that} $$ (2)
$$ k_{r} \left( x \right) \ne \hat{k}_{w} \left( x \right) $$ (3)
- Robust: Attempts to alter \({\widehat{k}}_{w}\), such as compression or fine-tuning, do not remove the watermark.
$$ \forall x \in K,\quad \left( {\hat{k}_{w} + \varepsilon } \right)\left( x \right) = \hat{k}_{w} \left( x \right) $$ (4)
- Secure: No efficient algorithm is available for an unauthorised party to detect the presence of the watermark in a model.
Figure 2 shows the methodology within the context of a binary classifier (without loss of generality). The selection of input points for watermarking the owned model and later querying a suspected remote model is crucial. A non-watermarking solution that simply selects |K| training examples (along with their correct labels) is highly unlikely to identify a specific model, since any highly accurate classifier will classify those points correctly, leading to comparable results and undermining effectiveness. Conversely, an alternative strategy selects |K| arbitrary examples and adjusts \(\widehat{k}\) to change their classification (i.e. for each \(x \in K\), \(\hat{k}(x) \ne \hat{k}_{w}(x)\)). This modifies the model's behaviour in a distinguishable manner.
However, fine-tuning, even on a small number of examples that may be distant from decision frontiers, will significantly impact \(\widehat{k}\)'s performance. The resulting solution will lack loyalty. These observations collectively suggest that the selected points should be close to the original model's decision frontier, meaning their classification is non-trivial and heavily relies on the model. The purpose of adversarial perturbations [39] [10] is to identify and manipulate such inputs. When given a trained model, any well-classified example can be subtly modified to be misclassified. These modified samples are called “adversarial examples” or adversaries.
The initial stage involves selecting a small key set, K, comprising two categories of adversaries in terms of input points. The first category consists of traditional adversaries, termed true adversaries, which \(\widehat{k}\) misclassifies despite their proximity to well-classified examples. The second category comprises false adversaries generated by applying an adversarial perturbation to a well-classified example without affecting its classification. In practical terms, the “fast gradient sign method” proposed in [39] is employed with a suitable gradient step to generate potential adversaries of both types from training examples. These adversaries are inputs closer to a decision frontier than their base inputs. The purpose of adversarial attacks is to modify these inputs in the direction of other classes.
Subsequently, these inputs, constrained near the decision frontier, are employed to embed the watermark in the model. The model \(\widehat{k}\) undergoes fine-tuning to become \({\widehat{k}}_{w}\), ensuring that all points in K are correctly classified.
The first-level’s technique using zero-bit watermarking [31] can be summarised in these points:
1. Adversarial example generation: The algorithm first generates adversarial examples. These inputs are slightly modified from correctly classified examples to cause misclassification by the model. The fast gradient sign method is typically used to create these adversarial examples.
2. Key set creation: A key set of true and false adversaries is created. True adversaries are generated by modifying inputs such that they cause misclassification, while false adversaries are created by making slight perturbations that do not change the classification of the input.
3. Frontier stitching: The model is fine-tuned using these adversarial examples. True adversaries are adjusted to be correctly classified by the model, effectively "stitching" the decision boundaries around these inputs. This ensures that the watermark, encoded in the specific classification of these adversaries, is embedded into the model's decision boundaries without significantly affecting overall model performance.
4. Watermark extraction: The presence of the watermark can be verified remotely by querying the model with the key set. The watermark is detected by checking if the model classifies these adversarial inputs as expected.
This process ensures that the watermark is subtly embedded into the model without significantly affecting its performance. Furthermore, the watermark can be extracted even when the model is accessed remotely via a service API. This provides a robust and efficient method for asserting ownership of machine learning models.
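Below is a minimal sketch of this first-level procedure, assuming a trained tf.keras MNIST classifier named `model` with softmax outputs and inputs scaled to [0, 1]; helper names such as `fgsm_perturb` and `build_key_set`, the epsilon value and the key composition are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np
import tensorflow as tf

def fgsm_perturb(model, x, y, eps=0.25):
    """Shift x by eps in the direction of the sign of the loss gradient (FGSM)."""
    x_t = tf.convert_to_tensor(x, dtype=tf.float32)
    y_t = tf.convert_to_tensor(y)
    with tf.GradientTape() as tape:
        tape.watch(x_t)
        loss = tf.keras.losses.sparse_categorical_crossentropy(y_t, model(x_t))
    grad = tape.gradient(loss, x_t)
    return tf.clip_by_value(x_t + eps * tf.sign(grad), 0.0, 1.0).numpy()

def build_key_set(model, x_train, y_train, key_size=20, eps=0.25, pool=1000):
    """Collect true adversaries (label flipped) and false adversaries (label kept)."""
    x_pool, y_pool = x_train[:pool], y_train[:pool]
    x_adv = fgsm_perturb(model, x_pool, y_pool, eps)
    pred = np.argmax(model.predict(x_adv, verbose=0), axis=1)
    true_idx = np.where(pred != y_pool)[0][: key_size // 2]   # misclassified after perturbation
    false_idx = np.where(pred == y_pool)[0][: key_size // 2]  # still correctly classified
    keys = np.concatenate([x_adv[true_idx], x_adv[false_idx]])
    labels = np.concatenate([y_pool[true_idx], y_pool[false_idx]])  # keep the original labels
    return keys, labels

# Embedding ("frontier stitching"): fine-tune so every key is classified with its
# original label, which corrects the true adversaries and fixes the false ones.
# keys, key_labels = build_key_set(model, x_train, y_train)
# model.fit(np.concatenate([x_train, keys]),
#           np.concatenate([y_train, key_labels]), epochs=5, batch_size=128)
```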
3.2 Second-level protection
The second level of protection consists of five sequenced proposals, as shown in Fig. 3.
3.2.1 The first proposal
The first proposal concerned choosing a suitable embedded adversarial attack. Various attacks were tried using several methods and parameters, namely flip, zoom, crop and rotate, as shown in Fig. 4. Two flip-attack experiments were conducted (horizontal flip and vertical flip), but both failed to give satisfactory results. The crop-attack experiments encompassed four trials (crop 10%, crop 15%, crop 20% and crop 25%), yet all cropping attempts proved unsuccessful. Four rotation-attack experiments were also conducted, using 30°, 60°, 90° and 120°. Rotations of 30°, 60° and 120° produced unsatisfactory outcomes, but rotating by 90° yielded perfect results. Hence, the experiments proved that the rotate attack was the best choice.
3.2.2 The second proposal
The second proposal was to re-label the selected samples, as shown in Fig. 5. The idea is to modify the assigned labels of specific data points in the training set. This modification was done by creating a dictionary that the algorithm uses to re-label the selected samples. The numbers in this dictionary were chosen to be distinctively different from the original MNIST labels so that the misclassification is deliberate. The selected samples were then updated with the targeted labels, and finally the DNN was retrained on the modified dataset.
3.2.3 The third proposal
The third proposal was an accuracy-improvement step that uses a pruning algorithm to eliminate unwanted weights, connections and nodes, thereby reducing the size of the neural network. The "prune_low_magnitude" technique was used, in which pruning is driven by the magnitude, or strength, of the weights. The first step was establishing an initial threshold (0.5) rising to a final threshold (0.8) for selecting connections to prune based on their magnitude values. Connections or weights falling below the designated magnitude threshold were then eliminated. Finally, the pruned model was retrained to refine its performance.
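A minimal sketch of this pruning step is shown below, assuming the TensorFlow Model Optimization toolkit; the 0.5 → 0.8 sparsity schedule mirrors the thresholds mentioned above, while the optimiser, batch size and epoch count are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

def prune_and_retrain(model, x_train, y_train, epochs=2, batch_size=128):
    end_step = int(np.ceil(len(x_train) / batch_size)) * epochs
    schedule = tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.5, final_sparsity=0.8, begin_step=0, end_step=end_step)
    pruned = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)
    pruned.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
    # Retraining refines the pruned model; UpdatePruningStep applies the sparsity masks.
    pruned.fit(x_train, y_train, epochs=epochs, batch_size=batch_size,
               callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
    # Strip the pruning wrappers so the final model keeps only the sparse weights.
    return tfmot.sparsity.keras.strip_pruning(pruned)
```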
3.2.4 The fourth proposal
The fourth proposal was to improve the second level's robustness against attacks, as shown in Fig. 6. The experimental results revealed that the second level's watermark (the proposed watermark) sometimes failed after attacks were applied. The improvement was made by applying a zoom attack with a factor of 0.75 to the selected sample after applying the 90° rotate attack. The zoom factor of 0.75 was not chosen arbitrarily; it was selected after trying several zoom factors (0.25, 0.5, 0.75, 1.25 and 1.5).
3.2.5 The fifth proposal
The fifth proposal was to improve the first level's robustness against attacks. The first level's watermark (the watermark proposed in [31]) often failed after attacks were applied. The enhancement was achieved by increasing the key size from 20 samples (k = 20) to 40 samples (k = 40). The choice of k = 40 was not random; it was made after experimenting with various key sizes (25, 30, 40, 50 and 100). This process is displayed in Fig. 7.
In summary, the second level of protection changes the input image by applying a 90° rotation and then a 0.75 zoom. The image's label is then replaced with another label that is distinctively different from the original, minimising the effect of this deliberate misclassification on the system. The labels are chosen randomly using a dictionary and kept fixed throughout the experiments. The embedded keys are these modified images together with their new labels. The robustness of the approach was verified experimentally by conducting multiple adversarial attacks on the system and confirming the survival of the second-level keys.
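The following is a minimal sketch of this second-level key construction, assuming 28 × 28 MNIST inputs; the `RELABEL` mapping, the choice of samples and the zoom-out implementation (resize to 21 × 21 and pad back to 28 × 28) are hypothetical illustrations, not the authors' actual dictionary or code.

```python
import numpy as np
import tensorflow as tf

# Assumed re-labelling dictionary: each digit maps to a distinctly different digit.
RELABEL = {0: 7, 1: 8, 2: 9, 3: 6, 4: 1, 5: 0, 6: 3, 7: 2, 8: 5, 9: 4}

def make_second_level_keys(x, y, key_size=4):
    """x: (N, 28, 28, 1) images in [0, 1]; y: integer labels."""
    idx = np.arange(key_size)                        # illustrative sample selection
    imgs = np.rot90(x[idx], k=1, axes=(1, 2))        # 90-degree rotation
    small = tf.image.resize(imgs, (21, 21))          # zoom factor 0.75 of the 28-pixel side
    zoomed = tf.image.resize_with_crop_or_pad(small, 28, 28).numpy()
    new_labels = np.array([RELABEL[int(label)] for label in y[idx]])
    return zoomed, new_labels

# The keys and their new labels are then added to the training data and the
# model is fine-tuned, exactly as in the first level.
```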
Figure 8 shows a flow chart briefly describing the hybrid two-level protection system process. The process begins at “Start.” It then proceeds to “Generate key set K (zero-bit watermark).” The next step is “fine-tuning.” This is followed by an “Adversarial attack (example: rotate or zoom).” The result of this attack is “changing its classification deliberately.” Another round of “fine-tuning” follows. The process then reaches a decision point labelled “Acceptable accuracy.” If the accuracy is unacceptable (“No”), the process loops back for more fine-tuning. The process ends if the accuracy is acceptable (“Yes”).
4 System verification
The verification of the proposed system is done in three stages. First, requests are sent to the external DNN service provider to obtain the output labels associated with randomly selected keys. Second, the count of discrepancies between the model's predictions and the designated key labels is computed. Finally, a threshold test is applied to the number of discrepancies, as shown in Fig. 9 (a code sketch follows the list below):
- If the count of discrepancies falls below the threshold, it indicates a high similarity between the model used by the external service provider and the watermarked DNN.
- If the count of discrepancies is zero, it implies that the two models are identical duplicates.
- If the count of discrepancies is above the threshold, this shows a low similarity to the model in question. Hence, the investigated model is probably not the watermarked model.
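A minimal sketch of this three-stage verification is given below, assuming `query_remote` wraps the suspected service's prediction API and returns an integer label; its name and the interpretation of the threshold as a fraction of the key size are assumptions.

```python
import numpy as np

def verify_ownership(query_remote, keys, key_labels, threshold=0.05):
    """Compare remote predictions with the key labels and return a verdict."""
    remote_preds = np.array([query_remote(k) for k in keys])   # stage 1: query the service
    mismatches = int(np.sum(remote_preds != key_labels))       # stage 2: count discrepancies
    limit = threshold * len(keys)                              # stage 3: threshold test
    if mismatches == 0:
        return "identical duplicate of the watermarked model"
    if mismatches < limit:
        return "high similarity: likely the watermarked model"
    return "low similarity: probably not the watermarked model"
```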
The conflicting relationship between robustness and effectiveness is noticeable, especially in situations where, for instance, \(\left( {\hat{k}_{w} + \varepsilon } \right) \in F\) violates one of the two attributes. To establish a practical framework for the problem, a measure \(m_{K}(a,b)\) was introduced to assess the matching between two classifiers, a and b, both belonging to the set F [31]:
$$ m_{K}(a, b) = \sum_{x \in K} \left( 1 - \delta\big(a(x), b(x)\big) \right) $$ (5)
where \(\delta\) denotes the Kronecker delta. Notice that \(m_{K}(a, b)\) is essentially the Hamming distance between the vectors of labels that a and b assign to the elements of K. By expressing the two criteria in terms of this distance, they can be reformulated in a way that avoids conflicts.
- Effectiveness:
$$ \forall k_{r} \in F,\quad m_{K}(\hat{k}_{w}, k_{r}) \approx |K| $$ (7)
- Robustness:
$$ \forall \varepsilon \approx 0,\quad m_{K}\left( \hat{k}_{w}, \hat{k}_{w} + \varepsilon \right) \approx 0 $$ (8)
5 Discussion and comparative analysis
Experiments were performed on the MNIST dataset [40], employing the Keras backend [41] integrated with the TensorFlow platform [42]. The CNN architecture comprises three convolutional layers (of sizes 16, 32 and 64) with a kernel size of 3 × 3, followed by a flatten layer and a dense layer, using ReLU as the activation function. This architecture is the same as the published code of [31]. The first level uses key = 20, threshold = 0.05 and epochs = 3, with fine-tuning epochs = 5. The second level uses key = 4, threshold = 0.4 and epochs = 3, with fine-tuning epochs = 2. None of these parameters was chosen arbitrarily; each was selected after repeated experiments with various settings.
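A minimal sketch of a Keras model matching this description (three convolutional layers of 16, 32 and 64 filters with 3 × 3 kernels, a flatten layer and a dense output) is shown below; the optimiser, loss and output size beyond what is stated are assumptions.

```python
import tensorflow as tf

def build_mnist_cnn(num_classes=10):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(16, (3, 3), activation='relu'),
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```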
After applying the five proposals mentioned in the “proposed approach,” the hybrid two-level protection system was tested against several adversarial attacks. Seven different adversarial attacks were chosen to evaluate the system (Fast Gradient Method Attack, Auto Projected Gradient Descent Attack, Auto Conjugate Gradient Attack, Basic Iterative Method Attack, Momentum Iterative Method Attack, Square Attack and Auto Attack).
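As an illustration only, the sketch below shows how a subset of these attacks could be instantiated with the Adversarial Robustness Toolbox (ART); the paper does not name its attack tooling, so the library choice, the epsilon value, the attack subset shown and how the adversarial examples are then used against the protected model are all assumptions.

```python
import numpy as np
import tensorflow as tf
from art.estimators.classification import TensorFlowV2Classifier
from art.attacks.evasion import (FastGradientMethod, BasicIterativeMethod,
                                 SquareAttack, AutoAttack)

def run_attacks(model, x_test, y_test, eps=0.1):
    classifier = TensorFlowV2Classifier(
        model=model, nb_classes=10, input_shape=(28, 28, 1),
        loss_object=tf.keras.losses.CategoricalCrossentropy(),
        clip_values=(0.0, 1.0))
    attacks = {
        "FGM": FastGradientMethod(estimator=classifier, eps=eps),
        "BIM": BasicIterativeMethod(estimator=classifier, eps=eps),
        "Square": SquareAttack(estimator=classifier, eps=eps),
        "AutoAttack": AutoAttack(estimator=classifier, eps=eps),
    }
    for name, attack in attacks.items():
        x_adv = attack.generate(x=x_test[:200])  # small batch keeps the sketch fast
        adv_acc = np.mean(
            np.argmax(model.predict(x_adv, verbose=0), axis=1) == y_test[:200])
        print(f"{name}: accuracy on adversarial inputs = {adv_acc:.3f}")
    # After each attack scenario, the verification procedure of Section 4 is
    # re-run on the model under test to check that both watermarks survive.
```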
Compared to [31], the proposed system achieved better overall accuracy and survived multiple adversarial attacks, preserving both levels of watermarking.
5.1 Adversarial attacks
5.1.1 Fast Gradient Method Attack
The Fast Gradient Method (FGM) was introduced by Goodfellow et al. in [39]. This technique presents a rapid and effective approach for generating adversarial examples by perturbing input data in the direction that maximises the loss function. Specifically, the authors leverage the gradient of the loss with respect to the input data to determine the modification direction necessary to increase the loss. The first step is calculating the gradient of the loss function with respect to the input data. The computed gradient is then used to identify the direction in which the input data should be adjusted to maximise the loss. Finally, a small perturbation is introduced to the input data in the determined direction. The computational efficiency of the FGM attack renders it suitable for real-time applications that generate adversarial examples. These examples, designed to deceive machine learning models, expose vulnerabilities in their decision boundaries. The FGM attack has found widespread use in exploring adversarial robustness, contributing significantly to our understanding of how subtle perturbations in input data can affect machine learning models.
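In symbols, the single-step perturbation of [39] (for the L∞-bounded case, also known as the fast gradient sign method) can be written as:
$$ x_{\mathrm{adv}} = x + \varepsilon \cdot \operatorname{sign}\big( \nabla_{x} J(\theta, x, y) \big) $$
where \(J(\theta, x, y)\) is the loss of the model with parameters \(\theta\) on input \(x\) with true label \(y\), and \(\varepsilon\) controls the perturbation magnitude.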
5.1.2 Auto Projected Gradient Descent Attack
The Auto Projected Gradient Descent (Auto-PGD) attack was introduced by Croce et al. [43]. It is used to evaluate the robustness of machine learning models against adversarial examples. Auto-PGD is an extension of the Projected Gradient Descent (PGD) attack, a popular method for generating adversarial examples. Within adversarial attacks, PGD aims to modify input data to deceive the model intentionally. PGD introduces parameters, a perturbation cost and a step size to regulate the quantity and direction of the perturbation. Auto-PGD optimises the attack strength by adapting the step size across iterations depending on the overall attack budget and the progress of the optimisations.
5.1.3 Auto Conjugate Gradient Attack
The Auto Conjugate Gradient (ACG) attack is a white-box adversarial attack introduced by Yamamura et al. [44]. It is based on the conjugate gradient (CG) method, a gradient-based optimisation technique. The ACG attack is designed to generate adversarial examples that mislead the prediction of a machine learning model. ACG found more adversarial examples with fewer iterations than the existing state-of-the-art Auto-PGD (APGD) algorithm. The authors also proposed a measure called the diversity index (DI) to quantify the degree of diversification of the attacks, and they showed that the more diverse search of the proposed method remarkably improves its attack success rate.
5.1.4 Basic Iterative Method Attack
The Basic Iterative Method (BIM) Attack [45] is a form of iterative adversarial attack designed to create controlled perturbations in input data with the intention of deceiving machine learning models. This method applies a sequence of small, controlled perturbations to the input data across multiple iterations. The iterative approach enables a gradual exploration of the input space, facilitating the discovery of subtle perturbations capable of causing misclassification in the targeted model. BIM is an iterative approach similar to PGD, in that the perturbed pixels are kept reasonably close to the original input.
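The iterative update of [45] can be written as:
$$ x^{adv}_{0} = x, \qquad x^{adv}_{t+1} = \mathrm{Clip}_{x,\varepsilon}\Big( x^{adv}_{t} + \alpha \cdot \operatorname{sign}\big( \nabla_{x} J(\theta, x^{adv}_{t}, y) \big) \Big) $$
where \(\alpha\) is the per-step size and \(\mathrm{Clip}_{x,\varepsilon}\) keeps each pixel within \(\varepsilon\) of the original input.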
5.1.5 Momentum Iterative Method Attack
The Momentum Iterative Method (MIM) Attack [46] represents an advancement over traditional iterative adversarial attack approaches such as the Basic Iterative Method (BIM) and Projected Gradient Descent (PGD). It integrates a momentum term into the iterative process, enabling the accumulation of perturbations in a consistent direction across iterations. The introduced momentum helps the attack navigate the input space more efficiently, potentially discovering additional adversarial directions compared to methods without momentum. The MIM Attack thus leverages momentum to improve the effectiveness of identifying adversarial examples, and it was proposed to enhance the success rate of adversarial attacks on machine learning models, particularly within computer vision tasks.
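The momentum-accumulated update of [46] can be written as:
$$ g_{t+1} = \mu \, g_{t} + \frac{\nabla_{x} J(\theta, x^{adv}_{t}, y)}{\lVert \nabla_{x} J(\theta, x^{adv}_{t}, y) \rVert_{1}}, \qquad x^{adv}_{t+1} = x^{adv}_{t} + \alpha \cdot \operatorname{sign}(g_{t+1}) $$
where \(\mu\) is the momentum decay factor and \(\alpha\) is the step size.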
5.1.6 Square Attack
The Square Attack was presented by Andriushchenko et al. [47]. Square Attack is a black-box adversarial attack that efficiently generates adversarial examples through random search methods, eliminating the need for explicit gradient information. Square Attack employs a randomised search strategy that selects square-shaped updates at random positions. In each iteration, this perturbation is strategically positioned near the boundary of the feasible set. This approach aims to achieve effectiveness and practicality, especially when there is limited access to detailed knowledge about the model’s internal parameters.
5.1.7 Auto Attack
Croce et al. introduced the Auto Attack method in [43]. This method is designed to offer a robust assessment of the adversarial resilience of machine learning models. In contrast to conventional attack methods, Auto Attack constitutes an ensemble of varied attacks that operate without manually specified parameters. The primary objective of this approach is to improve the dependability and inclusiveness of adversarial-robustness evaluations by employing a spectrum of attack strategies without requiring user-defined parameters. The proposed ensemble of attacks is designed to overcome the limitations of existing attacks and provide a more reliable evaluation of adversarial robustness. These four attacks are the Auto Projected Gradient Descent Attack, DeepFool, Square Attack and the Projected Gradient Descent Attack.
5.2 The first, second and third proposal's results
The first proposal revolved around the selection of an appropriate embedded adversarial attack. After several experiments, it was determined that the most effective among all the tested attacks was the rotation attack with 90°. Then, the idea mentioned in the second proposal was applied, which involved re-labelling selected samples. The concept was to alter the assigned labels of particular data points within the training set. Subsequently, the selected samples were updated with the targeted labels. The third proposal represented a refinement approach to enhance accuracy by implementing a pruning algorithm, which removed unwanted weights, connections and nodes using the “prune_low_magnitude” technique.
Fine-tuning the model after embedding the watermark is essential for maintaining its accuracy and robustness. The fine-tuning process ensures that the model adapts to the modifications introduced by the watermarking procedure without significant performance degradation. Here is how the fine-tuning process works and its contributions:
1. Initial Watermark Embedding
- Adversarial Examples: Initially, adversarial examples are generated and introduced into the model.
- Embedding Process: The embedding uses a fine-tuning process in which the new samples retrain the model.
2. Fine-Tuning Epochs
After embedding the watermark, fine-tuning is carried out through several epochs of additional training. This process involves:
- Continuing Training: The model is trained further on the original training data, including the adversarial examples, to stabilise its performance.
- Adjusting the Learning Rate: A lower learning rate is typically used during fine-tuning to make incremental adjustments without significant overhauls to the model's learned parameters (see the sketch after this list).
3. Maintaining Model Accuracy
Fine-tuning helps to stabilise the decision boundaries adjusted by the watermark embedding process. The goal is to retain the model's original accuracy while incorporating the watermark.
4. Ensuring Robustness
- Reinforcing the Watermark: Fine-tuning solidifies the watermark within the model, making it resilient to attempts at watermark removal.
- Pruning Low-Magnitude Weights: This technique removes weights with magnitudes below a certain threshold, which can help reduce the model size and improve efficiency without significantly impacting accuracy. Fine-tuning post-pruning is crucial to recover any minor performance loss and ensure the remaining weights are optimised. Pruning also reduces overfitting and improves generalisation.
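A minimal sketch of this post-embedding fine-tuning step is given below, assuming the watermark keys have already been generated; the 1e-4 learning rate, the batch size and the epoch count are illustrative assumptions rather than the authors' exact values.

```python
import numpy as np
import tensorflow as tf

def fine_tune_after_embedding(model, x_train, y_train, keys, key_labels,
                              epochs=5, learning_rate=1e-4):
    # Recompile with a lower learning rate so adjustments stay incremental.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # Continue training on the original data plus the watermark keys.
    x = np.concatenate([x_train, keys])
    y = np.concatenate([y_train, key_labels])
    model.fit(x, y, epochs=epochs, batch_size=128)
    return model
```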
Table 1, Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7 present the comparison of Le Merrer's accuracy [31] and the hybrid system's accuracy in the first and second columns, respectively. The third column shows the accuracy of the hybrid system after applying the "third proposal: pruning." The fourth column lists the accuracy results after applying one of the seven adversarial attacks. Finally, the fifth column shows the verification results of the overall hybrid system, which consists of the first and second levels of protection. If the watermark is detected after applying the attack, it appears in the table in green with the description "watermark is successfully verified after attack"; if not, it appears in red with the description "watermark is not successfully verified after attack."
5.3 The fourth proposal's results
The fourth proposal involves enhancing the resilience of the second-level against attacks. The experiment results indicate occasional failure of the second-level’s watermark (the proposed watermark) after undergoing attacks. To address this issue, the enhancement includes applying a zoom attack with a factor of (0.75) to the selected sample following the rotate attack (90°).
Table 8, Table 9, Table 10, Table 11, Table 12, Table 13 and Table 14 present the comparison of Le Merrer's accuracy [31] and the accuracy of the hybrid system in the first and second columns, respectively. The third column displays the accuracy of the hybrid system after applying the "third proposal: pruning." The fourth column contains the accuracy results after implementing one of the seven adversarial attacks. Finally, the fifth column showcases the verification outcomes of the overall hybrid system, which includes both the first and second levels of protection. If the watermark remains intact and robust after an attack, it is indicated in the table in green with the description "watermark is successfully verified after attack." Conversely, if the watermark is not successfully verified after an attack, it is indicated in red with the description "watermark is not successfully verified after attack."
5.4 The fifth proposal's results
The fifth proposal focuses on enhancing the first-level’s resistance to attacks, particularly when the proposed watermark [31] on the first-level fails after undergoing attacks. The improvement involves modifying the number of selected key samples from (k = 20, i.e. 20 samples) to (k = 40, i.e. 40 samples). Table 15, Table 16, Table 17, Table 18, Table 19, Table 20 and Table 21 present the comparison of Le Merrer’s accuracy [31] and the accuracy of the hybrid system in the first and second columns, respectively. The third column presents the accuracy of the hybrid system after implementing the “third proposal: pruning.” The fourth column contains accuracy results after applying one of the seven adversarial attacks. Lastly, the fifth column showcases the verification outcomes of the overall hybrid system, encompassing both the first and second-levels of protection. If the watermark remains resilient and intact after an attack, it is denoted on the table with a green colour along with the description “watermark is successfully verified after attack.”
The five proposals' gradual effect on enhancing accuracy, from the first to the fifth, is shown in Fig. 10 and Fig. 11. The FGM attack tables (1, 8 and 15) were chosen as the case study. Figure 10 illustrates the difference between Le Merrer's accuracy [31] (grey bars), the hybrid system's accuracy (blue bars), the enhancement of accuracy after applying pruning (green bars) and the hybrid accuracy after applying the FGM attack (red bars). In addition, Fig. 10 shows that the accuracy after applying the attack is sufficiently acceptable and that the attack did not significantly affect the accuracy under the first, second, third, fourth and fifth proposals. Figure 10 also shows that, after applying the fifth proposal, the two-level system's accuracy is better than Le Merrer's [31] by about 10%.
Figure 11 shows the watermark’s first-level and second-level success rates passing through the five proposals when applying the FGM attack. The blue line illustrates a gradual enhancement in the success rate as proposals from proposal one to proposal five were applied. As demonstrated, after applying the first, second and third proposals, the success rate was 40%, implying that the first-level watermark (Le Merrer’s watermark [31]) persisted in only 4 out of 10 instances. To enhance this, the fifth proposal was applied, as mentioned before, which resulted in 100% of tested cases showing that the first-level's watermark still exists even after the attack. Likewise, the green line also shows the enhancement of the success rate but for the second-level watermark.
The potential limitations of the proposed hybrid two-level system are:
1. The experiment's execution time was high during fine-tuning and pruning, which means the system had a high computational cost.
2. The experiments for choosing an adversarial attack in the first proposal were done manually, which was inefficient and may have overlooked other optimal values.
6 Declarations
We, the authors, have no conflicts of interest to disclose. Also, we declare that we have no significant competing financial, professional or personal interests that might have influenced the performance or presentation of the work described in this manuscript.
7 Conclusion
Training deep neural networks (DNNs) requires substantial time and vast amounts of data, and often involves high computational costs. Unauthorised selling or distribution of these models poses a significant challenge, highlighting the crucial issue of copyright protection for DNNs. This paper presented a hybrid two-level protection system to preserve the ownership of pre-trained DNN models. The system ensures that if one level fails, the other will survive. The second-level of the proposed system includes five key proposals. The first proposal is to choose a suitable adversarial attack, specifically rotating the selected samples by 90°. The second proposal is to re-label the selected samples. The third proposal employs a pruning technique called “prune_low_magnitude.” The fourth proposal enhances the second-level's robustness against attacks by applying a zoom attack with a factor of 0.75 to the selected sample after the 90° rotation. The fifth proposal strengthens the first-level’s robustness by increasing the number of selected key samples from 20 to 40. These proposals create a powerful system capable of withstanding diverse types of adversarial attacks. The resilience of the proposed system was evaluated against seven types of attacks: Fast Gradient Method Attack, Auto Projected Gradient Descent Attack, Auto Conjugate Gradient Attack, Basic Iterative Method Attack, Momentum Iterative Method Attack, Square Attack and Auto Attack. After these attacks, the accuracy degradation was measured, showing a slight decrease ranging from 0.1 to 0.4. This minor reduction does not significantly impact the system's performance. The proposed two-level system proved to be more resilient, with less accuracy loss, and can survive adversarial attacks better than other state-of-the-art methods.
Data availability
The dataset used in this paper is the MNIST dataset [40].
The MNIST dataset was chosen for the following reasons:
1. To compare the second-level protection with the first level, as the latter is published online with its code.
2. Simplicity and ease of use: the MNIST dataset is straightforward and requires minimal time and effort in data pre-processing and formatting.
3. Real-world data: the dataset consists of real-world data, specifically images of handwritten digits.
4. Benchmarking: the MNIST dataset allows researchers to compare the performance of their algorithms with others in a consistent manner, as done in this paper.
5. Size: the dataset is large enough (60,000 training images and 10,000 testing images) for training robust models but not so large as to be computationally prohibitive.
6. Standardisation: the images in the MNIST dataset have been pre-processed to fit into a 28 × 28 pixel grid, which simplifies the task for the learning algorithm.
7. Public availability: the MNIST dataset is publicly available.
References
Alzubaidi L, Zhang J, Humaidi AJ et al (2021) Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data 8:53. https://doi.org/10.1186/s40537-021-00444-8
Maheshwari A (2019) Digital transformation: building intelligent enterprises
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10.1038/nature14539
Wang B, Yao Y, Shan S, et al (2019) Neural Cleanse: Identifying and mitigating backdoor attacks in neural networks. In: Proceedings of 40th IEEE Symposium on Security and Privacy. pp 1–17
Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional neural networks. In: Proceedings of the International Conference on Neural Information Processing Systems. pp 1–9
Karpathy A, Toderici G, Shetty S, et al (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of European Conference on Computer Vision
van den Oord A, Dieleman S, Schrauwen B (2013) Deep content-based music recommendation. In: Proceedings of the International Conference on Neural Information Processing Systems. pp 2643–2651
Mikolov T, Karafiat M, Burget L, et al (2010) Recurrent neural network based language model. In: Proceedings of INTERSPEECH 1045–1048
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp 770–778
Nagai Y, Uchida Y, Sakazawa S, Satoh S (2018) Digital watermarking for deep neural networks. Int J Multimed Info Retr 7:3–16. https://doi.org/10.1007/s13735-018-0147-1
Tramer F, Zhang F, Juels A (2016) Stealing machine learning models via prediction APIs. In: 25th USENIX Security Symposium. pp 601–618
Meng R, Cui Q, Yuan C (2018) A Survey of image information hiding algorithms based on deep learning. Comput Modeling Eng Sci 117:425–454. https://doi.org/10.31614/cmes.2018.04765
Li Z, Guo S (2019) DeepStego: Protecting intellectual property of deep neural networks by steganography
Naor D, Naor M, Lotspiech J (2001) Revocation and tracing schemes for stateless receivers. In: Proceedings of the Annual International Cryptology Conference. pp 41–62
Fkirin A, Attiya G, El-Sayed A (2016) Steganography literature survey, classification and comparative study. Commun Appl Electron 5:13–22. https://doi.org/10.5120/cae2016652384
Fkirin A, Attiya G, El-Sayed A (2017) A new approach for colored watermarking image into gray scale image using wavelet fusion. Opt Quant Electron 49:284. https://doi.org/10.1007/s11082-017-1120-6
Fkirin A, Attiya G, El-Sayed A (2021) Two-level security approach combining watermarking and encryption for securing critical colored images. Opt Quant Electron 53:285. https://doi.org/10.1007/s11082-021-02875-2
Uchida Y, Nagai Y, Sakazawa S (2017) Embedding watermarks into deep neural networks. In: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval. pp 269–277
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Proceedings of International Conference on Learning Representations. pp 1–14
Adi Y, Baum C, Cisse M, et al. (2018) Turning your weakness into a strength: watermarking deep neural networks by backdooring. In: Proceedings of the 27th USENIX Security Symposium. pp 1615–1631
LeCun Y, Jackel L, Boser B et al (1989) Handwritten digit recognition: applications of neural network chips and automatic learning. IEEE Commun Mag 27:41–46
Han S, Pool J, Tran J, Dally WJ (2015) Learning both weights and connections for efficient neural networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. pp 1135–1143
Molchanov P, Tyree S, Karras T, et al (2017) Pruning convolutional neural networks for resource efficient transfer learning. In: Proceedings of International Conference on Learning Representations. pp 1–17
Srinivas S, Babu RV (2015) Data-free Parameter pruning for deep neural networks. In: Proceedings of British Machine Vision Conference. pp 31.1–31.12
Pittaras N, Markatopoulou F, Mezaris V, Patras I (2017) Comparison of finetuning and extension strategies for deep convolutional neural networks. In: Proceedings of International Conference on Multimedia Modeling. pp 226–237
Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural networks? Adv Neural Inf Process Syst 4:3320–3328
Zhang J, Gu Z, Jang J, et al. (2018) Protecting intellectual property of deep neural networks with watermarking. In: Proceedings of the 2018 on Asia Conference on Computer and Communications Security - ASIACCS ’18. ACM Press, New York, New York, USA, pp 159–172
Zhong Q, Zhang LY, Zhang J, et al (2020) Protecting IP of deep neural networks with watermarking: A new label helps. In: Lauw HW, Wong RC-W, Ntoulas A, et al (eds) Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer International Publishing, Cham, pp 462–474
Fkirin A, Attiya G, El-Sayed A, Shouman MA (2022) Copyright protection of deep neural network models using digital watermarking: a comparative study. Multimed Tools Appl 81:15961–15975. https://doi.org/10.1007/s11042-022-12566-z
Rouhani B, Chen H, Koushanfar F (2018) DeepSigns: A generic watermarking framework for protecting the ownership of deep learning models. In: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, USA, pp 485–497
Le Merrer E, Pérez P, Trédan G (2020) Adversarial frontier stitching for remote neural network watermarking. Neural Comput Appl 32:9233–9244. https://doi.org/10.1007/s00521-019-04434-z
Wang J, Wu H, Zhang X, Yao Y (2020) Watermarking in deep neural networks via error back-propagation. In: IS&T International Symposium on Electronic Imaging 2020 Media Watermarking, Security, and Forensics. pp 1–9
Gupta L, Gupta M, Meeradevi, et al (2021) Digital watermarking to protect deep learning model. In: Proceedings of the International Conference on Intelligent and Smart Computing in Data Analytics, Advances in Intelligent Systems and Computing. Springer Singapore, pp 207–214
Bangyal WH, Qasim R, Rehman NU et al (2021) Detection of fake news text classification on COVID-19 using deep learning approaches. Comput Math Methods Med 1:5514220. https://doi.org/10.1155/2021/5514220
Contreras Hernández S, Tzili Cruz MP, Espínola Sánchez JM, Pérez Tzili A (2023) Deep learning model for COVID-19 sentiment analysis on twitter. N Gener Comput 41:189–212. https://doi.org/10.1007/s00354-023-00209-2
Bangyal WH, Iqbal M, Bashir A, Ubakanma G (2023) Polarity classification of twitter data using machine learning approach. In: 2023 International Conference on Human-Centered Cognitive Systems (HCCS). IEEE, pp 1–6
Van Schyndel RG, Tirkel AZ, Osborne CF (1994) A digital watermark. In: Proceedings of 1st international conference on image processing (Vol 2, pp 86–90). IEEE.https://doi.org/10.1109/ICIP.1994.413536
Hartung F, Kutter M (1999) Multimedia watermarking techniques. Proc IEEE 87:1079–1107. https://doi.org/10.1109/5.771066
Goodfellow IJ, Shlens J, Szegedy C (2015) Explaining and harnessing adversarial examples. In: 3rd International Conference on Learning Representations, ICLR 2015—Conference Track Proceedings. pp 1–11
Lecun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86:2278–2324. https://doi.org/10.1109/5.726791
Chollet, F (2015) keras. https://keras.io
Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S et al. (2015) TensorFlow: Large-scale machine learning on heterogeneous systems.
Croce F, Hein M (2020) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In: 37th International Conference on Machine Learning, ICML 2020. pp 2184–2194
Yamamura K, Sato H, Tateiwa N, et al (2022) Diversified adversarial attacks based on conjugate gradient method. In: Proceedings of Machine Learning Research. pp 24872–24894
Kurakin A, Goodfellow IJ, Bengio S (2018) Adversarial examples in the physical world. In: Artificial Intelligence Safety and Security, 1st edn. Chapman and Hall/CRC Press, Boca Raton, FL, pp 99–112
Dong Y, Liao F, Pang T, et al (2018) Boosting adversarial attacks with momentum. In: Proceedings of the IEEE conference on computer vision and pattern recognition,(CVPR) 2018. pp 9185–9193
Andriushchenko M, Croce F, Flammarion N, Hein M (2020) Square Attack: a query-efficient black-box adversarial attack via random search. In: Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol 12368, pp 484–501. https://doi.org/10.1007/978-3-030-58592-1_29
Funding
Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fkirin, A., Moursi, A., Attiya, G. et al. Hybrid two-level protection system for preserving pre-trained DNN models ownership. Neural Comput & Applic 36, 21415–21449 (2024). https://doi.org/10.1007/s00521-024-10304-0