In order to examine the resilience and susceptibility of different Deep Neural Network-based watermarking methods, we expose the watermarked images to unseen benign and malicious transformations and evaluate the resulting bit recovery accuracy (BRA). The results highlighted in bold are the top performances.
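For reference, BRA is the fraction of embedded watermark bits that the decoder recovers correctly after a transformation. The minimal sketch below illustrates how such a metric can be computed; the `embed`, `decode`, and `g` callables in the commented usage are placeholders standing in for the watermark encoder, decoder, and a benign or malicious transform, not the actual model API.

```python
import numpy as np

def bit_recovery_accuracy(embedded_bits: np.ndarray, recovered_bits: np.ndarray) -> float:
    """Fraction of watermark bits recovered correctly, reported in percent."""
    assert embedded_bits.shape == recovered_bits.shape
    return 100.0 * float(np.mean(embedded_bits == recovered_bits))

# Hypothetical usage: embed a random secret, apply a transform g, then decode.
# secret = np.random.randint(0, 2, size=128)
# x_w = embed(x, secret)
# bra = bit_recovery_accuracy(secret, decode(g(x_w)))
```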
6.1 Benign Transforms
For benign transformations, we applied different levels of Gaussian blur, JPEG compression, and several Instagram filters, namely Aden, Brooklyn, and Clarendon. We implemented these filters with the open-source Pilgram library [2], which provides a range of image-processing filters, including Instagram-style filters, and used them to test their impact on our proposed watermarking scheme.
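For illustration, the filters can be applied with a few lines of Python using Pilgram; the file path below is a placeholder, and chaining the filter calls is one natural way to realize the combined settings (e.g., A\(+\)B) described later in this section.

```python
from PIL import Image
import pilgram  # pip install pilgram

im = Image.open("watermarked_face.png")  # placeholder path to a watermarked image

aden = pilgram.aden(im)            # filter A
brooklyn = pilgram.brooklyn(im)    # filter B
clarendon = pilgram.clarendon(im)  # filter C

# Combined filters via chaining, e.g. (A + B) and (A + B + C):
a_plus_b = pilgram.brooklyn(pilgram.aden(im))
a_plus_b_plus_c = pilgram.clarendon(pilgram.brooklyn(pilgram.aden(im)))
```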
Figure 4(a) illustrates Gaussian blur applied to invisible watermarked images using different kernel sizes and \(\sigma\) values. Figure 4(b) illustrates JPEG compression applied to invisible watermarked images at compression rates varying from \(25\%\) to \(75\%\). For Figure 4, the watermarked samples are generated using our U-Net\(+\)C\(+A_{adv}\) model trained only on benign transformations.
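For concreteness, the two benign transforms can be reproduced roughly as sketched below using OpenCV and Pillow; the specific kernel sizes, \(\sigma\) values, and quality factors in the example are assumptions standing in for the settings used in Figure 4 and Tables 3 and 4.

```python
import io
import cv2
import numpy as np
from PIL import Image

def gaussian_blur(img: np.ndarray, ksize: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Apply Gaussian blur with an odd kernel size and standard deviation sigma."""
    return cv2.GaussianBlur(img, (ksize, ksize), sigma)

def jpeg_compress(img: np.ndarray, quality: int = 50) -> np.ndarray:
    """Round-trip the image through JPEG at the given quality factor."""
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.array(Image.open(buf))

# Assumed example settings, e.g. blur with k = 3, 5, 7 and JPEG quality 25/50/75:
# blurred = gaussian_blur(x_w, ksize=7, sigma=2.0)
# compressed = jpeg_compress(x_w, quality=25)
```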
Table 3 tabulates the BRA (in \(\%\)) for different watermarking techniques after applying Gaussian blur as a benign transform at different kernel sizes and \(\sigma\) values. Table 4 tabulates the BRA (in \(\%\)) for different watermarking techniques after applying JPEG compression (benign transform) at compression rates of \(25\%\), \(50\%\), and \(75\%\), respectively.
Table 5 tabulates the effect of unseen benign transformations on invisible watermarked images in terms of BRA using different Instagram filters. All models are trained and tested on the CelebA dataset.
The overall performance in terms of BRA is \(97.1\%\) for U-Net\(+\)C\(+A_{adv}\) on Gaussian blur, which is better than the \(96.3\%\) BRA of the second-best model, StegaStamp, when only benign transformations \(g_{bt}\) are used during the training stage. Similar observations hold for JPEG compression and the unseen benign transformations.
Figure 5 illustrates the application of individual Instagram filters and their combinations as unseen benign transforms on invisible watermarked images. In this study, the Instagram filters Aden, Brooklyn, and Clarendon are used; these filters are designed to enhance facial images with unique aesthetic effects, each altering images in distinctive ways to cater to diverse visual preferences and styles. In the figure, (A\(+\)B) denotes the combined application of the Aden and Brooklyn filters to the watermarked image, (B\(+\)C) denotes the combined application of the Brooklyn and Clarendon filters, and (A\(+\)B\(+\)C) denotes the combined application of the Aden, Brooklyn, and Clarendon filters. The evaluation shows that the model incorporating U-Net with a Critic (\(C\)) and an Adversary (\(A_{adv}\)) network performs best in terms of BRA. The overall performance in terms of BRA is \(98.73\%\) for U-Net\(+\)C\(+A_{adv}\) on unseen benign transformations, which is better than the \(98.29\%\) BRA of the second-best model, FaceSigns (Semi-Fragile), when only benign transformations \(g_{bt}\) are used during the training stage. This superior performance is consistently observed across settings involving different image filters. The consistent performance across all unknown benign transformations can be attributed to the model's specialized training exclusively on benign transformations such as cropping, compression, or subtle filtering. Training specifically on these types of transformations enables the model to become highly proficient in identifying and managing the specific patterns and distortions they introduce.
Overall, training the model using only benign transformations renders it robust to unseen benign transformations. Further, the integration of the adversary module during the training stage also plays a pivotal role in enhancing the robustness and imperceptibility of the watermarking process to unknown benign transformations. This process renders the watermark resilient against a variety of benign transformations, thereby preserving the integrity of the media content.
6.2 Malicious Transforms
For malicious transformations, we applied facial manipulations to the watermarked images using different generative models, namely auto-encoders, GANs, and diffusion models, and calculated the BRA. In this case, a low BRA is preferred, as it reflects fragility against malicious transformations such as Deepfakes.
FaceSwap-Based Malicious Transforms: The FaceSwap model is a graphics-based method that aligns facial landmarks to swap faces between the source and the target using an encoder-decoder style model. This technology is widely used for applications ranging from entertainment and media to more serious uses such as personalized advertisements and synthetic data generation for AI training.
Figure 6 shows sample watermarked facial images with identity swaps generated by the FaceSwap model. The inputs to the FaceSwap model are the source image \(x_{sw}\) (not watermarked) and the target image \(x_{tw}\) (watermarked). The output is the maliciously transformed facial image \(x_{mt}=g_{mt}(x_{sw},x_{tw})\) with the identity swapped between the source and the target. For a detailed technical description and implementation details, please refer to the original work on face-swap-based malicious transforms.
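The evaluation protocol for a malicious transform can be sketched as follows; `face_swap` and `decoder` are hypothetical callables standing in for the FaceSwap model and the watermark decoder, and a low BRA on \(x_{mt}\) is the desired outcome.

```python
import numpy as np

def evaluate_malicious_transform(x_sw, x_tw, secret_bits, face_swap, decoder):
    """Apply a malicious transform g_mt to a watermarked target and measure BRA.

    x_sw: source image (not watermarked); x_tw: target image (watermarked);
    secret_bits: bits embedded in x_tw. `face_swap` and `decoder` are placeholders
    for the manipulation model and the watermark decoder, respectively.
    """
    x_mt = face_swap(x_sw, x_tw)                      # x_mt = g_mt(x_sw, x_tw)
    recovered = decoder(x_mt)                         # recovered bit string
    bra = 100.0 * float(np.mean(recovered == secret_bits))
    return x_mt, bra                                  # for semi-fragility, bra should be low
```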
Table 6 shows the effect of FaceSwap-based malicious transformations (based on the encoder-decoder model) on invisible watermarked target images obtained using our proposed model, in terms of BRA. All models are trained on the FF\(++\) dataset and tested on the FF\(++\) and CelebA datasets. From the table, the overall performance in terms of BRA is \(42.62\%\) for U-Net\(+\)C\(+A_{adv}\) on FaceSwap-based malicious transforms, which is lower than the \(43.49\%\) BRA of the second-best model, FaceSigns (Semi-Fragile), when both benign \(g_{bt}\) and malicious \(g_{mt}\) transformations are used during training.
A lower BRA indicates increased fragility, which is particularly valuable in the context of detecting malicious transformations like Deepfakes. The superior performance in terms of low BRA can be attributed to the malicious transform used during the training stage, which is able to mimic the facial manipulation process in which facial features are perturbed. Similarly, Table 7 shows the BRA for FaceSwap-based malicious transformations when trained on CelebA and tested on the FF\(++\) and CelebA datasets. Again, the overall performance in terms of BRA is \(39.62\%\) for U-Net\(+\)C\(+A_{adv}\) on FaceSwap-based malicious transforms, which is lower than the \(41.95\%\) BRA of the second-best model, FaceSigns (Semi-Fragile), when both benign \(g_{bt}\) and malicious \(g_{mt}\) transformations are used during training. These results are consistent for the FaceSwap model in both intra- and cross-dataset settings.
GAN-Based Malicious Transforms: For this experiment, we used three popular GAN variants, namely FSGAN [43] (identity swap), StarGAN [13], and AttGAN [24] (attribute manipulation), to generate manipulated images. These GANs are widely used for identity-, expression-, and attribute-based facial manipulation. For detailed technical descriptions and implementation details, please refer to the original papers on FSGAN [43], StarGAN [13], and AttGAN [24]; this information is not included here for the sake of space.
Table 8 shows the effect of FSGAN [43], StarGAN [13], and AttGAN [24] based facial manipulations, in terms of BRA, on invisible watermarked facial images. These manipulations are applied to all three different versions of our model, as mentioned in Table 1.
Figure 7 gives examples of sample watermarked target facial images with the identity swapped, generated by the FSGAN model. The inputs to FSGAN are the source image \(x_{sw}\) (not watermarked) and the target image \(x_{tw}\) (watermarked). The output is the maliciously transformed facial image \(x_{mt}=g_{mt}(x_{sw},x_{tw})\) with the identity swapped between the source and the target. Similarly, Figure 8 gives examples of sample watermarked facial images with attribute manipulations generated by the StarGAN and AttGAN models. The input to the StarGAN and AttGAN models is the watermarked image \(x_{w}\), and the output is the maliciously transformed facial image \(x_{mt}=g_{mt}(x_{w})\) with manipulated facial attributes such as eyeglasses, facial expression, and hair color.
All these models, including our U-Net-based and the GAN-based models, were trained on the FF\(++\) dataset and then tested on both the FF\(++\) and CelebA datasets. As can be seen from Table 8, the overall performance in terms of BRA is \(49.01\%\) for U-Net\(+\)C\(+A_{adv}\) on GAN-based malicious transforms, which is lower than the \(50.91\%\) BRA of the second-best model, FaceSigns (Semi-Fragile), when both benign \(g_{bt}\) and malicious \(g_{mt}\) transformations are used during training. These results are consistent across the FSGAN, StarGAN, and AttGAN models in both the intra- and cross-dataset settings. Similarly, in Table 9 we use the same FSGAN, StarGAN, and AttGAN models for malicious transformations, with the models trained on CelebA and tested on the FF\(++\) and CelebA datasets. Again, the overall performance in terms of BRA is \(47.39\%\) for U-Net\(+\)C\(+A_{adv}\) on GAN-based malicious transforms, which is lower than the \(49.88\%\) BRA of the second-best model, FaceSigns (Semi-Fragile), when both benign \(g_{bt}\) and malicious \(g_{mt}\) transformations are used during training.
The superior performance of the model, indicated by a low BRA, can be traced back to the inclusion of malicious transformations during the training phase. This approach allows the model to become adept at detecting and responding to manipulations akin to those encountered in real-world scenarios, such as Deepfakes and other forms of digital forgery. In addition, we trained the proposed model exclusively on malicious transformations; nonetheless, its performance is not as good as that of the model trained on both benign and malicious transformations. The reason could be that a model trained solely on malicious transformations tends to develop a narrow focus, optimizing specifically for certain types of data alterations. This specialization may limit the model's capacity to generalize across a wider array of real-world scenarios, which could include benign transformations. Lacking exposure to these benign transformations, the model may struggle to accurately differentiate between genuinely malicious modifications and benign alterations in images, resulting in decreased overall performance.
Diffusion Model-Based Malicious Transforms: In this work, we use Stable Diffusion, a latent text-to-image/image-to-image diffusion model that can take any text prompt and produce realistic-looking images.
Figure 9 shows examples of maliciously transformed facial images produced by Stable Diffusion v1.5 and Stable Diffusion Inpainting. The input to the Stable Diffusion v1.5 and inpainting models is the watermarked image \(x_{w}\), and the output is a de-noised synthetic facial image \(x_{mt}=g_{mt}(x_{w})\).
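A minimal sketch of how such diffusion-based manipulations can be produced with the diffusers library is given below; the checkpoint identifiers, prompt, strength, and file paths are assumptions for illustration rather than the exact settings used in our experiments, and a CUDA GPU is assumed.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline, StableDiffusionInpaintPipeline

# Image-to-image manipulation of a watermarked face x_w (checkpoint names are assumptions).
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
x_w = Image.open("watermarked_face.png").convert("RGB")       # watermarked input x_w
x_mt = img2img(prompt="a photo of a face", image=x_w, strength=0.6).images[0]  # x_mt = g_mt(x_w)

# Inpainting-based manipulation: regenerate only the masked facial region.
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
mask = Image.open("face_mask.png").convert("L")                # region of the face to edit
x_mt_inpaint = inpaint(prompt="a photo of a face", image=x_w, mask_image=mask).images[0]
```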
In our experiments, we used Stable Diffusion v1.5 [26] and Stable Diffusion Inpainting [26] based diffusion models, which rely on a conditioning mechanism and generative modeling of latent representations following a reverse Markov process. To make the training process more efficient and faster, we used low-rank adaptation (LoRA), a simple training method that drastically lowers the total number of trainable parameters of specific computationally complex layers. Instead of modifying the entire weight matrix \(W\) of a layer, LoRA introduces two low-rank matrices \(A\) and \(B\), which are much smaller than \(W\); \(W\) is kept frozen and only the low-rank update \(BA\) is learned, so the effective weight becomes \(W+BA\). As a result, LoRA training is considerably quicker and more memory-efficient, and it produces smaller model weights that are simpler to share and store. For a detailed technical description and implementation details, please refer to the original work [26]; these are not discussed here for the sake of space.
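A minimal PyTorch sketch of the idea (not the exact LoRA implementation used in our training pipeline) is shown below: the frozen base weight \(W\) is augmented with a learnable low-rank update \(BA\), scaled by \(\alpha/r\).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a frozen base weight W and a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the original weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # (r x d_in)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # (d_out x r), zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to applying (W + scale * B A) to x, with W frozen.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: wrap a projection layer so only the small A and B matrices are trained.
layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16.0)
y = layer(torch.randn(2, 768))
```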
Tables 10, 11, and 12 show the effect of Stable Diffusion v1.5 [26] and Stable Diffusion Inpainting [26] based malicious transformations on the invisible watermarked images in terms of BRA. The impact of these synthetic manipulations is evaluated on the different versions of our proposed model tabulated in Table 1. All the models are trained on the IMDB-WIKI dataset and tested on the FF\(++\), IMDB-WIKI, and CelebA datasets. From Table 10, the overall performance in terms of BRA is \(42\%\) for U-Net\(+\)C\(+A_{adv}\) on diffusion-based malicious transforms, which is lower than the \(52.74\%\) BRA of the second-best model, FaceSigns (Semi-Fragile), when both benign \(g_{bt}\) and malicious \(g_{mt}\) transformations are used during training. These results are consistent across the Stable Diffusion v1.5 and Stable Diffusion Inpainting models in both the intra- and cross-dataset settings.
Overall, when exposed to malicious transformations, the proposed model trained on both benign and malicious transforms exhibits a lower BRA, which is desirable. This lower BRA indicates increased fragility, a characteristic that is vital for detecting altered media generated using malicious transformations like Deepfakes. The model's performance is attributed to its training on both benign and malicious transformations, which enables it to acquire a comprehensive understanding of such transforms. We also experimented with training the model solely on malicious transformations, but its performance was notably inferior to that of the model trained on both benign and malicious transformations.