1 Introduction

In the field of information security, steganography and steganalysis are considered two important techniques [6, 10]. Steganography is used to conceal secret information (e.g. a message, a picture or a sound), known as the payload, inside another non-secret object (which can be an image, a sound or a text message), known as the cover object, such that both the existence of the secret message and its content remain hidden.

In image steganography, most prior work has focused on hiding a specific text message inside a cover image. Thus, the focus of existing techniques has been on finding either noisy regions or low-level image features such as edges [7] and textures [4] in the cover image, so as to embed the maximum amount of secret information without visibly distorting the original image.

In this work, we propose a novel and completely automatic steganography method for hiding one image inside another. To this end, we design a deep learning network that automatically identifies the best features from both the cover and payload images for merging information. The biggest advantage of our approach is that it is generic and can be used with any type of image; to validate this, we test it on a variety of publicly available datasets, including ImageNet, MNIST, CIFAR10, LFW and PASCAL-VOC12.

Overall, our main contributions are as follows: (i) we propose a generic deep-learning-based encoder-decoder architecture for image steganography; (ii) we design a new loss function that enables joint end-to-end training of the encoder and decoder networks; (iii) we perform an extensive empirical evaluation of the proposed architecture on a range of challenging publicly available datasets and report state-of-the-art payload capacity at high PSNR and SSIM values. Specifically, using our proposed algorithm we can reliably embed a single-channel image (\(m\times n\) pixels) into a color image (\(m\times n \times 3\) pixels). Our experiments show that we can achieve this payload of 33% (on average 8 bpp) with average PSNR values of 32.9 dB (SSIM = 0.96) for the hybrid (cover) image and 36.6 dB (SSIM = 0.96) for the recovered payload image.
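The quoted capacity follows directly from the bit budget of standard 8-bit images: hiding one 8-bit channel inside three 8-bit cover channels gives

$$\begin{aligned} \frac{m\times n \times 8\ \text{bits}}{m\times n \times 3 \times 8\ \text{bits}} = \frac{1}{3} \approx 33\%, \qquad \text{i.e. 8 bits of payload per cover pixel (8 bpp).} \end{aligned}$$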

Fig. 1.

Pictorial representation of the encoder and decoder network architectures. In the encoder network, the top row represents the guest branch and the bottom row the host branch.

2 Methodology

We train, end-to-end, a pair of encoder and decoder Convolutional Neural Networks (CNNs): the encoder creates a hybrid image from a pair of input images, and the decoder recovers the payload image from that hybrid image – c.f. Fig. 1 for architecture details. Here, we exploit the observation that CNN layers learn a hierarchy of image features, from low-level generic to high-level domain-specific ones. Thus, our encoder identifies the specific features of the cover image in which to hide the details of the payload image, while the decoder learns to separate those hidden features from the “hybrid” image.

Specifically, the encoder network takes two images (a “host” cover image and a “guest” payload image) as input and produces a single hybrid output image. The goal of the encoder network is to produce a hybrid image that remains visually identical to the host image while also containing the guest image content. The decoder network takes the encoder-produced hybrid image as input and recovers the guest image from it. The goal of the decoder network is to produce a reconstruction that remains visually similar to the guest image originally given to the encoder.

Let \(I_h\) and \(I_g\) denote the host and guest images given to the encoder, and let \(O_e\) and \(O_d\) denote the output hybrid image and the decoder output image respectively; then the complete loss function for the encoder and decoder networks can be written as:

$$\begin{aligned} L(I_g,I_h)=\alpha ||I_h-O_e||^2 + \beta ||I_g-O_d||^2 + \lambda (||W_e||^2 + ||W_d||^2) \end{aligned}$$
(1)

Here, \(W_e\) and \(W_d\) represent the learned weights of the encoder and decoder networks respectively, \(\alpha\) and \(\beta\) are weighting parameters for the encoder and decoder terms, and \(\lambda\) controls the weight-decay regularization. The first term in the loss function defines the encoder loss and the second the decoder loss.
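As an illustration, a minimal sketch of this joint loss in PyTorch (our choice of framework here; the paper does not prescribe one) could look as follows; the weight-decay term \(\lambda (||W_e||^2 + ||W_d||^2)\) is typically delegated to the optimizer rather than computed explicitly:

```python
import torch.nn.functional as F

def stego_loss(host, guest, hybrid, decoded, alpha=1.0, beta=1.0):
    """Joint encoder-decoder loss of Eq. (1), without the weight-decay term."""
    # Encoder term: the hybrid image O_e should stay close to the host image I_h.
    # Mean-squared error stands in for the squared L2 norm of Eq. (1); the
    # constant scaling can be absorbed into alpha and beta.
    encoder_loss = F.mse_loss(hybrid, host)
    # Decoder term: the recovered image O_d should stay close to the guest image I_g.
    decoder_loss = F.mse_loss(decoded, guest)
    # lambda * (||W_e||^2 + ||W_d||^2) is handled via the optimizer's weight_decay.
    return alpha * encoder_loss + beta * decoder_loss
```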

2.1 Encoder Architecture

At its input end, the encoder network has two parallel branches, named the guest branch and the host branch. The guest branch receives the input guest image \(I_g\) and uses a sequence of convolution and ReLU layers to decompose it into low-level (edges, colors, textures, etc.) and high-level features. The host branch receives the input host image \(I_h\) and uses a sequence of convolution and ReLU layers (except for the last layer, which has no ReLU) to decompose it into a hierarchy of feature representations and to merge the extracted representations of the guest image into the host image.

Precisely, to merge the information from the guest image, the encoder concatenates the feature maps extracted at every alternate layer of the guest branch (starting from the input) onto the corresponding output feature maps of the host branch. This procedure is repeated up to a layer of depth k (we found \(k=7\) to be the best setting); at this point the guest branch features are completely merged into the host branch and the guest branch ceases to exist. After merging, a further sequence of convolution and ReLU layers is applied before the final convolution layer, which produces the output hybrid image \(O_e\).
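To make the branch merging concrete, the following PyTorch sketch shows one possible realization of the encoder. The filter widths, kernel sizes and number of post-merge layers are our own illustrative assumptions (the actual configuration is given in Fig. 1), and the host and guest images are assumed to share the same spatial size:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Two-branch encoder sketch: guest features are concatenated onto the host
    branch at alternating depths up to depth k, after which only the host branch
    continues and emits the hybrid image O_e."""
    def __init__(self, k=7, width=32):
        super().__init__()
        self.k = k
        self.guest_convs = nn.ModuleList(
            [nn.Conv2d(1 if i == 0 else width, width, 3, padding=1) for i in range(k)])
        host_convs, in_ch = [], 3
        for i in range(k):
            host_convs.append(nn.Conv2d(in_ch, width, 3, padding=1))
            # After layers where guest features are concatenated, the next host
            # convolution sees twice as many input channels.
            in_ch = width * 2 if i % 2 == 0 else width
        self.host_convs = nn.ModuleList(host_convs)
        # A few further conv+ReLU layers after the merge, then a final conv without ReLU.
        self.post = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU())
        self.final = nn.Conv2d(width, 3, 3, padding=1)

    def forward(self, host, guest):
        g, h = guest, host
        for i in range(self.k):
            g = torch.relu(self.guest_convs[i](g))
            h = torch.relu(self.host_convs[i](h))
            if i % 2 == 0:  # merge guest feature maps on alternating layers
                h = torch.cat([h, g], dim=1)
        return self.final(self.post(h))  # hybrid image O_e
```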

2.2 Decoder Architecture

Our decoder network receives the encoder-produced hybrid image \(O_e\) as input and passes it through a sequence of convolution and ReLU layers (except for the last layer, which has no ReLU) to recover the concealed representation \(O_d\) of the guest image \(I_g\).
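A corresponding decoder sketch, again in PyTorch and with an illustrative depth and width of our own choosing, is simply a stack of convolution and ReLU layers ending in a convolution without ReLU:

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Decoder sketch: maps the 3-channel hybrid image O_e back to a
    single-channel reconstruction O_d of the guest image."""
    def __init__(self, depth=6, width=32):
        super().__init__()
        layers = [nn.Conv2d(3, width, 3, padding=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(width, 1, 3, padding=1)]  # final layer without ReLU
        self.net = nn.Sequential(*layers)

    def forward(self, hybrid):
        return self.net(hybrid)
```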

We also experimented with other design choices; however, in our initial experiments this architecture emerged as the best one. During training, the encoder and decoder are trained end-to-end using the joint loss function – c.f. Eq. (1). During testing, however, the encoder and decoder are used separately.

Table 1. Comparison of bpp, PSNR and SSIM values for different runs of our algorithm on different datasets.
Table 2. bpp, PSNR and SSIM values of our ImageNet-trained algorithm on different datasets.

3 Experiments and Results

In this section, we describe our experimental settings and report quantitative and qualitative results of our algorithm on a diverse set of publicly available datasets, namely ImageNet [1], CIFAR10 [8], MNIST [9], LFW [5] and PASCAL-VOC12 [2].

We randomly divided the images of each dataset into three subsets: training, validation and testing. All configuration choices were made using the validation set, and we report the final performance on the test set.

For the payload, we randomly select an image from the corresponding dataset and either convert it to gray-scale or simply pick a single channel from its RGB channels. For the cover, we randomly select an RGB image from the corresponding dataset.
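A minimal sketch of this payload preparation (assuming Pillow and NumPy, neither of which is mandated by the paper) is:

```python
import numpy as np
from PIL import Image

def prepare_payload(path, use_grayscale=True, channel=0):
    """Load an image and reduce it to a single channel, either by gray-scale
    conversion or by picking one of its RGB channels."""
    img = Image.open(path).convert('RGB')
    if use_grayscale:
        return np.asarray(img.convert('L'))   # m x n gray-scale payload
    return np.asarray(img)[:, :, channel]     # m x n single RGB channel
```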

For all experiments we use the same encoder and decoder architectures as explained in Sect. 2, and each input image is zero-centered. Encoder and decoder weights are randomly initialized using Xavier initialization [3]. To learn these weights we use the Adam optimizer with a fixed learning rate of 1E−4 and a batch size of 32, with the regularization parameter \(\lambda\) set to 0.0001 and \(\alpha = \beta = 1\). During each epoch, we disjointly sample cover and payload images from the training set. All filters in the CNN layers are applied with a stride of one pixel and “same” padding.
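Putting the pieces together, the training setup could be sketched as follows (PyTorch, reusing the Encoder, Decoder and stego_loss sketches above; the hyper-parameters mirror those stated in the text, everything else is illustrative):

```python
import torch

encoder, decoder = Encoder(), Decoder()

def xavier_init(m):
    # Xavier initialization for all convolution layers, as in [3].
    if isinstance(m, torch.nn.Conv2d):
        torch.nn.init.xavier_uniform_(m.weight)
        torch.nn.init.zeros_(m.bias)

encoder.apply(xavier_init)
decoder.apply(xavier_init)

# Adam, lr = 1e-4, batch size 32, weight decay (lambda) = 1e-4, alpha = beta = 1.
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4, weight_decay=1e-4)

def train_step(host_batch, guest_batch):        # batches of zero-centered images
    optimizer.zero_grad()
    hybrid = encoder(host_batch, guest_batch)   # O_e
    decoded = decoder(hybrid)                   # O_d
    loss = stego_loss(host_batch, guest_batch, hybrid, decoded, alpha=1.0, beta=1.0)
    loss.backward()
    optimizer.step()
    return loss.item()
```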

We use the Peak Signal-to-Noise Ratio (PSNR), the Structural SIMilarity (SSIM) index and bits per pixel (bpp) to report the perceptual quality of the produced images and the embedding capacity of our algorithm.
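For completeness, PSNR and bpp can be computed as sketched below (assuming 8-bit images and NumPy); SSIM can be computed with an off-the-shelf implementation such as skimage.metrics.structural_similarity:

```python
import numpy as np

def psnr(reference, test, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images in [0, max_val]."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def bpp(payload_bits, cover_width, cover_height):
    """Embedding capacity in bits per cover pixel, e.g. an 8-bit m x n payload
    hidden in an m x n cover gives 8 bpp."""
    return payload_bits / (cover_width * cover_height)
```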

For our initial experiment, we used cover images (\(32\times 32\times 3\)) from CIFAR10, while payload images (\(28\times 28\times 1\)) were taken from the MNIST dataset. In this experiment, we were able to hide a payload of approximately 29.1% (i.e. 7 bpp) in our cover images, with average PSNR values of 32.85 dB and 32.0 dB for the images produced by the encoder and decoder networks respectively – c.f. Table 1. These results show that, using our algorithm, we can successfully hide a large payload with reasonably high PSNR and SSIM values. To the best of our knowledge, no comparable results have been reported on this dataset.

However, MNIST is a relatively simple dataset, as the majority of pixels in each image belong to a plain, uniform background. Thus, we conducted another experiment with identical experimental settings on the CIFAR10 dataset, which, being a dataset of natural image classes, contains much larger variation in both foreground and background regions.

Here, both cover (\(32\times 32\times 3\)) and payload (\(32\times 32\times 1\)) images were randomly and disjointly sampled from the CIFAR10 training set. We were able to hide a payload of 33.3% (i.e. 8 bpp) in our cover images, with average PSNR values of 30.9 dB and 29.9 dB for the images produced by the encoder and decoder networks respectively.

From these experiments, we can conclude that our proposed algorithm is highly generic and that, using the same architecture, one can reliably achieve large payloads and acceptable PSNR values for complex images as well – c.f. Table 1. For both of these experiments we ran our algorithm for 50 epochs.

To further consolidate our findings and to evaluate our algorithm’s embedding capacity and reconstruction performance on larger images, we designed another experiment using the ImageNet dataset. A subset of 8,000 images was randomly chosen from the one million ImageNet images and divided into two disjoint sets: training (6,000 images) and testing (2,000 images); no validation set was used here, since we reuse the settings of the earlier experiments. To allow uniformly sized cover and payload images, all of these images were resized to \(300\times 300\) pixels. For the initial version of this experiment, and to ensure a fair comparison with the results above, we first ran our algorithm for 50 epochs.

For cover (\(300\times 300\times 3\)) and guest (\(300\times 300\times 1\)) images randomly sampled from our ImageNet test set, we were able to hide a payload of 33.3% (i.e. 8 bpp) in our cover images, with average PSNR values of 29.6 dB and 31.3 dB for the images produced by the encoder and decoder networks respectively. Since we were able to hide a high payload at PSNR values similar to the earlier experiments on this more complex dataset as well, we further explored different experimental settings.

Our final model on ImageNet was trained for 150 epochs, further improving the PSNR values of the encoder and decoder outputs from 29.6 dB and 31.3 dB to 32.92 dB (SSIM = 0.96) and 36.58 dB (SSIM = 0.96) respectively, while maintaining a similar payload capacity of 33.3% (on average 8 bpp) – c.f. Table 1.

To further evaluate the generalization capacity of our algorithm, we ran the ImageNet-trained model on samples of 1,000 unseen images from the PASCAL-VOC12 [2] and Labelled Faces in the Wild (LFW) [5] datasets. Table 2 shows the results of this experiment. Even though our algorithm was trained on a different dataset, it is still able to achieve a high payload capacity at high PSNR and SSIM values, which demonstrates its generalization capabilities.

Figure 2 shows sample result images from the LFW, PASCAL-VOC12 and ImageNet datasets. Here, once again, qualitative analysis confirms that our method is able to conceal and recover unseen, complex payload images.

Therefore, given this quantitative and qualitative analysis, we conclude that our algorithm is generic and robust to complex backgrounds and variations in object appearance, and thus can be reliably used for image steganography.

Fig. 2.

Sample results of our algorithm on LFW (top row), PASCAL-VOC12 (middle row) and ImageNet (bottom row) images. In each subfigure, the first column shows the cover image \(I_h\), the second the payload \(I_g\), the third the hybrid image \(O_e\), and the fourth the recovered guest image \(O_d\).

4 Conclusions

In this paper, we have presented a novel CNN-based encoder-decoder architecture for image steganography. In contrast to earlier methods, which only consider a binary representation as the payload, our algorithm takes an image directly as the payload and uses a pair of encoder-decoder networks to embed it in, and robustly recover it from, a cover image. To the best of our knowledge, no such earlier work exists, and we are the first to introduce this approach for image-in-image hiding using deep neural networks. We have performed extensive experiments and empirically demonstrated the strength of our proposed method by showing excellent results with high payload capacity on a wide range of natural image datasets.