1 Introduction

In today’s era of digital technology, digital images have become a primary carrier of information. At the same time, digital images are involved to an unprecedented degree in the spread of misinformation and fake news on social media platforms [33]. When an image carrying false or misleading information goes viral on social media, it can disrupt social harmony. Moreover, political parties frequently use social media for election campaigning. A study conducted by Machado et al. [16] observed that a larger share of fake posts circulates on these platforms during such campaigns; it revealed that 13.1% of WhatsApp posts were fake during the Brazilian presidential elections. Recently, countless fake posts related to the novel coronavirus and the COVID-19 (coronavirus disease 2019) pandemic were shared globally on social media [11]. There are various ways to tamper with an image; the most common types of forgeries are copy-move forgery [17] and image splicing forgery [18]. Over the last decades, several methods [1, 19, 20] were developed to detect these forgeries.

The invention of Computer Generated (CG) imagery has enabled various new technologies such as virtual reality, 3D gaming, and VFX (Visual Effects). These technologies are widely used in the film industry, education, and medicine. Although CG images have many beneficial applications, CG images created with malicious intent can cause serious problems. The problem becomes worse when a CG image is highly photorealistic, as human eyes cannot differentiate between a CG image and an actual photographic (PG) image. GAN (Generative Adversarial Network) based tools can generate CG images with high photorealism; a good collection of GAN-generated CG images is available at www.thisartworkdoesnotexist.com. CG detection has therefore become an open area of research. In the past, a good number of techniques were presented to distinguish CG from PG images. Recently, Meena and Tyagi [21] surveyed the existing methods developed to distinguish CG images from PG images. This survey discussed various methods from the literature and grouped them into four classes: acquisition process based, visual feature based, statistical feature based, and deep learning based.

Methods based on deep Convolutional Neural Networks (CNNs) have achieved unprecedented success in image classification. A series of deep CNNs has been proposed in the last few years to solve different challenging problems. This paper proposes a deep learning based technique to discriminate between CG and PG images. The contributions of this paper are twofold: first, a fully automated model based on the deep CNN DenseNet-201 and transfer learning is proposed to avoid the laborious task of designing hand-crafted features; second, to the best of our knowledge, this is the first time the DenseNet-201 network has been used to solve this task. The proposed technique shows comparatively better detection accuracy and lower time complexity.

2 Related Works

The recent survey paper [21] discussed a total of 52 state-of-the-art techniques from the literature; therefore, only a brief summary of related works is presented in this section. The existing techniques to identify PG and CG images can be categorized as traditional (hand-crafted feature based) or deep learning based. Hand-crafted feature based techniques have two basic steps: feature extraction and classification. In the past, authors have explored various feature extraction mechanisms and classifiers to improve the detection accuracy of their methods. Conversely, in deep learning based techniques, the image features are learned by a neural network, and feature extraction and classification are generally performed jointly by the CNN.

Lyu et al. [15] designed a statistical model based on first-order and higher-order wavelet statistics to identify CG and PG images. Two supervised machine learning methods, linear discriminant analysis and the Support Vector Machine (SVM), were used for the classification task. A low CG detection rate (71%) was the main drawback of this method. Wu et al. [35] put forward a technique based on histogram features to solve this problem. This method achieved a good detection accuracy of up to 95.3%; however, it was evaluated on a comparatively small image dataset. Fan et al. [6] employed the contourlet transform to discriminate between CG and PG images. This method follows a statistical model similar to [15], but uses the contourlet transform in place of the wavelet transform. The authors also recommended the HSV color model to improve detection accuracy. Wang et al. [34] designed a technique based on the color quaternion wavelet transform. Recently, Meena and Tyagi [22] developed an approach to distinguish CG and PG images based on the Tetrolet transform and a neuro-fuzzy classifier.

Cui et al. [4] proposed a deep learning based approach to distinguish CG and PG images. This approach first applies high-pass filters to pre-process all the images in the dataset; the model is then trained on the pre-processed images. He et al. [9] combined a CNN with a recurrent neural network to detect CG and PG images. The detection accuracy of this method was 93.87% on an image dataset comprising 6,800 CG and 6,800 PG images. A CNN based framework to classify CG and PG images was introduced by He [8]; in this method, the author explored two different networks, VGG-19 and ResNet-50. Rezende et al. [30] proposed a deep learning based model that uses the ResNet-50 network as a feature extractor; for classification, several classifiers such as softmax, SVM, and k-nearest-neighbor were investigated. Meanwhile, Quan et al. [29] developed a CNN model to classify CG and PG images; this model was trained from scratch on the Columbia image dataset [25]. Recently, Ni et al. [26] presented a comprehensive survey of deep learning based CG detection methods.

3 The Proposed Technique

An overview of the proposed technique is presented in Fig. 1; the following subsections describe its major steps.

Fig. 1. Overview of the proposed technique (Color figure online)

3.1 Pre-processing

Generally, image datasets comprise images of various pixel resolutions, whereas the proposed technique works on images of size 224 × 224 pixels. The only reason for selecting this size is that DenseNet-201 is trained on images of 224 × 224 pixels. Therefore, as pre-processing, we resize all the images in the DSTok dataset [32] before training the network. Similar to [30], the mean RGB value computed over the ImageNet dataset is subtracted pixel-wise from each image in the DSTok dataset.
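As an illustration, a minimal pre-processing sketch in Python is given below. The per-channel ImageNet mean RGB values used here are the commonly published ones (as popularized by the VGG models) and are an assumption, since the exact values used in our experiments are not listed above.

```python
# A minimal pre-processing sketch (resize + ImageNet mean subtraction).
# The mean RGB values below are the commonly used ImageNet means and are
# assumed here; they may differ slightly from the exact values we used.
import numpy as np
from PIL import Image

IMAGENET_MEAN_RGB = np.array([123.68, 116.779, 103.939], dtype=np.float32)

def preprocess(path):
    """Resize an image to 224 x 224 pixels and subtract the mean RGB."""
    img = Image.open(path).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32)
    return x - IMAGENET_MEAN_RGB  # broadcast over the channel axis
```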

3.2 DenseNet-201 Network

Several deep CNNs are available online together with pre-trained models. Some of the popular deep CNNs are AlexNet (2012) [29], VGG-16 (2014) [8], VGG-19 (2014) [8], ResNet-50 (2015) [30], Inception-v3 (2015) [10], Xception (2016) [10], DenseNet-121 (2017) [10], and DenseNet-201 (2017) [10]. These networks have found a large number of applications in areas such as data science, image processing, computer vision, and digital image forensics [21]. More specifically, VGG-19 and ResNet-50 have been used for CG detection in [8] and [30], respectively. Recently, Cui et al. [4] also developed a CNN model for CG and PG image classification. In that work, many experiments were performed with a varying number of hidden layers, and based on these experiments the authors suggested that detection accuracy can be improved by using a CNN with more layers. Therefore, in this paper, we try to solve the problem of CG detection by utilizing the recently proposed very deep CNN DenseNet-201.

The complete layer-wise architecture of DenseNet-201 is shown as a gray rectangular box in the left part of Fig. 1. This network comprises a total of 201 layers, organized into four dense blocks and three transition layers. Each dense block is a combination of a different number of convolutional layers, whereas each transition layer contains a 1 × 1 convolutional layer followed by a 2 × 2 average pooling layer. The second last layer is a 7 × 7 global average pooling layer. Finally, the last layer is a fully-connected softmax layer with 1000 neurons, as this network was primarily designed to classify images into 1000 categories.
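For reference, the pre-trained network can be loaded and inspected in Keras (the library reported in Sect. 4.3); the snippet below is a sketch, assuming internet access to download the ImageNet weights.

```python
# Load the pre-trained DenseNet-201 and inspect its layer-wise structure.
from keras.applications import DenseNet201

# Full network with the 1000-way softmax top, as trained on ImageNet.
model = DenseNet201(weights="imagenet", input_shape=(224, 224, 3))
model.summary()  # lists the dense blocks, transition layers, global
                 # average pooling, and the final softmax layer
```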

3.3 Transfer Learning

Two main challenges arise when training a deep CNN based technique from scratch. First, an enormous amount of data is required to train the model effectively; if the model is trained on little data, it may show signs of overfitting. Second, training the model on a very large dataset requires huge computational power and time: it may require multiple high-power Graphics Processing Units (GPUs) with a large amount of physical memory, and even then the training of a deep CNN may take several hours or days. To overcome these two limitations, the concept of transfer learning has gained much attention in the last few years. In transfer learning, the parameters of a pre-trained neural network (source network) that was trained for one particular task are transferred to a new neural network (target network) designed to solve a somewhat similar task.

During transfer learning, there is always scope for choosing how many layers’ parameters are reused. The proposed technique uses the very deep CNN DenseNet-201, which comprises a total of 201 layers. Since training such a deep CNN from scratch is impractical, we use the weights of the first 200 layers of the pre-trained DenseNet-201. The transferred part of DenseNet-201 is denoted by the red dotted box in Fig. 1. Note that the DenseNet-201 network was trained on the very large ImageNet dataset [5], which comprises over 1.28 million images for object classification into 1000 classes.
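A minimal sketch of this transfer step is shown below: dropping the 1000-way softmax top and freezing the remaining transferred layers turns DenseNet-201 into a fixed feature extractor. The 1920-dimensional output size follows from the network’s final global average pooling layer.

```python
# Transfer-learning sketch: reuse all DenseNet-201 layers except the
# final softmax, and freeze the transferred weights.
from keras.applications import DenseNet201

# include_top=False drops the 1000-way softmax; pooling='avg' keeps the
# global average pooling, so each image maps to a 1920-D feature vector.
extractor = DenseNet201(weights="imagenet", include_top=False,
                        pooling="avg", input_shape=(224, 224, 3))
for layer in extractor.layers:
    layer.trainable = False  # no fine-tuning of transferred parameters

# features = extractor.predict(images)  # shape: (num_images, 1920)
```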

3.4 Classifier

The problem of distinguishing CG and PG images is a binary classification problem; therefore, we employ a non-linear binary SVM classifier in place of the last layer of DenseNet-201, which is a fully connected softmax layer with 1000 neurons. Although several variants of the SVM exist, we use a non-linear binary SVM with the Radial Basis Function (RBF) kernel. An SVM has two hyperparameters, cost (C) and gamma (γ), that need to be set appropriately to manage the trade-off between bias and variance: a large value of γ makes the decision boundary sensitive to individual training samples, which lowers bias but can cause overfitting, while a small value of γ has the opposite effect; similarly, a large value of C penalizes misclassifications heavily, reducing bias at the cost of higher variance. The optimum values of these two hyperparameters can be found using the grid-search method. Based on our experiments, we set C to 10.0 and γ to 0.001.
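The grid search can be performed, for instance, with scikit-learn; the following sketch uses illustrative grids for C and γ (the grids themselves are assumptions, while the selected values C = 10.0 and γ = 0.001 are those reported above).

```python
# Grid search over the RBF-SVM hyperparameters C and gamma.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1.0, 10.0, 100.0],       # illustrative grid
              "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}  # illustrative grid
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(train_features, train_labels)
# print(search.best_params_)  # we obtained C=10.0 and gamma=0.001
```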

4 Experimental Results

4.1 Datasets

The availability of image datasets plays a crucial role in evaluating and analyzing the performance of any technique. Very few image datasets are available to assess the effectiveness of methods developed to discriminate between CG and PG images. Ng et al. created the Columbia image dataset [25] in 2004 to evaluate their CG detection method. Most of the early works were evaluated only on this dataset, because no other image dataset was available until 2013. This dataset has two main drawbacks: first, it contains relatively few images (800 CG and 800 PG), and second, its CG images are less photorealistic. Since deep learning based methods need more image data to train effectively, the Columbia dataset is less suitable for evaluating the proposed technique. For this reason, the proposed approach is assessed on the well-designed image dataset created by Tokuda et al. [32], commonly referred to in the literature as the ‘DSTok’ dataset.

The DSTok dataset contains 4,850 CG images and 4,850 PG images. The computer graphics images in this dataset exhibit higher photorealism than those in the Columbia dataset. The CG images were collected from various sources, such as gaming websites and screenshots of recent 3D computer games. All the images are stored in JPEG format, and their file sizes vary from 12 KB to 1.8 MB. Figure 2 shows some example images from the DSTok dataset; the top row illustrates CG images, whereas the bottom row shows PG images.

Fig. 2. Sample images from the DSTok dataset [32]; top row: computer generated images; bottom row: photographic images

4.2 Validation Protocol and Evaluation Metrics

Due to hardware limitations, it is impractical to evaluate the proposed technique on the DSTok images at their actual sizes. Thus, all the images in the DSTok dataset are resized to 224 × 224 pixels for all the experiments. The 5-fold cross-validation approach, similar to [32] and [30], is used to analyze the proposed technique; note that the authors of [32] and [30] also used resized images of 224 × 224 pixels. All 9,700 images (4,850 CG and 4,850 PG) are partitioned into five folds of equal size, so each fold comprises 1,940 images. In each cross-validation step, four folds are used to train the model and the remaining fold is used to test it.
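A sketch of this protocol is given below, assuming the DenseNet-201 features and binary labels (1 = CG, 0 = PG; the encoding is an assumption) have already been computed for all 9,700 images.

```python
# 5-fold cross-validation sketch over precomputed DenseNet-201 features.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cross_validate(features, labels, n_splits=5):
    """Train on four folds and test on the held-out fold, five times."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    fold_scores = []
    for train_idx, test_idx in skf.split(features, labels):
        clf = SVC(kernel="rbf", C=10.0, gamma=0.001)
        clf.fit(features[train_idx], labels[train_idx])
        fold_scores.append(clf.score(features[test_idx], labels[test_idx]))
    return np.mean(fold_scores), np.std(fold_scores)
```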

The proposed technique is assessed using three metrics [3, 9]: True Positive Rate (TPR), True Negative Rate (TNR), and detection accuracy. These metrics are defined in Eqs. 1–3.

$$ \text{TPR} = \frac{\text{number of correctly detected CG test images}}{\text{total number of CG test images}} $$
(1)
$$ \text{TNR} = \frac{\text{number of correctly detected PG test images}}{\text{total number of PG test images}} $$
(2)
$$ \text{Detection accuracy (Acc)} = \frac{\text{TPR} + \text{TNR}}{2} $$
(3)

The TPR represents the detection rate of computer generated images, and the TNR represents the detection rate of photographic images. The detection accuracy is the simple mean of TPR and TNR.
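For concreteness, the three metrics can be computed from the test predictions as follows (a small sketch, again assuming label 1 denotes CG and label 0 denotes PG).

```python
# Compute TPR, TNR, and Acc (Eqs. 1-3) from true and predicted labels.
import numpy as np

def evaluate(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)  # CG detection rate, Eq. 1
    tnr = np.mean(y_pred[y_true == 0] == 0)  # PG detection rate, Eq. 2
    acc = (tpr + tnr) / 2.0                  # detection accuracy, Eq. 3
    return tpr, tnr, acc
```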

The ROC (Receiver Operating Characteristic) curve provides important visual information about a binary classification model; it plots the true-positive rate against the false-positive rate. The AUC (Area Under the Curve) value is also used as an evaluation metric to quantify the effectiveness of a binary classification model.
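With scikit-learn, the ROC curve and AUC can be obtained as sketched below; using the SVM’s signed decision values as classification scores is an assumption, since the scoring function is not specified above.

```python
# ROC curve and AUC from a trained SVM's decision values.
from sklearn.metrics import roc_curve, auc

def roc_auc(clf, test_features, test_labels):
    scores = clf.decision_function(test_features)  # continuous scores
    fpr, tpr, _ = roc_curve(test_labels, scores)   # ROC operating points
    return fpr, tpr, auc(fpr, tpr)                 # area under the curve
```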

4.3 Implementation Details

The proposed technique is implemented using the Python deep learning library Keras v2.2.4 with Python v3.6.10, with TensorFlow-GPU v1.13.1 as the backend. All experiments were run on a computer with 16 GB RAM and an NVIDIA Quadro RTX 4000 GPU.

4.4 Results of the Proposed Technique

The detection accuracy and training time of the proposed technique are reported in Table 1. The average detection accuracy is 94.12%, and the average training time of the model is 835.30 s on the DSTok dataset. As the DSTok dataset contains a total of 9,700 images, the average time to process an image of size 224 × 224 pixels is only 0.0861 s; therefore, the proposed technique can distinguish CG from PG images in real time. The ROC curve, shown in Fig. 3, also indicates encouraging performance: the obtained AUC is 0.9486 with a very small standard deviation of ±0.0181. The small standard deviation indicates the stability of our approach. Moreover, the learning curve of the proposed technique is shown in Fig. 4, where the cross-validation score increases with the size of the training dataset. Hence, the accuracy of the proposed technique can likely be enhanced further if the size of the training dataset is increased.

Table 1. Detection accuracy and training time of the proposed technique
Fig. 3. ROC curve of the proposed technique on the DSTok dataset

Fig. 4. Learning curve of the proposed technique on the DSTok dataset

4.5 Comparison and Analysis of Results

The proposed technique distinguishes between CG and PG images with a detection accuracy of 94.12%. Comparative results of our technique against existing techniques are reported in Table 2. A total of 16 techniques are considered in this comparison; two of them, [8] and [30], are based on deep learning, whereas the remaining 14 are based on hand-crafted features. As the validation protocol and experimental setup of the proposed technique are exactly the same as in [32], the results of all 14 hand-crafted feature based techniques are taken from [32], whereas the results of [8] and [30] are taken from their respective original articles.

Table 2. Result comparison with the existing techniques that were proposed to distinguish between CG and PG images

The rows in Table 2 are sorted by Acc in increasing order. The TPR and TNR values for the technique proposed by Rezende et al. [30] were not provided in that paper, so only its detection accuracy is reported; this technique shows the second best detection accuracy among all the referenced techniques. The detection accuracy of the proposed technique is greater than that of all the reported techniques. The proposed technique obtains TPR and TNR values of 93.6% and 94.6%, respectively; simultaneously high values of these two metrics indicate that the proposed technique behaves in a balanced manner, predicting accurately in both categories. Additionally, the proposed technique can be used for real-time classification of CG and PG images.

5 Conclusion

The challenge of differentiating computer generated images from photographic images grows with the development of multimedia tools, and the techniques proposed so far have become less effective at addressing this problem. This paper introduced a deep learning based technique to address it: the very deep convolutional neural network DenseNet-201 is used as a feature extractor, and a support vector machine is applied as the classifier. The proposed technique achieved a detection accuracy of 94.12% on the DSTok dataset, which is higher than the detection accuracies of the existing techniques in the literature.

Additionally, the proposed technique can be used in real-time applications, as it processes an image of size 224 × 224 pixels in 0.0861 s. In the future, the detection accuracy of the proposed technique may be improved further by training the model on a larger dataset. Furthermore, the proposed technique can be extended to classify computer generated and photographic images even when the images are post-processed by operations such as noise addition, image blurring, and contrast enhancement.