Privacy-preserving human action recognition as a remote cloud service using RGB-D sensors and deep CNN

https://doi.org/10.1016/j.eswa.2020.113349

Abstract

Cloud-based expert systems are emerging rapidly. However, in practice, data owners and cloud service providers are not in the same trusted domain. For the sake of data privacy, sensitive data usually has to be encrypted before outsourcing, which makes effective cloud utilization a challenging task. Taking this concern into account, we propose a novel cloud-based approach to securely recognize human activities. A few schemes exist in the literature for secure recognition. However, they suffer from the problem of constrained data and are vulnerable to re-identification attacks, in which advanced deep learning models are used to predict a subject's identity. We address these problems by considering color and depth data and securing them using a position-based superpixel transformation. The proposed transformation actively injects additional noise while resizing the underlying image, which achieves a higher degree of obfuscation. Further, instead of securing the complete video, we secure only four images, that is, one motion history image and three depth motion maps, which greatly reduces the data overhead. Recognition is performed using a four-stream deep Convolutional Neural Network (CNN), where each stream is based on the pre-trained MobileNet architecture. Experimental results show that the proposed approach is the best candidate in the "security-recognition accuracy (%)" trade-off among other image obfuscation schemes as well as the state of the art. Moreover, a number of security tests and analyses demonstrate the robustness of the proposed approach.

Introduction

The evolution of deep learning has strengthened expert systems to perform human action recognition very precisely (Ronao & Cho, 2016). Human action recognition has been studied for decades and is still a very popular topic due to broad real-world applications, such as video retrieval, visual surveillance, human-computer interaction, and robotics for human behavior characterization (Mabrouk & Zagrouba, 2018). However, rich hardware and software resources along with a team of specialized personnel are required to maintain an expert system for human action recognition. This restricts the real-world utility of such systems, as mid-sized organizations are rarely able to maintain an in-house system for long. Recently, cloud service providers have addressed this problem by introducing pay-as-you-go models; examples include Azure Machine Learning,1 IBM Watson Machine Learning,2 etc. These services relieve users of infrastructure maintenance responsibilities by letting them outsource their data to deep learning based services hosted on the cloud. However, the massive data collection required for deep learning raises privacy issues.

Personal and highly sensitive user data, such as photos and video recordings, are stored indefinitely by the companies that collect them. Images and video recordings often contain accidentally captured sensitive items, including faces, license plates, computer screens, etc., which lead to privacy loss. For example, an organization that may want to apply cloud-based deep learning techniques to identify suspicious actions in its critical areas is prevented by privacy concerns from sharing its surveillance data and thus benefiting from large-scale deep learning. Emerging privacy-paradox studies, (Ooi, Hew, & Lin, 2018) and (Hew, Tan, Lin, & Ooi, 2017), have also shown ambiguous user behavior regarding privacy. In this situation, privacy and confidentiality restrictions significantly reduce the use of the facilities offered by cloud service providers. To overcome this problem, strong cryptographic techniques can be used to secure user data before sending it to the cloud. This works well for secure storage; however, it introduces the challenge of processing data in encrypted form, known as the Encrypted Domain (ED). We address these problems and propose a novel approach for the secure outsourcing of user data to a cloud-based expert system.

The proposed approach ensures data privacy and enables users to access a deep learning model for human action recognition over the cloud. It can be applied in a wide range of situations where surveillance data contains sensitive information, including secure and automatic identification of suspicious human actions in parking lots, Intensive Care Units (ICUs) and other sensitive places, secure monitoring of human actions at traffic signals and bank counters, Active and Assisted Living (AAL) systems for smart homes, etc.

In the context of multimedia data, various methods exist to assure image privacy, including mosaicing, blurring, scrambling, and encryption. Each method is unique and finds its scope in situation-specific solutions. For example, automatic face blurring was introduced by YouTube3 in 2012: the company preserved identity by blurring human faces in video content, for example, securing the identity of activists involved in a protest march. However, this approach does not fulfill our privacy goal, as there can be multiple video frames with similar visual information that may enable an adversary to obtain significant information from non-protected parts, such as background objects. The next method, image mosaicing, secures image information by creating large pixel-like patches, known as pixelation. There is a risk, though, as some recent attack models have concentrated on re-identifying identity-related information from obfuscated image parts (McPherson, Shokri, & Shmatikov, 2016). The remaining two methods, image scrambling and encryption, focus on securing the full image information with standard cryptographic methods. However, pixel information is distorted to such an extent that data processing becomes infeasible.

In this paper, our primary objective is to design a novel approach for full image obfuscation such that automatic human action recognition can be performed over the cloud without revealing any identity-related information. We achieve this by transforming the secret image into an extremely low resolution, e.g. from 224 × 224 to 14 × 14, using a position-based superpixel transformation. The proposed transformation concentrates a group of pixels into a single composition value and accumulates random noise on top of it. As a result, the identity-related spatial information of the underlying image is greatly reduced, making it challenging for an adversary to relate two obfuscated images. This effectively removes the possibility of a re-identification attack. Compared to existing image obfuscation methods, the proposed approach improves image security while retaining higher recognition accuracy in the "security-recognition accuracy (%)" trade-off relation.
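To convey the general idea (not the exact composition rule or noise model of the proposed transformation, which are defined later in the paper), the following minimal sketch collapses each 16 × 16 block of a 224 × 224 image into one superpixel value and perturbs it with random noise; all parameter values are illustrative assumptions.

```python
# Illustrative sketch only: block-wise pixel grouping plus additive noise.
# The paper's actual position-based composition and noise model differ.
import numpy as np

def obfuscate(image: np.ndarray, out_size: int = 14,
              noise_scale: float = 8.0, rng=None) -> np.ndarray:
    """Downsample a (224, 224) or (224, 224, C) image to out_size x out_size
    superpixels and add zero-mean random noise."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    bh, bw = h // out_size, w // out_size          # block size, e.g. 16 x 16
    img = image[:bh * out_size, :bw * out_size].astype(np.float64)
    # Group the pixels of each block and collapse them to a single value.
    blocks = img.reshape(out_size, bh, out_size, bw, -1)
    superpixels = blocks.mean(axis=(1, 3))
    # Accumulate random noise so repeated obfuscations of the same image differ.
    noisy = superpixels + rng.normal(0.0, noise_scale, superpixels.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8).squeeze()

# Two obfuscations of the same frame are not identical:
frame = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
a, b = obfuscate(frame), obfuscate(frame)
print(a.shape, np.array_equal(a, b))   # (14, 14, 3) False
```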

Recently, a few research works, (Dai, Saghafi, Wu, Konrad, & Ishwar, 2015), (Ryoo, Kim, & Yang, 2018), (Ryoo, Rothrock, Fleming, & Yang, 2017) and (Chou et al., 2018), have been proposed that use extremely low resolution images to achieve privacy. Their main idea is to resize the original high resolution image into an extremely low resolution so that sensitive information, such as the face of a person or the number plate of a vehicle, is obfuscated. The resized training samples are then used to train a deep learning model for secure recognition. However, some drawbacks are observed, which are described as follows -

  • 1.

    Intrinsic information is lost due to the extremely low resolution, restricting the model to learn in a constrained manner, as only RGB (Red-Green-Blue) images are considered, and

  • 2.

    There is a risk to privacy, as the resulting low resolution images are generated in an analogue manner that falls in the category of full image mosaicing. This makes it possible for the cloud service provider to identify similar images, even if they are claimed to be secure. Recently, McPherson et al. (2016) described this as a potential risk in their article entitled Defeating Image Obfuscation with Deep Learning by demonstrating how deep learning models can be used to re-identify sensitive information. The authors reported high re-identification accuracies over four standard datasets, namely MNIST, CIFAR-10, AT&T, and FaceScrub.

We address the first drawback by utilizing RGB and depth data with a four-stream deep Convolutional Neural Network (CNN). Considering the several advantages of depth data over RGB alone, we use depth data as the second modality to overcome the constrained-data problem.

In the context of the re-identification attack discussed as the second drawback, we propose a non-invertible position-based superpixel transformation for image obfuscation. McPherson et al. (2016) demonstrated that neural networks can automatically discover relevant features and learn to exploit correlations in obfuscated images. This problem is aggravated for the low resolution images used in state-of-the-art schemes, as they are generated in an analogue manner. In contrast, unlike the existing privacy-preserving schemes of (Dai et al., 2015), (Ryoo, Kim, & Yang, 2018), (Ryoo, Rothrock, Fleming, & Yang, 2017) and (Chou et al., 2018), which use simple image resizing for obfuscation, the proposed transformation accumulates random noise. Due to this, an adversary cannot link two encrypted images generated from the same secret image. To summarize, the major contributions of this paper are as follows -

  • Compared to the state-of-the-art works of (Dai et al., 2015), (Ryoo, Kim, & Yang, 2018), (Ryoo, Rothrock, Fleming, & Yang, 2017) and (Chou et al., 2018), the proposed transformation accumulates additional noise. This results in improved security, the robustness of which is validated using several statistical and differential tests.

  • Unlike previous schemes that utilize only RGB data, we use depth maps in integration with RGB as the second modality. Therefore, for a video V of t frames, only four images, that is, one Motion History Image (MHI) and three Depth Motion Maps (DMMs), are secured by transforming them into extremely low resolution (a minimal sketch of how these descriptors are commonly computed follows this list). As a result, the data overhead is significantly reduced.

  • A four-stream deep CNN is used, with one stream for the MHI and one for each DMM. The respective outputs are then fused, resulting in more accurate recognition.

  • The proposed approach outperforms other image obfuscation methods in the "security-recognition accuracy (%)" trade-off relation.
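As background on the two descriptors mentioned above (the sketch referenced in the list), the following minimal code follows the common formulations: a Bobick-and-Davis style temporal template for the MHI, and DMMs accumulated from the front, side, and top projections of each depth frame. The decay rate, quantisation, thresholds, and projection details are illustrative assumptions, not necessarily the exact settings used in this paper.

```python
# Minimal sketches of MHI and DMM computation under standard formulations.
import numpy as np

def motion_history_image(frames, tau=255, delta=30):
    """MHI from grayscale frames: pixels that just moved get value tau,
    older motion decays linearly toward zero."""
    mhi = np.zeros(frames[0].shape, dtype=np.float64)
    for prev, curr in zip(frames[:-1], frames[1:]):
        moving = np.abs(curr.astype(int) - prev.astype(int)) > delta
        mhi = np.where(moving, tau, np.maximum(mhi - 1, 0))
    return mhi.astype(np.uint8)

def depth_motion_maps(depth_frames, num_bins=256):
    """Accumulate absolute differences of the front/side/top projections
    of consecutive depth frames into three DMMs."""
    def project(d):
        # Front view: the depth map itself. Side/top: binary occupancy of
        # the quantised depth values along rows/columns.
        d = np.clip(d, 0, num_bins - 1).astype(int)
        side = np.zeros((d.shape[0], num_bins))
        top = np.zeros((num_bins, d.shape[1]))
        rows, cols = np.nonzero(d)
        side[rows, d[rows, cols]] = 1
        top[d[rows, cols], cols] = 1
        return d, side, top

    dmms = None
    prev = project(depth_frames[0])
    for frame in depth_frames[1:]:
        curr = project(frame)
        # A small noise threshold is often applied here; omitted for brevity.
        diffs = [np.abs(c - p) for c, p in zip(curr, prev)]
        dmms = diffs if dmms is None else [a + d for a, d in zip(dmms, diffs)]
        prev = curr
    return dmms  # [DMM_front, DMM_side, DMM_top]
```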

We review the existing privacy-preserving schemes in Section 2. A brief overview of the proposed approach is provided in Section 3. Section 4 presents a detailed description of the proposed transformation, followed by the complete methodology and model description in Section 5. Recognition results and security analysis are presented in Sections 6 and 7, respectively. An analysis of the data efficiency achieved by the proposed approach is discussed in Section 8. Section 9 provides a comprehensive discussion of all results reported in the paper, along with future directions. Finally, the paper is concluded in Section 10.

Section snippets

Related work

In this section, we first provide a brief overview of existing schemes that support secure data processing and then move towards the recent developments for privacy-preserving human action recognition.

System overview

The proposed approach is designed to run in a cloud environment, and the operations required for its functioning are divided between the trusted domain and the cloud server. Any user-accessible device, located in the actual field where the multimedia is temporarily stored and secured for further transmission to the cloud server, falls in the category of the trusted domain. The remaining tasks, that is, secure storage and recognition, are performed at the cloud server.

The cloud server needs to deploy a

Position-based superpixel transformation for image obfuscation

In order to secure image information, we propose a position-based superpixel transformation f. Superpixel segmentation is the process of clustering connected pixels with similar features in an image so that an abstract image can be obtained. It finds wide scope in various applications such as medical image segmentation (Kitrungrotsakul, Han, & Chen, 2015), image retrieval (Haas, Donner, Burner, Holzer, & Langs, 2011; Stutz, Hermans, & Leibe, 2018), dataset annotation (Liu et al., 2013), etc. However, unlike

Privacy-preserving human action recognition as a service

The two primary objectives of the proposed approach are: (i) obfuscating image information in the trusted domain, and (ii) performing secure human action recognition over the cloud server. The functionality is organized into two phases, which are described as follows -

Recognition results

In this section, we provide network details and validate the effectiveness of the proposed approach on the standard UTD-MHAD dataset.
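The exact network configuration is given in the paper; as a non-authoritative sketch of how a four-stream classifier with late score fusion can be assembled, the code below uses torchvision's MobileNetV2 as a stand-in for the pre-trained MobileNet (v1) backbone. The upsampling step, class count (UTD-MHAD's 27 actions), and averaging fusion rule are assumptions made only for this example.

```python
# Illustrative four-stream sketch; MobileNetV2 substitutes for MobileNet (v1).
import torch
import torch.nn as nn
from torchvision import models


class FourStreamActionNet(nn.Module):
    """Four independent streams (1 MHI + 3 DMMs), fused by score averaging."""

    def __init__(self, num_classes=27):  # UTD-MHAD defines 27 action classes
        super().__init__()
        self.streams = nn.ModuleList()
        for _ in range(4):
            backbone = models.mobilenet_v2(weights=None)  # in practice: ImageNet pre-trained
            backbone.classifier[1] = nn.Linear(backbone.last_channel, num_classes)
            self.streams.append(backbone)
        # Obfuscated inputs are extremely low resolution (e.g. 14 x 14);
        # upsample to the backbone's expected 224 x 224 input.
        self.upsample = nn.Upsample(size=(224, 224), mode="bilinear",
                                    align_corners=False)

    def forward(self, mhi, dmm_front, dmm_side, dmm_top):
        logits = []
        for stream, x in zip(self.streams,
                             (mhi, dmm_front, dmm_side, dmm_top)):
            logits.append(stream(self.upsample(x)))
        # Late fusion: average the per-stream class scores.
        return torch.stack(logits, dim=0).mean(dim=0)


# Usage with dummy 14 x 14 three-channel inputs (batch of 2):
model = FourStreamActionNet()
inputs = [torch.rand(2, 3, 14, 14) for _ in range(4)]
scores = model(*inputs)   # shape: (2, 27)
```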

Security analysis

Motivated by the recent work of McPherson et al. (2016) demonstrating the risk of re-identification, we propose in this paper to obfuscate the full image information using a position-based superpixel transformation. A scheme is considered secure if no adversary can break it with probability significantly greater than random guessing. We achieve a higher degree of obfuscation by accumulating random noise during the position-based superpixel transformation, which results in an extremely low resolution
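As an illustration of the kind of differential check used in such analyses (not the paper's exact test suite), the standard NPCR and UACI metrics between two obfuscated versions of the same image can be computed as follows; the function name and the reuse of the earlier obfuscate() sketch are assumptions for this example only.

```python
# Illustrative differential metrics between two obfuscated images.
import numpy as np

def npcr_uaci(img1: np.ndarray, img2: np.ndarray):
    """NPCR: percentage of differing pixels. UACI: mean absolute intensity
    difference normalised by 255, as a percentage."""
    a = img1.astype(np.float64)
    b = img2.astype(np.float64)
    npcr = 100.0 * np.mean(a != b)
    uaci = 100.0 * np.mean(np.abs(a - b) / 255.0)
    return npcr, uaci

# e.g. compare two obfuscations of the same secret image:
# npcr, uaci = npcr_uaci(obfuscate(frame), obfuscate(frame))
```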

Data overhead

The existing cloud-based paradigm requires the active involvement of a communication network for data transmission. Therefore, the data overhead should be kept very low so that a real-time system can be supported. State-of-the-art privacy-preserving human action recognition schemes utilize extremely low resolution images to achieve obfuscation, which also reduces the data expansion caused by huge multimedia dimensions. For example, a colored video V with 500 frames, each of size 224 × 224
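To put the saving in perspective, a rough uncompressed back-of-the-envelope comparison for the example above (assuming 8-bit channels, one RGB MHI, and three single-channel DMMs at 14 × 14) is:

```python
# Rough, uncompressed size comparison (8-bit channels assumed).
raw_video_bytes = 500 * 224 * 224 * 3            # 500 RGB frames of 224 x 224
obfuscated_bytes = 14 * 14 * 3 + 3 * 14 * 14     # one RGB MHI + three DMMs
print(raw_video_bytes / 1e6, obfuscated_bytes)   # ~75.26 MB vs 1176 bytes
```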

Discussion

In this paper, we have emphasized the development of a novel approach for secure human action recognition using a position-based superpixel transformation. The primary objective of the proposed approach is to obfuscate image information by resizing it into an extremely low resolution with added noise. This prevents the re-identification attack. A four-stream deep CNN is utilized for recognition, where each stream is based on a pretrained MobileNet. After a series of experiments, we found a

Conclusion

In this paper, a privacy-preserving human action recognition approach has been proposed. We have emphasized the feasibility of performing effective human action recognition with higher security and minimal storage overhead. First, recent vulnerabilities related to image obfuscation are discussed along with existing work and its problems. Next, image obfuscation is formulated by computing the MHI and DMMs of the underlying video sequences and transforming them to extremely low resolution with

Declaration of Competing Interest

All authors declare that they have no conflict of interest regarding the publication of this manuscript.

CRediT authorship contribution statement

Amitesh Singh Rajput: Conceptualization, Methodology, Writing - original draft. Balasubramanian Raman: Supervision, Validation, Resources. Javed Imran: Methodology, Writing - review & editing, Validation.

Acknowledgments

We would like to thank the editor and external reviewers for their thoughtful and detailed comments on our paper. We would also like to thank the Information Security Education and Awareness (ISEA) Project (phase II), MeitY, Government of India, for the necessary support.

References (42)

  • M. Ziaeefard et al., Semantic human activity recognition: A literature review, Pattern Recognition (2015).
  • A.F. Bobick et al., The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence (2001).
  • E. Bresson et al., A simple public-key cryptosystem with a double trapdoor decryption mechanism and its applications, International Conference on the Theory and Application of Cryptology and Information Security (2003).
  • C. Chen et al., UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor, International Conference on Image Processing (ICIP) (2015).
  • L. Chen et al., SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning, Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
  • E. Chou et al., Privacy-preserving action recognition for smart hospitals using low-resolution depth images.
  • J. Dai et al., Towards privacy-preserving recognition of human activities, International Conference on Image Processing (ICIP) (2015).
  • J. Deng et al., ImageNet: A large-scale hierarchical image database, Conference on Computer Vision and Pattern Recognition (CVPR) (2009).
  • N.E.D. Elmadany et al., Information fusion for human action recognition via biset/multiset globality locality preserving canonical correlation analysis, IEEE Transactions on Image Processing (2018).
  • S. Haas et al., Superpixel-based interest points for effective bags of visual words medical image retrieval, MICCAI International Workshop on Medical Content-Based Retrieval for Clinical Decision Support (2011).
  • K. He et al., Deep residual learning for image recognition, Conference on Computer Vision and Pattern Recognition (CVPR) (2016).